
Hi,

I'm currently working on a program that parses a large binary file and produces various textual outputs extracted from it. Simple enough.

But: since we're talking large amounts of data, I'd like to have reasonable performance.

Reading the binary file is very efficient thanks to Data.Binary. However, output is a different matter. Currently, my code looks something like:

    summarize :: Foo -> ByteString
    summarize f = let f1 = accessor f
                      f2 = expression f
                      :
                  in B.concat [f1,pack "\t",pack (show f2),...]

which isn't particularly elegant, and builds a temporary ByteString that usually only gets passed to B.putStrLn. I can suffer the inelegance were it only fast - but this ends up taking the better part of the execution time.

I tried to use lazy ByteStrings, the theory being that the components that already are (strict) ByteStrings could be recycled as chunks. I also tried to push the output down into the function (summarize :: Foo -> IO ()), but both of these were actually slower.

Since I surely can't be the first person who needs to output tab-separated text, I'd be grateful if somebody could point me in the right direction.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

On Mon, Feb 9, 2009 at 12:49 PM, Ketil Malde wrote:
> Reading the binary file is very efficient thanks to Data.Binary. However, output is a different matter. Currently, my code looks something like:
>
>     summarize :: Foo -> ByteString
>     summarize f = let f1 = accessor f
>                       f2 = expression f
>                       :
>                   in B.concat [f1,pack "\t",pack (show f2),...]
>
> which isn't particularly elegant, and builds a temporary ByteString that usually only gets passed to B.putStrLn. I can suffer the inelegance were it only fast - but this ends up taking the better part of the execution time.

Is building the strict ByteString what takes the most time? If so, you might want to use `writev` to avoid extra copying.

Does your data support incremental processing so that you could produce output before all input has been parsed?

Cheers,
Johan

Johan Tibell writes:
> Is building the strict ByteString what takes the most time?

Yes.

> If so, you might want to use `writev` to avoid extra copying.

Is there a Haskell binding somewhere, or do I need to FFI the system call? Googling 'writev haskell' didn't turn up anything useful.

> Does your data support incremental processing so that you could produce output before all input has been parsed?

Typically, yes.

-k
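
Since the records can be processed incrementally, the lazy-ByteString idea from the original post can be sketched as follows. This is only an illustration, not code from the thread: `rowChunks` and `rowsToLazy` are hypothetical names, and each row is assumed to arrive as a list of already-strict fields. The fields are reused as chunks without copying, and `L.putStr` can write the result as it is produced. One plausible (unmeasured) reason this approach turned out slower is that it creates a very large number of tiny chunks.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as L
import Data.List (intersperse)

-- Separators are built once and shared by every row.
tab, newline :: B.ByteString
tab = B.singleton '\t'
newline = B.singleton '\n'

-- Chunks for one row: the fields themselves, interleaved with tabs and
-- terminated by a newline. No field is copied. (Hypothetical helper.)
rowChunks :: [B.ByteString] -> [B.ByteString]
rowChunks fields = intersperse tab fields ++ [newline]

-- Lazily assemble all rows into one lazy ByteString; rows are only
-- demanded as the consumer (for example L.putStr) reaches them.
rowsToLazy :: [[B.ByteString]] -> L.ByteString
rowsToLazy = L.fromChunks . concatMap rowChunks
```

Typical use would be `L.putStr (rowsToLazy rows)`, where `rows` is produced lazily by the parser.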

On Mon, Feb 9, 2009 at 1:22 PM, Ketil Malde wrote:
> Johan Tibell writes:
>> If so, you might want to use `writev` to avoid extra copying.
>
> Is there a Haskell binding somewhere, or do I need to FFI the system call? Googling 'writev haskell' didn't turn up anything useful.

To my knowledge there's no binding out there. We will include one for sockets in the next release of network-bytestring. You might find the code here useful if you want to write your own:

http://github.com/tibbe/network-bytestring/blob/c13d8fab5179e6afbcdebac95d49...

Cheers,
Johan

Hello Ketil,

Monday, February 9, 2009, 2:49:05 PM, you wrote:
>   in B.concat [f1,pack "\t",pack (show f2),...]
> inelegance were it only fast - but this ends up taking the better part of the execution time.

I'm not a BS expert, but it seems that you produce Strings using show and then convert them to BS. Of course this is inefficient: you need to replace show with a BS analog.

--
Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com

+1; it's obviously the packing that causes sloth.
Memoize the `pack "\t"` etc. stuff, and write bytestring replacements for show for your data.
I guess you can use the Put monad instead of B.concat for that, by the way.
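
A minimal sketch of the two suggestions above: a separator that is built once, and a `show` replacement that writes digits directly into a ByteString. `tab`, `showIntBS` and `summarizeRow` are hypothetical names, the field type is assumed to be `Int`, and none of this has been benchmarked against `pack . show`.

```haskell
import qualified Data.ByteString.Char8 as B
import Data.Char (intToDigit)

-- Shared separator, built once instead of calling pack "\t" for every row.
tab :: B.ByteString
tab = B.singleton '\t'

-- Hypothetical replacement for (pack . show) on Int: produces the decimal
-- digits without an intermediate String. Going through Integer keeps
-- negation safe for minBound.
showIntBS :: Int -> B.ByteString
showIntBS n
  | n < 0     = B.cons '-' (digits (negate (toInteger n)))
  | otherwise = digits (toInteger n)
  where
    digits :: Integer -> B.ByteString
    digits 0 = B.singleton '0'
    -- unfoldr yields the least significant digit first, hence the reverse.
    digits m = B.reverse (B.unfoldr step m)

    step :: Integer -> Maybe (Char, Integer)
    step 0 = Nothing
    step k = case k `quotRem` 10 of
      (q, r) -> Just (intToDigit (fromInteger r), q)

-- One tab-separated row made of a name and a count (illustrative only).
summarizeRow :: B.ByteString -> Int -> B.ByteString
summarizeRow name count = B.concat [name, tab, showIntBS count]
```

Later releases of the bytestring package also provide `Data.ByteString.Builder` with ready-made numeric encoders such as `intDec`, which may remove the need for a hand-written version.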

Bulat Ziganshin writes:
>>   in B.concat [f1,pack "\t",pack (show f2),...]
>
> I'm not a BS expert, but it seems that you produce Strings using show and then convert them to BS. Of course this is inefficient: you need to replace show with a BS analog.

Do these analogous functions exist, or must I roll my own? I've also looked a bit at Data.Binary.Builder; perhaps this is the way to go? Will look more closely.

-k

On Mon, 2009-02-09 at 12:49 +0100, Ketil Malde wrote:
> Hi,
>
> I'm currently working on a program that parses a large binary file and produces various textual outputs extracted from it. Simple enough.
>
> But: since we're talking large amounts of data, I'd like to have reasonable performance.
>
> Reading the binary file is very efficient thanks to Data.Binary. However, output is a different matter. Currently, my code looks something like:

Have you considered using Data.Binary to output the data too? It has a pretty efficient underlying monoid for accumulating output data in a buffer. You'd want some wrapper functions over the top to make it a bit nicer for your use case, but it should work and should be quick.

It generates a lazy bytestring, but does so with a few large chunks so the IO will still be quick.

Duncan
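
The monoid mentioned here is `Builder` from `Data.Binary.Builder`. A minimal sketch of tab-separated output on top of it might look like the following; the `Foo` record and its fields are hypothetical stand-ins for the poster's type, and the numeric field still goes through `show` for brevity.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Binary.Builder (Builder, fromByteString, singleton, toLazyByteString)

-- Hypothetical record standing in for the Foo of the original post.
data Foo = Foo { fooName :: B.ByteString, fooCount :: Int }

-- Single-byte separators (ASCII 9 is tab, ASCII 10 is newline).
tab, newline :: Builder
tab = singleton 9
newline = singleton 10

-- One output line as a Builder; strict fields are appended to the
-- builder's buffer rather than concatenated into a temporary ByteString.
summarize :: Foo -> Builder
summarize f = mconcat
  [ fromByteString (fooName f)
  , tab
  , fromByteString (B.pack (show (fooCount f)))  -- still uses show; a sketch only
  , newline
  ]

-- All lines combined into one lazy ByteString.
render :: [Foo] -> L.ByteString
render = toLazyByteString . mconcat . map summarize
```

The result would be written with `L.putStr (render foos)`.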

Duncan Coutts writes:
> Have you considered using Data.Binary to output the data too? It has a pretty efficient underlying monoid for accumulating output data in a buffer. You'd want some wrapper functions over the top to make it a bit nicer for your use case, but it should work and should be quick.

I've used Data.Binary.Builder to generate the output, which is quite nice as an interface. Currently, I've managed to shave a few percent off the time - nothing radical yet, but there's a lot of room for tuning various convenience functions in there.

-k

ketil:
> Hi,
>
> I'm currently working on a program that parses a large binary file and produces various textual outputs extracted from it. Simple enough.
>
> But: since we're talking large amounts of data, I'd like to have reasonable performance.
>
> Reading the binary file is very efficient thanks to Data.Binary. However, output is a different matter. Currently, my code looks something like:
>
>     summarize :: Foo -> ByteString
>     summarize f = let f1 = accessor f
>                       f2 = expression f
>                       :
>                   in B.concat [f1,pack "\t",pack (show f2),...]
>
> which isn't particularly elegant, and builds a temporary ByteString that usually only gets passed to B.putStrLn. I can suffer the inelegance were it only fast - but this ends up taking the better part of the execution time.

Why not use Data.Binary for output too? It is rather efficient at output -- using a continuation-like system to fill buffers gradually.

-- Don
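
As a rough illustration of output through Data.Binary, here is a sketch using the `Put` monad from `Data.Binary.Put`. `putRow` and `renderRows` are hypothetical helpers, and rows are assumed to be lists of strict ByteString fields; this is not code from the thread.

```haskell
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Binary.Put (Put, putByteString, putWord8, runPut)

-- Write one tab-separated row followed by a newline.
-- (ASCII 9 is tab, ASCII 10 is newline.)
putRow :: [B.ByteString] -> Put
putRow [] = putWord8 10
putRow (x:xs) = do
  putByteString x
  mapM_ (\y -> putWord8 9 >> putByteString y) xs
  putWord8 10

-- Run the writer over all rows, yielding a lazy ByteString.
renderRows :: [[B.ByteString]] -> L.ByteString
renderRows = runPut . mapM_ putRow
```

Usage would be along the lines of `L.putStr (renderRows rows)`.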
participants (6)
- Bulat Ziganshin
- Don Stewart
- Duncan Coutts
- Eugene Kirpichov
- Johan Tibell
- Ketil Malde