newbie question about list performance

Hello, I've been following the list optimization thread with great interest, as it pertains to something I'm working on at the moment. I'm working with moderate-sized files (tens to hundreds of MBs) that have some ascii header data followed by a bunch of 32-bit ints. I can read the files into a lazy ByteString (and parse the header), but I'd like some advice as to the best data type to convert the ints into. Ideally, I'd have some functions like this: decode :: ByteString -> (FileFormat, [Int32]) encode :: FileFormat -> [Int32] -> ByteString but I don't know if Int32 is actually the best choice. It seems to me that something like a lazy list of strict arrays (analogous to a lazy bytestring) would be better. Is there a library like this already? Or is this a case of premature optimization, and I should just try the list and see if it's good enough? Any suggestions would be appreciated. Also, I'd like to let the maintainers and implementors know that I really appreciate the work that's been done on optimizing Haskell. I haven't used Haskell much yet, but I've fallen in love with the language and it's great to see that performance even for heavy I/O tasks can be comparable to or exceed C. Thank you, John Lato

jwlato:
Hello, I've been following the list optimization thread with great interest, as it pertains to something I'm working on at the moment. I'm working with moderate-sized files (tens to hundreds of MBs) that have some ascii header data followed by a bunch of 32-bit ints. I can read the files into a lazy ByteString (and parse the header), but I'd like some advice as to the best data type to convert the ints into. Ideally, I'd have some functions like this: decode :: ByteString -> (FileFormat, [Int32]) encode :: FileFormat -> [Int32] -> ByteString
but I don't know if Int32 is actually the best choice. It seems to me that something like a lazy list of strict arrays (analogous to a lazy bytestring) would be better. Is there a library like this already? Or is this a case of premature optimization, and I should just try the list and see if it's good enough? Any suggestions would be appreciated.
could you use Data.Binary.encode/decode (with custom put and get instances)? They read fomr lazy bytestrings into custom structures, such as arrays. http://hackage.haskell.org/cgi-bin/hackage-scripts/package/binary-0.4.1
Also, I'd like to let the maintainers and implementors know that I really appreciate the work that's been done on optimizing Haskell. I haven't used Haskell much yet, but I've fallen in love with the language and it's great to see that performance even for heavy I/O tasks can be comparable to or exceed C.
Yay! -- Don

John Lato wrote:
I'm working with moderate-sized files (tens to hundreds of MBs) that have some ascii header data followed by a bunch of 32-bit ints.
but I don't know if [Int32] is actually the best choice. It seems to me that something like a lazy list of strict arrays (analogous to a lazy bytestring) would be better.
Depends on your data access pattern. If you access the words strictly linearly, from the beginning of the file to the end, and that's all, then [Int32] is absolutely fine. A list is a data-structure equivalent of a for loop; it's the correct structure if you are dealing with things linearly or nearly-linearly. If you were using adjacent words together, that would be fine too (as in, e.g., zip xs (tail xs)). If your data access pattern is more scattered or random-access in style, then [Int32] does not scale well to 10s of MBs. If you keep the data around, the overhead for [] is inappropriate (around 600-800% memory usage overhead on [Int32]) and its performance guarantees are not good either, for random access. In this case, as a first approximation, I would be inclined to try a library which simple backended onto lazy bytestring. For example the 'index' operation to fetch a single word would fetch four bytes and bit-twiddle them into a word. If that doesn't give the high speed you're after, then perhaps something *like* LBS, i.e. foreignptr behind the scenes, but directly accessing word-at-a-time. Jules
participants (3)
-
Don Stewart
-
John Lato
-
Jules Bean