
On Tue, Nov 10, 2009 at 8:20 PM, Gokul P. Nair wrote:
--- On Sat, 11/7/09, Don Stewart wrote:

General notes:
* unpack is almost always wrong.
* list indexing with !! is almost always wrong.
* words/lines are often wrong for parsing large files (they build large list structures).
* toList/fromList probably aren't the best strategy
* sortBy (comparing snd)
* use insertWith'

Specifically, avoid constructing intermediate lists when you can process the entire file in a single pass. Use O(1) bytestring substring operations like take and drop.
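
As a rough illustration of the notes above, here is a minimal sketch of a single-pass token count over a strict ByteString. It is not the code from this thread: the name countTokens and the whitespace-splitting rule are assumptions made for the example. It also uses Data.Map.Strict.insertWith as a stand-in for insertWith', which current versions of containers no longer export. The tokens are produced by dropWhile and break, which return slices of the input buffer rather than copies, so no list of words is built.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as B
import qualified Data.Map.Strict as M
import Data.Char (isSpace)

-- Hypothetical helper: count whitespace-separated tokens in one pass.
-- Each token is a slice of the original buffer; nothing is unpacked.
countTokens :: B.ByteString -> M.Map B.ByteString Int
countTokens = go M.empty
  where
    go !acc bs
      | B.null rest = acc
      | otherwise   = go (M.insertWith (+) tok 1 acc) rest'
      where
        rest         = B.dropWhile isSpace bs   -- skip leading whitespace
        (tok, rest') = B.break isSpace rest     -- split off the next token

main :: IO ()
main = do
  let counts = countTokens (B.pack "a b a  c\nb a")
  -- Print each token with its count, tab-separated, without unpacking.
  mapM_ (\(w, n) -> B.putStrLn (B.concat [w, B.pack "\t", B.pack (show n)]))
        (M.toList counts)
```

The accumulator is forced with a bang pattern and the map is the strict variant, so the counts do not pile up as unevaluated thunks on a large input.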
Thanks all for the valuable feedback. Switching from Regex.Posix to Regex.PCRE alone reduced the running time to about 6 secs and a few other optimizations suggested on this thread brought it down to about 5 secs ;)
I then set out to profile the code out of curiosity to see where the bulk of the time was being spent and sure enough the culprit turned out to be "unpack". My question therefore is, given a list L1 of type [(ByteString, Int)], how do I print it out so as to eliminate the "chunk, empty" markers associated with a bytestring? The suggestions posted here are along the lines of "mapM_ print L1" but that's far from desirable especially because the generated output is for perusal by non-technical users etc.
Thanks.
Take a look at Data.ByteString.Lazy.Char8.putStrLn. That prints a lazy ByteString without unpacking it, and without the internal markers.

Sincerely,
Brad
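
A minimal sketch of how that might look for a list of type [(ByteString, Int)], assuming the ByteStrings are lazy. The helper names formatEntry and printEntries and the tab-separated layout are made up for the example; only the Int goes through show, and the ByteString itself is never unpacked.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L

-- Hypothetical helper: render one pair as "word<TAB>count".
-- The word is reused as-is; only the count is converted via show.
formatEntry :: (L.ByteString, Int) -> L.ByteString
formatEntry (w, n) = L.concat [w, L.pack "\t", L.pack (show n)]

-- Print each pair on its own line, with no Show-style markers.
printEntries :: [(L.ByteString, Int)] -> IO ()
printEntries = mapM_ (L.putStrLn . formatEntry)

main :: IO ()
main = printEntries [(L.pack "haskell", 3), (L.pack "bytestring", 2)]
```

For a long list, building the whole output with L.unlines (map formatEntry xs) and writing it with a single L.putStr is another option worth trying.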