
ekmett:
Hi Anakim, nice to see someone else working in this space. I have also been working on a set of parallel parsing techniques that can use small Parsec parsers for local context sensitivity. See the second set of slides in http://comonad.com/reader/2009/iteratees-parsec-and-monoid/ for an overview of how I'm doing something similar to feed Parsec independent chunks. Note that this approach bypasses the need for a separate sequential scan, which would otherwise flood your cache, and lets you get closer to the performance limit imposed by Amdahl's law.

The code in the second set of slides can be adapted to your case: load everything into a lazy bytestring or a fingertree of strict bytestrings, then, for each strict bytestring chunk in parallel, scan for the first newline and start an iteratee-based Parsec parser from that point. I use iteratee-based Parsec parsers so that, when I glue the partial parses together, I can feed the unparsed data to the left of the first newline in each chunk to the parser I'm joining with on the left. I provide a monoid for gluing these partial parses together, which encapsulates this behavior.
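To make the gluing step concrete, here is a minimal sketch of the chunk-and-glue idea using plain strict bytestrings and a trivial stand-in for the per-record parser. It deliberately skips the iteratee machinery, and `Record`, `parseRecord`, and the chunking are simplified placeholders for illustration rather than the code from the slides:

```haskell
import qualified Data.ByteString as B
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (parMap, rdeepseq)

type Record = B.ByteString

-- Stand-in for a small per-record parser (the slides use iteratee-based
-- Parsec parsers here); for the sketch, a record is simply a non-empty line.
parseRecord :: B.ByteString -> [Record]
parseRecord l
  | B.null l  = []
  | otherwise = [l]

-- The partial result of parsing one chunk: the unparsed fragment before its
-- first newline, the records wholly inside it, and the unparsed fragment
-- after its last newline.
data Partial
  = NoNewline !B.ByteString
  | Partial !B.ByteString [Record] !B.ByteString

instance NFData Partial where
  rnf (NoNewline f)    = rnf f
  rnf (Partial l cs r) = rnf l `seq` rnf cs `seq` rnf r

-- Gluing two partial parses: the right fragment of the left chunk and the
-- left fragment of the right chunk join into one complete record.
instance Semigroup Partial where
  NoNewline a      <> NoNewline b      = NoNewline (a <> b)
  NoNewline a      <> Partial l c r    = Partial (a <> l) c r
  Partial l c r    <> NoNewline b      = Partial l c (r <> b)
  Partial l1 c1 r1 <> Partial l2 c2 r2 =
    Partial l1 (c1 ++ parseRecord (r1 <> l2) ++ c2) r2

instance Monoid Partial where
  mempty = NoNewline B.empty

-- Parse one chunk in isolation, splitting on '\n' (byte 10).
toPartial :: B.ByteString -> Partial
toPartial bs = case B.split 10 bs of
  []       -> NoNewline B.empty
  [frag]   -> NoNewline frag                 -- chunk holds no newline at all
  (l:rest) -> Partial l (concatMap parseRecord (init rest)) (last rest)

-- Cut the input into fixed-size strict chunks (O(1) slices, no copying).
chunksOf :: Int -> B.ByteString -> [B.ByteString]
chunksOf n bs
  | B.null bs = []
  | otherwise = let (h, t) = B.splitAt n bs in h : chunksOf n t

-- Parse all chunks in parallel and glue the partial results with mconcat.
parseAll :: Int -> B.ByteString -> [Record]
parseAll chunkSize bs =
  finish (mconcat (parMap rdeepseq toPartial (chunksOf chunkSize bs)))
  where
    finish (NoNewline f)    = parseRecord f
    finish (Partial l cs r) = parseRecord l ++ cs ++ parseRecord r
```

The `Semigroup` instance is associative, so the chunks can be combined in any grouping (or in a parallel tree reduction) and still yield the same record list.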
You can get quite a long way by using bytestring-mmap and strict bytestrings: the former keeps your IO overhead low, and the latter ensures you don't migrate work needlessly.
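As a small, self-contained sketch of that setup, something along these lines mmaps the input and hands strict, zero-copy chunk slices to parallel sparks; the file name, chunk size, and the newline-counting payload are all placeholders I've made up for illustration:

```haskell
import qualified Data.ByteString as B
import Control.Parallel.Strategies (parMap, rdeepseq)
import System.IO.Posix.MMap (unsafeMMapFile)   -- from the bytestring-mmap package

-- Count newlines per 1 MiB chunk in parallel; this stands in for real
-- per-chunk parsing work. Each strict chunk is an O(1) slice of the
-- mmapped region, so no bytes are copied before the work starts.
main :: IO ()
main = do
  bs <- unsafeMMapFile "input.log"             -- hypothetical input file
  let chunks = chunksOf (1024 * 1024) bs
      counts = parMap rdeepseq (B.count 10) chunks
  print (sum counts)

chunksOf :: Int -> B.ByteString -> [B.ByteString]
chunksOf n bs
  | B.null bs = []
  | otherwise = let (h, t) = B.splitAt n bs in h : chunksOf n t
```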