Parallel parsing & multicore

Hi,

I've put together a parser that works like this:

- the full input is read into a strict ByteString;
- that string is split into a number of lines;
- each line is fed (independently) to a parser based on Parsec (version 3).

Running the serial version of the parser (compiled with -O2) on a rather large file, I get the following:

  $ ./Parser +RTS -s
  2,175,464,100 bytes allocated in the heap
     49,293,632 bytes copied during GC
      5,297,560 bytes maximum residency (6 sample(s))
      1,406,800 bytes maximum slop
             16 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4140 collections,  0 parallel,  0.39s,  0.42s elapsed
  Generation 1:     6 collections,  0 parallel,  0.05s,  0.07s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.93s  (  1.99s elapsed)
  GC    time    0.44s  (  0.48s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.37s  (  2.47s elapsed)

  %GC time      18.7%  (19.5% elapsed)

  Alloc rate    1,127,336,508 bytes per MUT second

  Productivity  81.2% of total user, 78.1% of total elapsed

The profile report:

  COST CENTRE     MODULE                  %time %alloc
  =_a6OH          Text.Parsec.Prim         20.6   13.0
  tokenPrimEx     Text.Parsec.Prim         16.5   20.8
  fmap_a6Qj       Text.Parsec.Prim         15.5    3.6
  parserBind      Text.Parsec.Prim         12.4    9.3
  mplus_a6KS      Text.Parsec.Prim         10.3   10.6
  notFollowedBy   Text.Parsec.Combinator    3.1    4.4
  [...]

shows that the biggest part of the execution time is spent in the parser (not in the read and line-split operations), as expected.

Given that each line is parsed independently, I used the following code:

  map parse lines `using` parListChunk 250 rnf

to spark a separate computation for each group of 250 lines. I made the datatype used by the parser an instance of NFData, so that the "rnf" strategy should force a complete evaluation.

The threaded version running on 2 cores is moderately faster than the serial one:

  $ ./Parser +RTS -s -N2
  2,377,165,256 bytes allocated in the heap
     36,320,944 bytes copied during GC
      6,020,720 bytes maximum residency (6 sample(s))
      6,933,928 bytes maximum slop
             21 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  2410 collections,  0 parallel,  0.33s,  0.34s elapsed
  Generation 1:     6 collections,  4 parallel,  0.06s,  0.05s elapsed

  Parallel GC work balance: 1.83 (2314641 / 1265968, ideal 2)

  Task  0 (worker) :  MUT time:  2.43s  (  1.19s elapsed)
                      GC  time:  0.02s  (  0.02s elapsed)
  Task  1 (worker) :  MUT time:  2.15s  (  1.19s elapsed)
                      GC  time:  0.29s  (  0.30s elapsed)
  Task  2 (worker) :  MUT time:  2.37s  (  1.19s elapsed)
                      GC  time:  0.07s  (  0.08s elapsed)
  Task  3 (worker) :  MUT time:  2.45s  (  1.19s elapsed)
                      GC  time:  0.00s  (  0.00s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.06s  (  1.19s elapsed)
  GC    time    0.39s  (  0.39s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.45s  (  1.58s elapsed)

  %GC time      15.7%  (24.9% elapsed)

  Alloc rate    1,151,990,234 bytes per MUT second

  Productivity  84.2% of total user, 130.2% of total elapsed

The speedup is smaller than I was expecting, given that each unit of work (250 input lines) is completely independent of the others. Changing the size of each work unit did not help; garbage collection times are small enough that increasing the minimum heap size did not produce any speedup either.

Is there anything else I can do to understand why the parallel map does not provide a significant speedup?

Thanks,
AB
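A minimal, self-contained sketch of this setup. The Rule type, lineParser, and the input file name are illustrative placeholders rather than the project's actual code, and it uses the parallel-1.x Strategies API from the message above, in which rnf itself is a Strategy (newer versions of the parallel package spell this rdeepseq):

  import qualified Data.ByteString.Char8 as B
  import Control.Parallel.Strategies (using, parListChunk, rnf, NFData(..))
  import Text.Parsec
  import Text.Parsec.ByteString (Parser)

  -- A toy result type: a target and its dependencies.
  data Rule = Rule String [String]

  instance NFData Rule where
    rnf (Rule t ds) = rnf t `seq` rnf ds

  -- A toy per-line parser: "target: dep1 dep2 ...".
  lineParser :: Parser Rule
  lineParser = do
    t  <- many1 (noneOf ":")
    _  <- char ':'
    ds <- many anyChar
    return (Rule t (words ds))

  parseLine :: B.ByteString -> Maybe Rule
  parseLine = either (const Nothing) Just . parse lineParser ""

  main :: IO ()
  main = do
    input <- B.readFile "big.mk"
    -- One spark per group of 250 lines, each forced fully with rnf.
    let results = map parseLine (B.lines input)
                    `using` parListChunk 250 rnf
    print (length [ r | Just r <- results ])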

akborder:
The threaded version running on 2 cores is moderately faster than the serial one:
The speedup is smaller than I was expecting, given that each unit of work (250 input lines) is completely independent of the others. Changing the size of each work unit did not help; garbage collection times are small enough that increasing the minimum heap size did not produce any speedup either.
Is there anything else I can do to understand why the parallel map does not provide a significant speedup?
Very interesting idea!

I think the big thing would be to measure it with GHC HEAD so you can see how effectively the sparks are being converted into threads.

Is there a package and test case somewhere we can try out?

-- Don

Very interesting idea!
I think the big thing would be to measure it with GHC HEAD so you can see how effectively the sparks are being converted into threads.
Is there a package and test case somewhere we can try out?
At this point the parser is just a proof of concept. For those brave enough, however, I've put the code on github:

  http://github.com/akborder/HsMakefileParser/

The "test.mk" file provides some test cases. To get an input big enough to measure multi-thread performance, you can concatenate that file a few thousand times: the timings in my previous message were obtained by parsing a 3000x concatenation (final size: 1.1 MB).
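One way to build such an input from the shell (big.mk is just an illustrative name), matching the commands shown earlier in the thread:

  $ for i in $(seq 3000); do cat test.mk; done > big.mk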

Hi Anakim,
Nice to see someone else working in this space.
I have also been working on a set of parallel parsing techniques, which can
use small Parsec parsers for local context sensitivity.
See the second set of slides in
http://comonad.com/reader/2009/iteratees-parsec-and-monoid/ for an overview
of how I'm doing something similar to feed Parsec independent chunks. Note
that this approach bypasses the need for a separate sequential scan, which
otherwise floods your cache, and lets you get closer to the performance
limit imposed by Amdahl's law.
The code in the second set of slides can be adapted to your case: load
everything into a lazy bytestring or fingertree of strict bytestrings, then
for each strict bytestring chunk in parallel, scan it for the first newline,
and then start an iteratee based parsec parser from that point. I use the
iteratee based parsec parsers so that when I glue the partial parses
together I can feed the unparsed data on the left side of the first newline
in each chunk to the parser I'm joining on the left. I provide a monoid for
the purpose of gluing together these partial parses, which encapsulates this
behavior.
The fingertree case is particularly nice, because it can be used to do cheap incremental reparsing with the same machinery, based on out-of-band updates to the input.
This approach is sufficient to parse a lot of interesting languages. As you have noted with Makefiles, it can handle indentation-based control structures, and with a variation on a Dyck-language monoid it can be extended to Haskell-style layout or parenthesis matching/Lisp parsing by applying the same techniques at a higher level.
-Edward Kmett
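A rough sketch of this chunk-and-glue idea, with illustrative names only (this is not the actual iteratee/monoid machinery from the slides): each chunk keeps its raw prefix and suffix, and merging two chunks parses the single line that straddles their boundary.

  import qualified Data.ByteString.Char8 as B

  -- A partially parsed chunk: either no newline was found at all, or we
  -- have the raw prefix before the first newline, the complete interior
  -- lines already parsed, and the raw suffix after the last newline.
  data Par a = NoBreak B.ByteString
             | Par B.ByteString [a] B.ByteString

  -- Parse one chunk independently: everything between the first and last
  -- newline consists of complete lines; the ends stay raw for the merge.
  chunk :: (B.ByteString -> a) -> B.ByteString -> Par a
  chunk p bs = case B.elemIndex '\n' bs of
    Nothing -> NoBreak bs
    Just i  ->
      let l    = B.take i bs
          rest = B.drop (i + 1) bs
      in case B.split '\n' rest of
           []     -> Par l [] B.empty   -- chunk ended right at the newline
           pieces -> Par l (map p (init pieces)) (last pieces)

  -- The associative glue: the left chunk's suffix and the right chunk's
  -- prefix together form exactly one complete line.
  merge :: (B.ByteString -> a) -> Par a -> Par a -> Par a
  merge _ (NoBreak x)    (NoBreak y)    = NoBreak (B.append x y)
  merge _ (NoBreak x)    (Par l ys r)   = Par (B.append x l) ys r
  merge _ (Par l xs r)   (NoBreak y)    = Par l xs (B.append r y)
  merge p (Par l1 xs r1) (Par l2 ys r2) =
    Par l1 (xs ++ p (B.append r1 l2) : ys) r2

Folding merge over the per-chunk results (each chunk sparked in parallel) yields one Par for the whole input; its final prefix and suffix are the first and last lines, which the caller parses directly.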

ekmett:
I have also been working on a set of parallel parsing techniques, which can use small Parsec parsers for local context sensitivity. [...] I provide a monoid for the purpose of gluing together these partial parses, which encapsulates this behavior.
You can get quite a long way by using bytestring-mmap and strict bytestrings. The first ensures your IO overhead will be low, and the second ensures you don't migrate work needlessly.
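For reference, a minimal use of the strict interface (unsafeMMapFile is the entry point of bytestring-mmap; the file name is illustrative):

  import System.IO.Posix.MMap (unsafeMMapFile)
  import qualified Data.ByteString.Char8 as B

  main :: IO ()
  main = do
    -- Maps the whole file into memory as a single strict ByteString,
    -- avoiding read(2) copies.
    bs <- unsafeMMapFile "big.mk"
    print (B.count '\n' bs)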

On Wed, Sep 9, 2009 at 12:16 PM, Don Stewart
You can get quite a long way by using bytestring-mmap and strict bytestrings. The first ensures your IO overhead will be low, and the second ensures you don't migrate work needlessly.
I'm actually using bytestring-mmap in a toy compiler which uses these techniques, though at present I'm using System.IO.Posix.MMap.Lazy rather than the strict version.

As for migrating work, maybe I'm not understanding your point, but my whole goal was to migrate the chunking and parsing out to other cores, so the behavior of parMap rwhnf'ing the monoidal reduction over the chunks seems to be exactly what I want. I could perhaps get better results by trying to assign the same thread contiguous ranges of the input, though, and controlling the choice of core used to merge chunks more intelligently, maybe by mixing up pseq and par to group runs onto the same core where possible.

Could you elaborate?

-Edward
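A sketch of the pseq/par mix mentioned here, applied to a monoidal reduction (illustrative, not code from the compiler in question):

  import Control.Parallel (par, pseq)
  import Data.Monoid (Monoid(..))

  -- Spark the right half while reducing the left half on the current
  -- thread, so adjacent chunks tend to be merged by the same core.
  reduceTree :: Monoid m => [m] -> m
  reduceTree []  = mempty
  reduceTree [x] = x
  reduceTree xs  = r `par` (l `pseq` mappend l r)
    where
      (ls, rs) = splitAt (length xs `div` 2) xs
      l = reduceTree ls
      r = reduceTree rs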

ekmett:
As for migrating work, maybe I'm not understanding your point, but my whole goal was to migrate the chunking and parsing out to other cores [...] Could you elaborate?
Oh, I just find strict structures easier when determining in which thread work is happening. Appropriate use of rnf and friends is also fine.

-- Don

Hi Edward,

I read your slides a few weeks ago when they were posted on Reddit and found your approach very interesting. In fact, it provided the inspiration for the parser I've written.

I did not use the same strategy, however, because parsing makefiles poses its own challenges. The make syntax is actually the union of two different languages, one for defining targets and one for commands. It's easy to identify commands following target definitions (they belong to lines starting with a tab), but if you jump into the middle of a makefile you can't decide, for example, whether you are inside a define block (which should be parsed like a command), because the lines in its body are not marked by any specific prefix. Thus the need for a sequential scan. All this may seem rather involved and specific to an unusual format, but one encounters the same problem when parsing here-documents in shell scripts or multi-line (triple-quoted) strings in Python.

The cache-thrashing effect you mention is certainly an issue I did not foresee. I guess a possible solution would be to base parMap on a different 'par' primitive: one that sparks a number of computations equal to the number of available OS threads and then blocks until at least one of them is done.

-- AB
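A sketch of that bounded 'par' idea, built from plain threads and a quantity semaphore rather than sparks (boundedParMap is a hypothetical name, not an existing library function; evaluate forces only to WHNF, so a full forcing would still need rnf):

  import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
  import Control.Concurrent.QSem (newQSem, waitQSem, signalQSem)
  import Control.Exception (evaluate)

  -- Run at most n computations at once; submission blocks until a
  -- worker slot frees up.
  boundedParMap :: Int -> (a -> b) -> [a] -> IO [b]
  boundedParMap n f xs = do
    sem  <- newQSem n
    vars <- mapM (\x -> do
              waitQSem sem
              v <- newEmptyMVar
              _ <- forkIO $ do
                     r <- evaluate (f x)   -- forces to WHNF only
                     putMVar v r
                     signalQSem sem
              return v) xs
    mapM takeMVar vars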

Actually, you can still do the same trick; you just need a smarter delimiter -- ironically the same one I use in Kata. ;) Look for non-backslashed newlines rather than newlines in general, and start there.
You need a little bit of book-keeping in case a backslash and newline straddle the chunk boundary, but it bundles up and gets hidden from you by the monoid.
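A sketch of that smarter delimiter scan (firstHardNewline is an illustrative name; for simplicity it ignores the possibility that the backslash itself is escaped):

  import qualified Data.ByteString.Char8 as B

  -- Index of the first newline not preceded by a backslash, if any.
  firstHardNewline :: B.ByteString -> Maybe Int
  firstHardNewline bs = go 0
    where
      go k = case B.elemIndex '\n' (B.drop k bs) of
        Nothing -> Nothing
        Just i
          | j > 0 && B.index bs (j - 1) == '\\' -> go (j + 1)
          | otherwise                           -> Just j
          where j = k + i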
As for triple quoting or mixed-mode languages, I have three options that I've explored:
1.) I have a set of parser combinators that I've been working on that parse fully monoidally; they are sufficient for context-free attribute grammars and can tackle that problem head on. But you give up the monadic sugar that makes Parsec so syntactically sweet. (Though I need to change its name, since another project named parsimony just appeared on Hackage!)
2.) You can define a reversible lexer, whereby given a list of tokens you
can generate the original input string, but then you spend a lot of time
gluing string fragments together when you started with a perfectly good
bytestring.
3.) My favorite option: using the iteratee-backed Parsec parser from the post, you can 'slice' the input bytestring between two cursor positions efficiently. When you start parsing in a superposition of states, and the only marker flips your mode rather than cleanly starting or ending it (especially when one mode is boring, like a triple-quoted string), just mark the location of the start of your lexer and maintain two token lists, flipping their roles at each transition (sketched below). All you wind up doing is grabbing a few extra bytestring slices and lexing the body of your string with the general-purpose lexer, just in case. Nicely, if you have edge cases that cannot occur in one sublanguage or the other, you can use that to collapse out of the superposition of states into a single state. I'm using this approach for parsing CDATA sections in XML monoidally, which has the same issue.
-Edward Kmett
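A sketch of option 3's two-token-list idea, for a toy language whose only marker is a mode-flipping """ (Tok, lexChunk, and splitOn are illustrative names; a real lexer would do more than wrap each segment whole):

  import qualified Data.ByteString.Char8 as B

  data Tok = Code B.ByteString | Str B.ByteString
    deriving Show

  -- Lex a chunk under both starting hypotheses: segments between """
  -- markers alternate modes, so the two candidate streams differ only
  -- in which mode they assign to the first segment. Gluing chunks later
  -- keeps whichever stream matches the left neighbour's final mode.
  lexChunk :: B.ByteString -> ([Tok], [Tok])
  lexChunk bs = (zipWith ($) (cycle [Code, Str]) segs,
                 zipWith ($) (cycle [Str, Code]) segs)
    where
      segs = splitOn (B.pack "\"\"\"") bs

  -- Split on every occurrence of a non-empty pattern.
  splitOn :: B.ByteString -> B.ByteString -> [B.ByteString]
  splitOn pat s = case B.breakSubstring pat s of
    (pre, rest)
      | B.null rest -> [pre]
      | otherwise   -> pre : splitOn pat (B.drop (B.length pat) rest)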

Hello Anakim, Wednesday, September 9, 2009, 11:58:58 PM, you wrote:
The cache-thrashing effect you mention is certainly an issue I did not foresee. I guess a possible solution would be to base parMap on a different 'par' primitive: one that sparks a number of computations equal to the number of available OS threads and then blocks until at least one of them is done.
Actually, you should ask for the RTS -N value, which should be available via internal RTS structures. The user may then, optionally, link in code that sets RTS -N to the number of hardware threads.

-- Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com
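In GHC this value is in fact exposed as GHC.Conc.numCapabilities; a sketch of sizing work units from it, using the same parallel-1.x Strategies API as the original post (the factor of 4 is just a load-balancing heuristic):

  import GHC.Conc (numCapabilities)
  import Control.Parallel.Strategies (using, parListChunk, rnf, NFData)

  -- Derive the chunk size from the number of capabilities (-N) instead
  -- of hard-coding 250; a few chunks per capability helps load balance.
  parseChunked :: NFData b => (a -> b) -> [a] -> [b]
  parseChunked f xs = map f xs `using` parListChunk chunkSize rnf
    where
      chunkSize = max 1 (length xs `div` (numCapabilities * 4))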

Hello Anakim, Wednesday, September 9, 2009, 5:35:32 PM, you wrote:
MUT time 1.93s ( 1.99s elapsed) [serial]
MUT time 2.06s ( 1.19s elapsed) [-N2]
The speedup is very good; you usually don't get 2x faster execution, because cores compete for access to memory.

PS: Are you sure you are interpreting the results correctly? The wall time was reduced from 1.99s to 1.19s, which is about a 1.7x speedup.

-- Best regards,
Bulat mailto:Bulat.Ziganshin@gmail.com
participants (4)
- Anakim Border
- Bulat Ziganshin
- Don Stewart
- Edward Kmett