You need a little bit of bookkeeping for the case where a backslash and a newline straddle a chunk boundary, but that bookkeeping bundles up into the monoid and stays hidden from you.
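As a sketch of that bookkeeping (an illustration, not the actual code): a monoid that counts *logical* newlines, i.e. physical newlines minus spliced backslash-newline pairs, and repairs a pair that straddles an append.

```haskell
-- Per-chunk summary: logical newlines, plus just enough boundary
-- information to repair a "\\\n" pair split across two chunks.
data Lines = Lines
  { logical  :: Int   -- logical newlines in this chunk
  , startsNL :: Bool  -- first character is '\n'
  , endsBS   :: Bool  -- last character is '\\'
  , nullCh   :: Bool  -- chunk is empty
  }

instance Semigroup Lines where
  l <> r
    | nullCh l  = r
    | nullCh r  = l
    | otherwise = Lines
        { logical  = logical l + logical r - splice
        , startsNL = startsNL l
        , endsBS   = endsBS r
        , nullCh   = False }
    where -- a backslash at the end of the left chunk followed by a
          -- newline at the start of the right chunk is one spliced pair
          splice = if endsBS l && startsNL r then 1 else 0

instance Monoid Lines where
  mempty = Lines 0 False False True

measure :: String -> Lines
measure "" = mempty
measure s  = Lines (go s) (head s == '\n') (last s == '\\') False
  where
    go ('\\':'\n':cs) = go cs          -- spliced pair inside the chunk
    go ('\n':cs)      = 1 + go cs
    go (_:cs)         = go cs
    go []             = 0
```

The point is that `measure` on two half-chunks composes to the same logical count as `measure` on the whole string, so a consumer of the monoid never sees the boundary case.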
As for triple quoting or mixed-mode languages, I have explored three options:
1.) I have a set of parser combinators that I've been working on that parse fully monoidally; they are sufficient for context-free attribute grammars, which can tackle that problem head on. But you give up the monadic sugar that makes parsec so syntactically sweet. (Though I need to change the project's name, since another project named parsimony just appeared on hackage!)
2.) You can define a reversible lexer, whereby given a list of tokens you can regenerate the original input string, but then you spend a lot of time gluing string fragments back together when you started with a perfectly good bytestring.
3.) My favorite option: using the iteratee-backed parsec parser from the post, you can 'slice' the input bytestring between two cursor positions efficiently. When you start parsing in a superposition of states and the only marker flips your mode rather than cleanly starting or ending it (especially when one mode is boring, like the inside of a triple-quoted string), just mark the location where your lexer started and maintain two token lists, flipping their roles at each transition. All you wind up doing is grabbing a few extra bytestring slices and lexing the body of your string with the general-purpose lexer just in case. Nicely, if you have edge cases that cannot occur in one sublanguage or the other, you can use them to collapse out of the superposition of states into a single state. I'm using this approach for parsing CDATA sections in XML monoidally, which have the same issue.
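A minimal sketch of the two-token-list trick, using plain String in place of bytestring slices; `lexCode` and `lexStr` are toy stand-ins for the real sublanguage lexers, not code from the post:

```haskell
data Mode = Code | Str deriving (Eq, Show)

-- Split a chunk at each occurrence of the mode-flipping marker
-- (here a Python-style triple quote), dropping the marker itself.
segments :: String -> [String]
segments s = case breakOn "\"\"\"" s of
  (pre, "")   -> [pre]
  (pre, rest) -> pre : segments (drop 3 rest)  -- 3 = marker length

-- Split at the first occurrence of the pattern, keeping the
-- pattern in the suffix (like Data.List.Extra.breakOn).
breakOn :: String -> String -> (String, String)
breakOn pat = go ""
  where
    go acc [] = (reverse acc, [])
    go acc t@(c:cs)
      | take (length pat) t == pat = (reverse acc, t)
      | otherwise                  = go (c : acc) cs

-- Lex the chunk in a superposition of both starting modes,
-- maintaining two token lists whose roles flip at each marker.
lexBoth :: String -> ([String], [String])
lexBoth s = (interleave Code segs, interleave Str segs)
  where
    segs = segments s
    interleave _    []     = []
    interleave Code (x:xs) = lexCode x ++ interleave Str  xs
    interleave Str  (x:xs) = lexStr  x ++ interleave Code xs
    lexCode = words                 -- toy general-purpose lexer
    lexStr x = ["STR:" ++ x]        -- toy string-body lexer
```

Once the sequential pass reaches this chunk and the true starting mode is known, one list is kept and the other discarded; an edge case that is illegal in one sublanguage lets you discard early.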
-Edward Kmett
On Wed, Sep 9, 2009 at 3:58 PM, Anakim Border
<akborder@gmail.com> wrote:
Hi Edward,
I read your slides a few weeks ago when they were posted on Reddit and
found your approach very interesting. In fact it provided the
inspiration for the parser I've written.
I did not use the same strategy, however, because parsing makefiles
poses its own challenges. The make syntax is actually the union of two
different languages, one for defining targets and one for commands.
It's easy to identify commands following target definitions (they
belong to lines starting with a tab), but if you jump to the middle of
a makefile you can't decide, for example, if you are inside a define
block (that should be parsed like a command) because the lines in its
body are not marked by any specific prefix. Thus the need for a
sequential scan. All this may seem rather involved and specific to an
unusual format, but one encounters the same problem when parsing
here-documents in shell scripts or multi-line (triple-quoted) comments
in Python.
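For illustration, a made-up fragment (not from the thread):

```make
define banner
echo building
endef

all:
	echo building
```

The two `echo building` lines belong to different languages: the second is a command, marked by its leading tab, while the first is a line in a define body, which carries no distinguishing prefix at all. A scanner dropped at the first `echo building` cannot tell it apart from a top-level makefile line without context from earlier in the file.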
The cache thrashing effect you mention is certainly an issue I did
not foresee. I guess a possible solution would be to base parMap on
a different 'par' primitive: one that sparks a number of
computations equal to the number of available OS threads and then
blocks until at least one of them is done.
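That primitive could be approximated like this (a sketch under stated assumptions: plain forkIO threads and a quantity semaphore standing in for a spark-based version, and `parMapBounded` is a made-up name):

```haskell
import Control.Concurrent
import Control.Concurrent.QSem
import Control.Monad (forM)

-- Evaluate 'f' over the list with at most 'n' evaluations in
-- flight at once (e.g. n = the number of capabilities), blocking
-- the caller until every result is available.
parMapBounded :: Int -> (a -> b) -> [a] -> IO [b]
parMapBounded n f xs = do
  sem  <- newQSem n
  vars <- forM xs $ \x -> do
    v <- newEmptyMVar
    _ <- forkIO $ do
      waitQSem sem
      putMVar v $! f x    -- force the result on the worker thread
      signalQSem sem
    return v
  mapM takeMVar vars      -- collect results in input order
```

The semaphore gives the bounded fan-out; results still come back in input order because the MVars are taken in list order.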
-- AB