
Dear Haskellers,

megaparsec [1] is a parsing library with excellent error reporting. Being able to emit decent parse error messages is traded off against speed, however. Wouldn't it be nice to have a faster parser for your daily parsing needs, but obtain megaparsec's informative error messages for the rare cases when things go wrong?

Thanks to megaparsec's parser type class, MonadParsec, you can have both: I wrote a MonadParsec instance for StateT s Maybe (s being the input stream type, such as Text or ByteString) which does not do all the bookkeeping that ParsecT needs to do. If you manage to write your parser using only the type class primitives, then you can specialize your parser to StateT s Maybe. Only if that returns Nothing do you specialize again to ParsecT Void s and parse the input again, thus obtaining a decent error message. The package faster-megaparsec [2,3] provides a convenient function to do just that.

Of course StateT s Maybe cannot provide all the features ParsecT has:

* StateT s Maybe is always backtracking.
* It does not know the offset in the stream.
* It must hold the entire input stream in memory for a second pass, so no garbage-collecting the consumed input while parsing.

So if your parsing relies on non-backtracking choice, stream offsets or garbage-collecting input while parsing, you cannot use faster-megaparsec as a drop-in replacement.

Happy parsing!
Olaf

[1] https://hackage.haskell.org/package/megaparsec
[2] https://hackage.haskell.org/package/faster-megaparsec
[3] https://hub.darcs.net/olf/faster-megaparsec
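To illustrate, here is a minimal sketch of the two-pass pattern. parseTwoPass is illustrative, not the package's actual API; it assumes the MonadParsec instance for StateT Text Maybe from faster-megaparsec is in scope.

    {-# LANGUAGE RankNTypes, ScopedTypeVariables #-}

    import Control.Monad.State.Strict (StateT, runStateT)
    import Data.Text (Text)
    import Data.Void (Void)
    import Text.Megaparsec (MonadParsec, ParseErrorBundle, Parsec, runParser)
    -- also import the faster-megaparsec module that provides the
    -- MonadParsec instance for StateT Text Maybe

    -- A parser written only against the MonadParsec primitives can be
    -- specialized to either concrete monad.
    type PolyParser a = forall m. MonadParsec Void Text m => m a

    -- Illustrative helper: run the cheap StateT Text Maybe pass first;
    -- only on failure re-run the slow ParsecT pass to recover a proper
    -- error message.
    parseTwoPass
      :: forall a. PolyParser a
      -> Text
      -> Either (ParseErrorBundle Text Void) a
    parseTwoPass p input =
      case runStateT (p :: StateT Text Maybe a) input of
        Just (x, _rest) -> Right x
        Nothing         -> runParser (p :: Parsec Void Text a) "<input>" input

Because StateT s Maybe always backtracks, a parser that relies on non-backtracking choice may behave differently in the two passes; that is the first limitation above.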

Do you provide reproducible benchmarks for these claims? I'd love to see the outcome for various types of projects to make the choice for myself. Unsubstantiated claims of "This is Faster™" are the path to shame and treachery.

On Tue, Nov 08, 2022 at 1:31 PM, Olaf Klinke <olf@aatal-apotheke.de> wrote:
> [...]

Hi,
Recently I was improving the performance of the simple-sql-parser library, which sits on top of megaparsec. The main issue was memory consumption, not speed: a machine-generated SQL query can weigh in at megabytes.

I had success with a deduplicating cache for lexer tokens.
A complex parser produces very chunky lazy Text. To avoid traversing those linked lists of chunks (poor cache locality) and copying them into a strict Text, I record the offsets into the input Text before and after a parser returns successfully and take a shallow substring of the input instead. That substring is deduplicated with high probability and costs no extra memcpy. A sketch of both tricks follows below.
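A minimal sketch of both tricks (names are illustrative, and the cache is threaded through the lexer by hand here):

    import qualified Data.HashMap.Strict as HM
    import           Data.Text (Text)
    import qualified Data.Text as T

    -- Deduplicating cache: identical token texts end up sharing one
    -- heap object instead of each occurrence keeping its own copy.
    type Cache = HM.HashMap Text Text

    dedup :: Text -> Cache -> (Text, Cache)
    dedup tok cache = case HM.lookup tok cache of
      Just shared -> (shared, cache)                 -- reuse cached copy
      Nothing     -> (tok, HM.insert tok tok cache)  -- first occurrence

    -- Shallow substring between two character offsets: Text slices
    -- share the underlying buffer, so no bytes are copied.
    slice :: Int -> Int -> Text -> Text
    slice start end = T.take (end - start) . T.drop start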
Another interesting idea I would like to try is to use ByteString for the input and to mmap the input file. A SQL query is 99% ASCII, and all keywords are ASCII bytes. Knowing that constraint, the parser doesn't have to spend time reconstructing characters out of multibyte sequences, which cannot occur where a keyword is expected.
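A hedged sketch of byte-level keyword recognition (case-insensitivity of SQL keywords and megaparsec integration are omitted for brevity):

    import           Data.ByteString (ByteString)
    import qualified Data.ByteString.Char8 as BC
    import           Data.Char (isAlphaNum)

    -- Keywords are pure ASCII, so matching one is a byte-wise prefix
    -- test; no UTF-8 decoding happens at these positions.
    keyword :: ByteString -> ByteString -> Maybe ByteString
    keyword kw input = do
      rest <- BC.stripPrefix kw input          -- O(length kw) comparison
      case BC.uncons rest of
        Just (c, _) | isIdentChar c -> Nothing -- reject "selection" for "select"
        _                           -> Just rest
      where
        isIdentChar c = isAlphaNum c || c == '_'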
Megaparsec doesn't have a digital search tree (trie), which helps a lot with avoiding backtracking.
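For reference, a minimal digital search tree over Char might look like this; a single left-to-right pass decides among all keywords at once instead of trying each alternative and backtracking:

    import qualified Data.Map.Strict as M

    -- A node accepts if some keyword ends here; edges are labelled
    -- with the next character.
    data Trie = Trie { accepts :: Bool, edges :: M.Map Char Trie }

    emptyTrie :: Trie
    emptyTrie = Trie False M.empty

    insertKw :: String -> Trie -> Trie
    insertKw []     (Trie _ m) = Trie True m
    insertKw (c:cs) (Trie a m) =
      Trie a (M.insertWith (\_new old -> insertKw cs old) c (insertKw cs emptyTrie) m)

    -- Longest keyword that is a prefix of the input, plus the rest.
    matchKw :: Trie -> String -> Maybe (String, String)
    matchKw t0 s0 = go t0 s0 [] Nothing
      where
        go (Trie acc m) s pre best =
          let best' = if acc then Just (reverse pre, s) else best
          in case s of
               (c:cs) | Just t <- M.lookup c m -> go t cs (c : pre) best'
               _                               -> best'

For example, matchKw (foldr insertKw emptyTrie ["select", "set"]) "select *" yields Just ("select", " *").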
Thanks
On Tue, Nov 8, 2022 at 3:32 PM Olaf Klinke wrote:
> [...]
--
Best regards,
Daniil Iaitskov

On Mon, 2022-11-14 at 22:34 -0500, Daneel Yaitskov wrote:
> Hi,
> Recently I was improving the performance of the simple-sql-parser library, which sits on top of megaparsec. The main issue was memory consumption, not speed: a machine-generated SQL query can weigh in at megabytes.

Parsing megabytes of input should be no problem. The Haskell representation of the highly structured syntax is likely much larger.
> I had success with a deduplicating cache for lexer tokens.
Aha, so the input stream to parse is a stream of pointers, where identical tokens point to the same memory reference? It's not in your fork [1] yet, is it? The same trick might be possible with a Lempel-Ziv parse as well (so you run your parser on compressed input). The primitive stream operations like uncons become more expensive than usual, though.

[1] https://github.com/yaitskov/simple-sql-parser/blob/master/Language/SQL/Simpl...
> A complex parser produces very chunky lazy Text. To avoid traversing those linked lists of chunks (poor cache locality) and copying them into a strict Text, I record the offsets into the input Text before and after a parser returns successfully and take a shallow substring of the input instead. That substring is deduplicated with high probability and costs no extra memcpy.
> Another interesting idea I would like to try is to use ByteString for the input and to mmap the input file. A SQL query is 99% ASCII, and all keywords are ASCII bytes. Knowing that constraint, the parser doesn't have to spend time reconstructing characters out of multibyte sequences, which cannot occur where a keyword is expected.
I never dug into the internals of parsers enough to provide any meaningful comments on this; faster-megaparsec was low-hanging fruit. The next thing I might try is to Church-encode the StateT, since attoparsec and megaparsec are doing that, too; a sketch of such an encoding follows below.

What annoys me more [2] about SQL is that its interface is purely text-based in most flavors, and the syntax is not optimized for parsing. E.g. even if the SQL server holds binary numbers, and our client program represents numbers in binary, in the SQL interface numbers have to be serialized/deserialized as (Byte)Strings. Others have commented that this might not be the most severe bottleneck, though.

[2] https://mail.haskell.org/pipermail/haskell-cafe/2022-August/135444.html
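A sketch of what that encoding could look like (illustrative; not code from either library): the Maybe and pair constructors disappear into a success continuation and a failure result.

    {-# LANGUAGE RankNTypes #-}

    -- Church-encoded StateT s Maybe: instead of allocating Just/Nothing
    -- and pairs at every bind, thread a success continuation and a
    -- failure result.
    newtype FastP s a = FastP
      { unFastP :: forall r. s -> (a -> s -> r) -> r -> r }

    instance Functor (FastP s) where
      fmap f (FastP p) = FastP $ \s ok err -> p s (ok . f) err

    instance Applicative (FastP s) where
      pure x = FastP $ \s ok _ -> ok x s
      FastP pf <*> FastP px =
        FastP $ \s ok err -> pf s (\f s' -> px s' (ok . f) err) err

    instance Monad (FastP s) where
      FastP p >>= k =
        FastP $ \s ok err -> p s (\x s' -> unFastP (k x) s' ok err) err

    -- Recover the direct-style StateT s Maybe result at the end.
    runFastP :: FastP s a -> s -> Maybe (a, s)
    runFastP (FastP p) s = p s (\x s' -> Just (x, s')) Nothing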
> Megaparsec doesn't have a digital search tree (trie), which helps a lot with avoiding backtracking.
> Thanks
participants (3):
- Daneel Yaitskov
- Emily Pillmore
- Olaf Klinke