Re: [Haskell-cafe] Alternative instance for non-backtracking parsers

Hello, Olaf. I have some distrust of elegant solutions (one of them being C.P. libs).
I have a program that parses several CSV files, one of them 50MB in size, and writes its result as HTML. When I started optimizing, the execution time was 144 seconds. Profiling (thanks to Jasper Van der Jeugt for writing profiteur!) revealed that most of the time was spent parsing and postprocessing the 50MB CSV file. Changing the data structure of the postprocessing stage cut the execution time down to 32 seconds, but the majority was still spent on parsing.

Then I realized that (StateT String Maybe) is a parser which conveniently has all the class instances one needs; most notably, its Alternative instance makes it a backtracking parser. After defining a few combinators I was able to swap out my megaparsec parser for the new parser, which cut execution time in half. Now most of the parsing time is spent transforming text to numbers and dates. I doubt that parsing time can be reduced much further [*].

The new parser was identical to the old parser; only the combinators now come from another module. That is the elegant thing about monadic parser libraries. I will now use the fast parser by default, and if it returns a Nothing, the program will suggest a command line flag that switches to the original megaparsec parser, telling the user exactly where the parse failed and why. I am not sure whether there is another family of parsers with interfaces so similar that switching from one package to another is as effortless as it is with monadic parsers.

Cheers,
Olaf

[*] To the parser experts on this list: How much time should a parser take that processes a 50MB, 130000-line text file, extracting 5 values (String, UTCTime, Int, Double) from each line?
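For concreteness, a minimal sketch of such a parser (the combinator names here are illustrative, not necessarily those from the message above):

    -- A parser is a state transformer over the remaining input; Maybe
    -- provides failure. Functor/Applicative/Monad/Alternative instances
    -- all come for free from transformers.
    import Control.Applicative (many)
    import Control.Monad.State.Strict (StateT (..))

    type Parser = StateT String Maybe

    -- Consume one character satisfying a predicate.
    satisfy :: (Char -> Bool) -> Parser Char
    satisfy p = StateT $ \s -> case s of
      (c:cs) | p c -> Just (c, cs)
      _            -> Nothing

    char :: Char -> Parser Char
    char c = satisfy (== c)

    -- Backtracking comes from the Alternative instance: when one branch
    -- fails with Nothing, its partially consumed state is thrown away and
    -- the other branch sees the original input.
    field :: Parser String
    field = many (satisfy (/= ','))

    row :: Parser [String]
    row = (:) <$> field <*> many (char ',' *> field)

    -- runStateT row "a,b,c" == Just (["a","b","c"], "")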

On 30/08/2018 20.21, Olaf Klinke wrote:
[*] To the parser experts on this list: How much time should a parser take that processes a 50MB, 130000-line text file, extracting 5 values (String, UTCTime, Int, Double) from each line?
Not an expert, but for something as (relatively!) standard as CSV, I'd probably go for a specialized solution like 'cassava', which seems to do quite well according to https://github.com/haskell-perf/csv

Based purely on the lines/second numbers on that page and the number you've given, I'd guesstimate that your parsing could potentially be as fast as (3.185ms / 1000 lines) * 130000 lines = 414.05ms, i.e. about 0.4s. (Of course that still doesn't account for extracting the Int, Double, etc., but there are also specialized solutions for that which should be pretty hard to beat; see e.g. bytestring-lexing.)

It's also probably a bit less elegant than a generic parsec-like thing, but that's to be expected from a more special-case solution.

Regards,
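For reference, a hedged sketch of the cassava route (the file name is hypothetical, and the date field is read as a plain String here, since UTCTime would need a custom FromField instance):

    import qualified Data.ByteString.Lazy as BL
    import Data.Csv (HasHeader (NoHeader), decode)
    import qualified Data.Vector as V

    -- Each row: (String, date-as-String, Int, Double).
    type Row = (String, String, Int, Double)

    main :: IO ()
    main = do
      csv <- BL.readFile "input.csv"  -- hypothetical input file
      case decode NoHeader csv :: Either String (V.Vector Row) of
        Left err   -> putStrLn err
        Right rows -> print (V.length rows)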

On 08/30/2018 03:43 PM, Bardur Arantsson wrote:
Not an expert, but for something as (relatively!) standard as CSV, I'd probably go for a specialized solution like 'cassava', which seems like it does quite well according to https://github.com/haskell-perf/csv
Playing the devil's advocate here.... If parser combinators aren't great for "relatively standard" things such as CSV, then what *are* they good for?

On 31/08/2018 14.50, Bryan Richter wrote:
Playing the devil's advocate here....
If parser combinators aren't great for "relatively standard" things such as CSV, then what *are* they good for?
Off the top of my head:

a) They're often much more readable than many alternatives if you *don't* actually care *too* much about performance. (Especially if you have semantic actions; I find they're *much* more readable.)

b) They're usually a lot better for "tricky" languages which e.g. might not be context-free.

c) Embedding the "grammar" directly in Haskell means that there's no weirdness around integrating generated code/types with the rest of the program.

(Probably several other obvious things I'm forgetting.)

Regards,
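To make point (a) concrete, a small megaparsec-style sketch (the record and field names are made up): the grammar and the construction of the result live in one place.

    import Data.Void (Void)
    import Text.Megaparsec            -- Parsec, parseMaybe, some
    import Text.Megaparsec.Char      -- letterChar, digitChar, char

    type P = Parsec Void String

    data Entry = Entry { name :: String, score :: Int } deriving Show

    -- The semantic action (building an Entry) sits right in the grammar.
    entry :: P Entry
    entry = Entry <$> some letterChar <* char ',' <*> (read <$> some digitChar)

    -- parseMaybe entry "ann,42" == Just (Entry {name = "ann", score = 42})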

Hi Bryan,
If parser combinators aren't great for "relatively standard" things such as CSV, then what *are* they good for?
there is no such thing as a "parser combinator library". There are many different libraries for writing parsers, and these libraries each have vastly different design goals.

Attoparsec, for example, is specifically designed to be fast, and I don't see any reason why it shouldn't be possible to implement a high-performance CSV parsing library on top of that package. In fact, I believe Cassava does exactly that.

Parsec and the newer Megaparsec, on the other hand, put more emphasis on good error messages than on performance. Now, good error messages matter when you're parsing a sophisticated input language with many different non-trivial constructs, but for a CSV parser there's really not much to be done in terms of "good error messages" because the syntax is so simple. Therefore, Parsec might not be the best choice for that particular use case.

Anyway, I don't think it makes much sense to talk about all those libraries as if they were all the same and had all the same properties, because they don't.

Best regards,
Peter
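As a sketch of what that attoparsec route might look like (simplified: no quoting support, and the names are illustrative):

    import Data.Attoparsec.ByteString.Char8
      (Parser, char, endOfLine, sepBy1, takeTill)
    import qualified Data.ByteString as B

    -- One unquoted CSV field: everything up to the next separator or newline.
    field :: Parser B.ByteString
    field = takeTill (\c -> c == ',' || c == '\n' || c == '\r')

    -- One record: comma-separated fields terminated by a newline.
    row :: Parser [B.ByteString]
    row = field `sepBy1` char ',' <* endOfLine

    -- parseOnly row "a,b,c\n" == Right ["a","b","c"] (with OverloadedStrings)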

On 31.08.2018 at 16:55, Peter Simons wrote:
Parsec and the newer Megaparsec, on the other hand, put more emphasis on good error messages than on performance. Now, good error messages matter when you're parsing a sophisticated input language with many different non-trivial constructs, but for a CSV parser there's really not much to be done in terms of "good error messages" because the syntax is so simple. Therefore, Parsec might not be the best choice for that particular use case.
Indeed, since CSV is a regular language, I'd exploit that by using e.g. regex-applicative and enjoy guaranteed linear-time parsing.

Cheers,
Ben
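A sketch of what that might look like (unquoted CSV only; field/row are illustrative names):

    import Control.Applicative (many)
    import Text.Regex.Applicative (RE, match, psym, sym)

    -- A field is any run of non-comma characters.
    field :: RE Char String
    field = many (psym (/= ','))

    -- A row is one field followed by zero or more ","-prefixed fields.
    row :: RE Char [String]
    row = (:) <$> field <*> many (sym ',' *> field)

    -- match row "a,2018-08-30,42,3.14" == Just ["a","2018-08-30","42","3.14"]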

I'm tempted to say that for things like HTML and CSV, the real reason not to use a parser combinator library is that "standard" turns out to mean pretty much nothing whatsoever, and you want someone else's battle-tested workarounds, developed over years, for a "standard" format that's more or less a free-for-all.

On Fri, Aug 31, 2018 at 8:51 AM Bryan Richter wrote:
Playing the devil's advocate here....
If parser combinators aren't great for "relatively standard" things such as CSV, then what *are* they good for?
--
brandon s allbery kf8nh
allbery.b@gmail.com

On Aug 30, 2018, at 11:21, Olaf Klinke wrote:
[*] To the parser experts on this list: How much time should a parser take that processes a 50MB, 130000-line text file, extracting 5 values (String, UTCTime, Int, Double) from each line?
The combination of attoparsec + a streaming adapter for pipes/conduit/streaming should easily be able to handle tens of megabytes per second and hundreds of thousands of lines per second.

For an example, check out https://github.com/wyager/Callsigns/blob/master/Callsigns.hs which parses a pipe-separated-value file from the FCC pretty quickly. As I recall it goes through a >100MB file in under three seconds, and it has to do a bunch of other work besides. I also ported the above code to use Streaming instead of Pipes.

I recall that using Streaming master, the parser I use to read the dictionary,

    takeTill isEndOfLine <* endOfLine

handles about 3 million lines per second. I can't remember what the number is for Pipes, but it's probably similar. That's really good for such a simple thing to write! Unfortunately there is a performance bug in Streaming that's fixed in master but hasn't been released for a number of months :-/

—Will
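A hedged sketch of that pattern with conduit and conduit-attoparsec (the file name and the line counting are illustrative):

    import Conduit (foldlC, runConduitRes, sourceFile, (.|))
    import Data.Attoparsec.ByteString (takeTill)
    import Data.Attoparsec.ByteString.Char8 (Parser, endOfLine, isEndOfLine)
    import qualified Data.ByteString as B
    import Data.Conduit.Attoparsec (conduitParser)

    -- The line parser from the message above: everything up to the newline.
    line :: Parser B.ByteString
    line = takeTill isEndOfLine <* endOfLine

    main :: IO ()
    main = do
      n <- runConduitRes $
             sourceFile "input.txt"   -- hypothetical input file
          .| conduitParser line
          .| foldlC (\acc _ -> acc + 1) (0 :: Int)
      print n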

On 03.09.2018 at 17:29, Will Yager wrote:
The combination of attoparsec + a streaming adapter for pipes/conduit/streaming should easily be able to handle tens of megabytes per second and hundreds of thousands of lines per second.

That's good to know, so there is plenty of room for improvement left.
For an example, check out https://github.com/wyager/Callsigns/blob/master/Callsigns.hs
Which parses a pipe-separated-value file from the FCC pretty quickly. As I recall it goes through a >100MB file in under three seconds, and it has to do a bunch of other work besides.

The parser does nothing except chunk up the line's text and replace parts of it with constants. I'm surprised and pleased, though, that HashMaps have such good performance.
Profiling shows that my parser now spends most of its time converting to numbers and dates. I wrote a primitive

    skipSep :: Int -> Char -> Parser ()

which skips over input until it has read a given number of the given character. This cut overall execution time down from 12s to 6s, meaning the parsing time is down by more than 50%. It seems that combinators like *> and >>= and manyTill do have a non-negligible cost, so combining the parser from fewer parts makes it faster. Would other parser users agree?

Olaf
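A guess at what such a combinator might look like for the (StateT String Maybe) parser described earlier in the thread (the real definition may differ):

    import Control.Monad.State.Strict (StateT (..))

    type Parser = StateT String Maybe

    -- Skip input until n occurrences of the separator have been consumed,
    -- in one pass over the string instead of n combinator applications.
    skipSep :: Int -> Char -> Parser ()
    skipSep n0 sep = StateT (go n0)
      where
        go 0 s      = Just ((), s)
        go _ []     = Nothing
        go n (c:cs) = go (if c == sep then n - 1 else n) cs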
participants (7)
- Bardur Arantsson
- Ben Franksen
- Brandon Allbery
- Bryan Richter
- Olaf Klinke
- Peter Simons
- Will Yager