parsec or attoparsec for 40-50MB text files?

Hi,
My file is a pretty straightforward text file with a small amount of somewhat annoying state:
comments* config line comments* data line*
If there is no config line, it's an error. The data lines can have a variable number of values, and it matters how many values there are (hey, it's not my file format!). The data lines can also have a comment at the end.
My initial thought was to go with parsec, but the data files could be as large as 40-50MB, and upon further reading it really seemed like attoparsec would be better. Error handling wouldn't be too sophisticated: if a data line has something other than one or more floating point values and the optional comment, failing out with "error line X" is fine.
Parse time is somewhat critical only because I'll have multiple files to parse, so while 5-10 seconds is OK for one file, I have to multiply that by 5-10.
I've seen several comments saying that parsec can be slow, but so far I've been unable to find anything that quantifies "slow".
Any opinions on which would be better for my application (although I think I've just talked myself into using attoparsec)?
In particular, am I going to get at least reasonable "error on line X" error handling using attoparsec?
Thanks,
Brian

Hi Brian,
Parsec and Attoparsec have very similar interfaces (afaik the only
difference is that Attoparsec backtracks by default, so the "try"
combinator is a no-op) so there's no harm in trying both.
Alternatively: if the data format is simple enough, you can write the
parser by hand. The Data.Text.Read module may help if you pursue this
option. [1]
Chris
[1]: https://hackage.haskell.org/package/text-1.2.1.1/docs/Data-Text-Read.html
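To make the hand-written option concrete, here is a minimal sketch of a data-line parser built on `double` from Data.Text.Read. It is only an illustration under stated assumptions: the thread never says what the comment marker is, so `'#'` is a guess, and the names `parseDataLine` and `parseDataLines` are made up for this example.

```haskell
import qualified Data.Text as T
import Data.Text.Read (double)

-- Parse one data line: one or more whitespace-separated floating point
-- values, optionally followed by a comment.
-- ASSUMPTION: comments start with '#'; the thread never names the marker.
parseDataLine :: T.Text -> Either String [Double]
parseDataLine line =
  case T.words (T.takeWhile (/= '#') line) of
    []   -> Left "expected at least one value"
    toks -> mapM value toks
  where
    -- `double` returns the parsed number plus the unconsumed rest,
    -- so insist that the rest is empty.
    value tok = case double tok of
      Right (x, rest)
        | T.null rest -> Right x
        | otherwise   -> Left ("unexpected text in " ++ show tok)
      Left err -> Left err

-- Parse numbered data lines, stopping at the first bad one and
-- reporting its line number.
parseDataLines :: [(Int, T.Text)] -> Either String [[Double]]
parseDataLines = mapM check
  where
    check (n, l) = case parseDataLine l of
      Left err -> Left ("error line " ++ show n ++ ": " ++ err)
      Right vs -> Right vs
```

A caller could pair each line with its number via `zip [1 ..] (T.lines contents)`. The config line and any comment-only lines would have to be filtered out before this step, since the sketch treats a line with no values as an error.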
On Mon, Jun 8, 2015 at 11:04 AM, Brian wrote:
My file is a pretty straightforward text file with a small amount of somewhat annoying state:
comments* config line comments* data line*
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

offtopic, but since we are talking about Parsec/Attoparsec, is there a way
to have try by default in Parsec as well?
On Mon, Jun 8, 2015 at 9:23 AM, Chris Wong wrote:
Parsec and Attoparsec have very similar interfaces (afaik the only difference is that Attoparsec backtracks by default, so the "try" combinator is a no-op) so there's no harm in trying both.

You may want to try:
https://hackage.haskell.org/package/attoparsec-parsec
João
2015-06-08 2:36 GMT+01:00 Raphael Gaschignard:
offtopic, but since we are talking about Parsec/Attoparsec, is there a way to have try by default in Parsec as well?

On 06/08/2015 04:28 AM, João Cristóvão wrote:
You may want to try: https://hackage.haskell.org/package/attoparsec-parsec
There is also Edward Kmett's `parsers` library [1], which supports `parsec`, `attoparsec`, as well as his own `trifecta` library behind the same interface.
Cheers,
- Ben
[1] https://hackage.haskell.org/package/parsers

On Mon, 08 Jun 2015 04:28:31 -0400, Ben Gamari wrote:
There is also Edward Kmett's `parsers` library [1], which supports `parsec`, `attoparsec`, as well as his own `trifecta` library behind the same interface.
Good grief, the problem with Haskell parsers is that there are so many to choose from!
I'm going to start with attoparsec. One reason is that I really want to use the lazy interface, just to see if I can and how it works. Also, as Chris said, once I learn attoparsec I can learn the others, e.g. parsec, without too much trouble.
Error trapping is not too critical (although I am going to need a line number). The other slight complication is maintaining some sort of parse state, but I'll deal with that when I get to it, which probably won't be for a while...
Thanks very much for all the suggestions!
Brian
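On the line-number and parse-state points: attoparsec does not track source positions itself, so one possible approach is to split a lazy ByteString into lines, run a small attoparsec parser on each line, and carry the line number and the "config seen yet?" state in an ordinary recursive loop. The sketch below assumes a `'#'` comment marker and a config line that starts with the word `config`, neither of which is stated in the thread. The names `Line`, `lineP` and `parseFile` are hypothetical, and the sketch accepts comment lines anywhere, which is more lenient than the format described. It uses lazy ByteString input rather than the Data.Attoparsec.ByteString.Lazy module.

```haskell
import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy.Char8 as BL

-- One classified input line.
data Line = Blank | Config B.ByteString | Values [Double]

-- ASSUMPTION: comments start with '#' and run to the end of the line.
comment :: A.Parser ()
comment = A.char '#' *> A.skipWhile (const True)

-- Parse a whole line; trailing junk makes the parse fail.
-- ASSUMPTION: a config line starts with the literal word "config".
lineP :: A.Parser Line
lineP = A.skipSpace *> body <* A.skipSpace <* A.option () comment <* A.endOfInput
  where
    body = A.choice
      [ Config <$> (A.string (B.pack "config") *> A.skipSpace *> A.takeWhile (/= '#'))
      , Values <$> A.many1 (A.double <* A.skipSpace)
      , pure Blank  -- empty or comment-only line
      ]

-- Walk the numbered lines, tracking whether the config line has been seen.
-- Returns the raw config text and the rows of values.
parseFile :: BL.ByteString -> Either String (B.ByteString, [[Double]])
parseFile input = go Nothing [] (zip [1 ..] (BL.lines input))
  where
    go :: Maybe B.ByteString -> [[Double]] -> [(Int, BL.ByteString)]
       -> Either String (B.ByteString, [[Double]])
    go Nothing    _   [] = Left "no config line"
    go (Just cfg) acc [] = Right (cfg, reverse acc)
    go st acc ((n, l) : rest) =
      case A.parseOnly lineP (BL.toStrict l) of
        Left _           -> Left ("error line " ++ show n)
        Right Blank      -> go st acc rest
        Right (Config c) -> case st of
          Nothing -> go (Just c) acc rest
          Just _  -> Left ("error line " ++ show n ++ ": second config line")
        Right (Values vs) -> case st of
          Nothing -> Left ("error line " ++ show n ++ ": data before config line")
          Just _  -> go st (vs : acc) rest
```

Something like `parseFile <$> BL.readFile path` would read the input in chunks, although the accumulated rows are still held in memory. Whether this is fast enough for 40-50MB files would need to be measured.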
participants (5)
- Ben Gamari
- briand@aracnet.com
- Chris Wong
- João Cristóvão
- Raphael Gaschignard