Parsing a bunch of strings without backtracking

Hi all,

I came across an idea about parsing a bunch of strings and want to share it here, collect some comments, and maybe get pointers to related articles.

The idea is that when you have:

1 <$ string "aabb" <|> 2 <$ string "aacc"

you can rewrite it as:

string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")

so that you don't have to deal with the common prefix "aa" twice.

Extending this idea: if we have a set of strings (s1, s2, ...) in which no element is a prefix of any other, why not turn something like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...) into a tree structure, so that no common prefix is dealt with more than once?

This motivated my little toy implementation at https://gist.github.com/Javran/1cbfe9897d9a5c8fae5d20a33fd83813

Basically, given a list [(s1,v1),(s2,v2),...], we want to generate a parser equivalent to (v1 <$ string s1 <|> v2 <$ string s2 <|> ...), but with less backtracking. My code does this by turning the set of strings into a tree structure (see the function "compact") and using this "Compact" representation to drive the parser generation. I used ReadP as a concrete parser for testing, but this should work with any parsing library supporting MonadPlus.

I think there are a bunch of things that can be improved:

- For now every intermediate node of "Compact" contains just one char, so the generated parser ends up with branches like `char 'f' >> char 'o' >> char 'o'` instead of the slightly more efficient `string "foo"`.
- We could attach a weight to each string, which would allow rearranging (<|>) chains so that parsers for more frequent strings appear earlier.
- If the input [(s1,v1),(s2,v2),...] is entirely known at compile time, Template Haskell could probably be used to generate the parser (as a Haskell expression) at compile time. I'm not sure how difficult that would be, as I have very limited understanding of TH.

But before exploring further, I'd like to collect some comments, as I said at the beginning.

Best,
--
Javran (Fang) Cheng
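[For concreteness, here is a minimal sketch of the trie idea, an illustration only and not the code from the gist; the name compactParser and the exact shape of the recursion are made up for this sketch. It groups the alternatives by their leading character and recurses, so every shared prefix character is consumed exactly once.]

import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
import Data.Function (on)
import Data.List (groupBy, sortOn)

-- Build a parser equivalent to
--   v1 <$ string s1 <|> v2 <$ string s2 <|> ...
-- but with common prefixes factored out.
compactParser :: [(String, v)] -> ReadP v
compactParser alts = foldr (<|>) pfail (done ++ branches)
  where
    -- alternatives whose key is exhausted succeed at this node
    done = [pure v | ("", v) <- alts]
    -- group the remaining alternatives by their leading character,
    -- consume that character once, and recurse on the tails
    nonEmpty = [(s, v) | (s, v) <- alts, not (null s)]
    branches =
      [ char c >> compactParser [(cs, v) | (_:cs, v) <- grp]
      | grp@((c:_, _):_) <- groupBy ((==) `on` (head . fst)) (sortOn fst nonEmpty) ]

-- ghci> readP_to_S (compactParser [("aabb", 1), ("aacc", 2)]) "aacc"
-- [(2,"")]

Since this only uses char, pfail, pure and (<|>), the same recursion should port to any Alternative/MonadPlus-style parser, matching the remark above about ReadP being just one concrete choice.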

Have you seen uu-parsinglib (http://foswiki.cs.uu.nl/foswiki/HUT/ParserCombinators)? It implements a non-backtracking parser that works similarly to what you describe.

On 9 March 2017 01:21:09 GMT+01:00, Javran Cheng wrote:
> [...]

Hi Jaro,

Thanks for the link. I've worked with a few parsing libraries, but uu-parsinglib is new to me and I'm definitely thinking about giving it a try. I'm taking a look, but I haven't yet found any part of the code that attempts to factor common prefixes out.

On Wed, Mar 8, 2017 at 7:48 PM, Jaro wrote:
> [...]

--
Javran (Fang) Cheng

After thinking about it some more (I still don't completely understand uu-parsinglib myself), I think it achieves non-backtracking by running the parsers in parallel and returning the first one that succeeds, so I guess it doesn't really optimize away common prefixes. The 'best' function is the one that merges two "sequences" of the intermediate type 'Steps a', and that is what causes the parsers to be run in parallel. The source is here: https://hackage.haskell.org/package/uu-parsinglib-2.9.1.1/docs/src/Text-ParserCombinators-UU-Core.html#best

On 9 March 2017 03:41:30 GMT+01:00, Javran Cheng wrote:
> [...]

On 09.03.2017 at 01:21, Javran Cheng wrote:
> I come across an idea about parsing a bunch of strings, just want to share this idea and looking for some comments and maybe pointing me to some related articles.
> The idea is when you have:
> 1 <$ string "aabb" <|> 2 <$ string "aacc"
> you can rewrite it as:
> string "aa" >> (1 <$ string "bb" <|> 2 <$ string "cc")
> so that you don't have to deal with the common prefix "aa" twice.
In the parsing literature this is called "left factoring (a grammar)". Left factoring is known to improve efficiency and there are published algorithms to do it automatically; but -- as you observed -- to perform it you need an explicit representation of the grammar, so you can transform it before turning it into a parser; some would call this a "deep embedding".
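[To illustrate the "explicit representation of the grammar" point, here is a tiny untyped deep-embedding sketch with made-up names (Gram, flat, leftFactor, toParser); the well-typed approaches mentioned later in this message are far more involved. The grammar is ordinary data, left factoring is a pure function over it, and a separate interpreter turns the transformed grammar into a ReadP parser.]

import Text.ParserCombinators.ReadP
import Control.Applicative ((<|>))
import Data.Function (on)
import Data.List (groupBy, sortOn)

data Gram v
  = Accept v             -- succeed with a result
  | Lit Char (Gram v)    -- match one character, then continue
  | Choice [Gram v]      -- alternatives
  deriving Show

-- The flat grammar for  v1 <$ string s1 <|> v2 <$ string s2 <|> ...
flat :: [(String, v)] -> Gram v
flat alts = Choice [foldr Lit (Accept v) s | (s, v) <- alts]

-- Left-factor a Choice of literal chains: group branches by their leading
-- character and recurse, so each shared prefix appears only once.  (This
-- sketch only factors the flat shape produced by 'flat', and it reorders
-- the branches, which is harmless for ReadP's symmetric choice.)
leftFactor :: Gram v -> Gram v
leftFactor (Choice gs) = Choice (other ++ factored)
  where
    lits  = [(c, k) | Lit c k <- gs]
    other = [g | g <- gs, not (isLit g)]
    isLit (Lit _ _) = True
    isLit _         = False
    factored =
      [ Lit c (leftFactor (Choice [k | (_, k) <- grp]))
      | grp@((c, _):_) <- groupBy ((==) `on` fst) (sortOn fst lits) ]
leftFactor (Lit c k) = Lit c (leftFactor k)
leftFactor g         = g

-- Interpret the (transformed) grammar as a parser.
toParser :: Gram v -> ReadP v
toParser (Accept v)  = pure v
toParser (Lit c k)   = char c >> toParser k
toParser (Choice gs) = foldr ((<|>) . toParser) pfail gs

-- ghci> readP_to_S (toParser (leftFactor (flat [("aabb", 1), ("aacc", 2)]))) "aacc"
-- [(2,"")]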
> Then I further extend this idea: if we have a set of strings (s1, s2, ...), whose element is not a prefix of any other set member, why don't we turn things like (v1 <$ string s1 <|> v2 <$ string s2 <|> ...) into a "tree-structure" so we don't need to deal with any common prefix more than once?
The problem is how to represent the grammar in a statically typed way and then to transform it while preserving the well-typing. This can be done, but the price is high. Look for packages (and papers) named 'christmastree' and 'tttas' to get an idea. Even the authors admit that their approach is extremely involved and that offline methods (such as TH, which you mentioned) may be more practical.

Cheers
Ben
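[As a rough, untested guess at what the "offline" Template Haskell route could look like (compactExp is a made-up name and this has not been checked against any existing package): instead of building the parser at run time, build the already-factored parser expression at compile time from the constant table. Because of TH's stage restriction, the splice has to live in a module other than the one defining compactExp.]

{-# LANGUAGE TemplateHaskell #-}
module CompactTH (compactExp) where

import Control.Applicative ((<|>))
import Data.Function (on)
import Data.List (groupBy, sortOn)
import Language.Haskell.TH (Exp, Q)
import Language.Haskell.TH.Syntax (Lift (lift))
import Text.ParserCombinators.ReadP

-- Build, at compile time, the expression of the prefix-factored parser.
-- Use from another module as:
--   parser :: ReadP Int
--   parser = $(compactExp [("aabb", 1 :: Int), ("aacc", 2)])
compactExp :: Lift v => [(String, v)] -> Q Exp
compactExp alts = foldr orElse [| pfail |] (done ++ branches)
  where
    orElse p q = [| $p <|> $q |]
    -- exhausted keys become 'pure <value>'; values are lifted into the AST
    done     = [ [| pure $(lift v) |] | ("", v) <- alts ]
    nonEmpty = [ (s, v) | (s, v) <- alts, not (null s) ]
    -- one 'char c >> ...' branch per distinct leading character
    branches =
      [ [| char c >> $(compactExp [(cs, v) | (_:cs, v) <- grp]) |]
      | grp@((c:_, _):_) <- groupBy ((==) `on` (head . fst)) (sortOn fst nonEmpty) ]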