
In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input. Consider for instance this html parser. pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ] In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour. I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>. Any help is appreciated. All the best, Marco

Hi Marco,
I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
the 'micro' function attaches a cost a branch of the parse tree, and the library generally prefers the branch with the lowest cost. Maybe this is what you're looking for? Best regards, Peter

On 13 Jan 2015, at 11:57 , Marco Vassena
wrote: In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
Consider for instance this html parser.
pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag
This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ]
In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour.
I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy): pContent = Content <$> pList (satisfy (/= '<')) An even better approach would be to reflect the structure of your HTML in your parser: pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive): pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c) (of course with appropriate definitions for pOpentag and pCloseTag. Finally the module BasicInstances contains a combinator pMunch which you could have used: pContent = Content <$> pMunch (/= '<') This is far more efficient since it operates directly at the state. Doaitse
Any help is appreciated.
All the best, Marco _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
pContent = Content <$> pList (satisfy (/= '<'))
Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms of <<|>.
An even better approach would be to reflect the structure of your HTML in your parser:
pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c)
(of course with appropriate definitions for pOpentag and pCloseTag.
Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>: <h1> Line break tags are <br> not closed </h1> The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to have a rather abstract interface. In my set of core constructs there are: <$> : (a -> b) -> f a -> f b <*> : f (a -> b) -> f a -> f b <|> : f a -> f a -> f a -- (symmetric choice) pure : a -> f a fail : f a pToken : f Char Is it possible to define a parser that applies the longest matching rule using these constructs only? Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ? (Note that f is abstract and it is not necessarily uu-parsinglib parsers). Marco On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:
On 13 Jan 2015, at 11:57 , Marco Vassena
wrote: In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
Consider for instance this html parser.
pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag
This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ]
In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour.
I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
pContent = Content <$> pList (satisfy (/= '<'))
An even better approach would be to reflect the structure of your HTML in your parser:
pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c)
(of course with appropriate definitions for pOpentag and pCloseTag.
Finally the module BasicInstances contains a combinator pMunch which you could have used:
pContent = Content <$> pMunch (/= '<')
This is far more efficient since it operates directly at the state.
Doaitse
Any help is appreciated.
All the best, Marco _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On 13 Jan 2015, at 14:25 , Marco Vassena
wrote: The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
pContent = Content <$> pList (satisfy (/= '<'))
Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms of <<|>.
An even better approach would be to reflect the structure of your HTML in your parser:
pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c)
(of course with appropriate definitions for pOpentag and pCloseTag.
Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>: <h1> Line break tags are <br> not closed </h1>
The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to have a rather abstract interface. In my set of core constructs there are: <$> : (a -> b) -> f a -> f b <*> : f (a -> b) -> f a -> f b <|> : f a -> f a -> f a -- (symmetric choice) pure : a -> f a fail : f a pToken : f Char
Is it possible to define a parser that applies the longest matching rule using these constructs only? Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ? (Note that f is abstract and it is not necessarily uu-parsinglib parsers).
Marco
I do not think so. As long as your grammar is essentially ambiguous you will get all parses. Using the amb combinator you can capture this, and avoid the runtime error. Then you can lateron decide which tree too take. I am afraid however that this will may lead to an exponential blowup in your case. If the set og <br>like tags is statically known they might be taken care of in a special way? by the parser? Doaitse
On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:
On 13 Jan 2015, at 11:57 , Marco Vassena
wrote: In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
Consider for instance this html parser.
pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag
This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ]
In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour.
I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
pContent = Content <$> pList (satisfy (/= '<'))
An even better approach would be to reflect the structure of your HTML in your parser:
pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c)
(of course with appropriate definitions for pOpentag and pCloseTag.
Finally the module BasicInstances contains a combinator pMunch which you could have used:
pContent = Content <$> pMunch (/= '<')
This is far more efficient since it operates directly at the state.
Doaitse
Any help is appreciated.
All the best, Marco _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Marco Vassena wrote:
The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to have a rather abstract interface. In my set of core constructs there are: <$> : (a -> b) -> f a -> f b <*> : f (a -> b) -> f a -> f b <|> : f a -> f a -> f a -- (symmetric choice) pure : a -> f a fail : f a pToken : f Char
Is it possible to define a parser that applies the longest matching rule using these constructs only?
No, these are not sufficient. Longest-matching rule (aka maximal munch) is defined as applying a parser for as long as it succeeds. Therefore, we need a way to determine if a parser succeeded: we need what is called a (monadic) reflection. In simpler words, we need an operation like the soft-cut in Prolog. It can be defined in terms of a primitive called `msplit', described in the LogicT paper: http://okmij.org/ftp/Computation/monads.html#LogicT The following article talks about maximal munch in more detail, in terms of logic programming: http://okmij.org/ftp/kakuritu/logic-programming.html#many

On 15-01-13 08:25 AM, Marco Vassena wrote:
Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>: <h1> Line break tags are <br> not closed </h1>
The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to have a rather abstract interface. In my set of core constructs there are: <$> : (a -> b) -> f a -> f b <*> : f (a -> b) -> f a -> f b <|> : f a -> f a -> f a -- (symmetric choice) pure : a -> f a fail : f a pToken : f Char
Is it possible to define a parser that applies the longest matching rule using these constructs only? Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ? (Note that f is abstract and it is not necessarily uu-parsinglib parsers).
You can parse HTML with no ambiguous results if you allow monadic bind (>>=) as well: pTag = pElement <|> pCommentTag <|> pContent pElement = do elemName <- pOpenTag elemContent <- pTag `manyTill` endElement elemName endElement elemName = string "" *> string elemName *> string ">" <|> lookahead (string "" *> some (satisfy (/= '>')) *> string ">") pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag Mind you, this code would not give you exactly the same parse tree as an HTML 5 browser would. That spec is a nightmare.

This is the difference between Parsing Expression Grammars and more general Context-Free Grammars. https://en.wikipedia.org/wiki/Parsing_expression_grammar You'll have to re-write your grammar to be unambiguous if you want to use a library based on CFGs. Marco Vassena writes:
In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
Consider for instance this html parser.
pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag
This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ]
In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour.
I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
Any help is appreciated.
All the best, Marco _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Kyle Marek-Spartz
participants (6)
-
Kyle Marek-Spartz
-
Marco Vassena
-
Mario Blažević
-
oleg@okmij.org
-
Peter Simons
-
S. Doaitse Swierstra