
On 13 Jan 2015, at 11:57 , Marco Vassena
wrote: In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
Consider for instance this html parser.
pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag
This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ]
In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour.
I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy): pContent = Content <$> pList (satisfy (/= '<')) An even better approach would be to reflect the structure of your HTML in your parser: pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive): pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c) (of course with appropriate definitions for pOpentag and pCloseTag. Finally the module BasicInstances contains a combinator pMunch which you could have used: pContent = Content <$> pMunch (/= '<') This is far more efficient since it operates directly at the state. Doaitse
Any help is appreciated.
All the best, Marco _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe