
On 13 Jan 2015, at 14:25 , Marco Vassena
wrote: The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
pContent = Content <$> pList (satisfy (/= '<'))
Checking the source code pList is defined in terms of pFoldr, which uses opt, which eventually is defined in terms of <<|>.
An even better approach would be to reflect the structure of your HTML in your parser:
pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c)
(of course with appropriate definitions for pOpentag and pCloseTag.
Unfortunately in html there are also empty tags, which don't need to be closed. For instance the line-break tag <br>: <h1> Line break tags are <br> not closed </h1>
The bigger picture is that I am trying to figure out what are the core constructs needed to define a parser, therefore I want to have a rather abstract interface. In my set of core constructs there are: <$> : (a -> b) -> f a -> f b <*> : f (a -> b) -> f a -> f b <|> : f a -> f a -> f a -- (symmetric choice) pure : a -> f a fail : f a pToken : f Char
Is it possible to define a parser that applies the longest matching rule using these constructs only? Or is it necessary to extend it with another primitive, for instance greedy choice <<|> ? (Note that f is abstract and it is not necessarily uu-parsinglib parsers).
Marco
I do not think so. As long as your grammar is essentially ambiguous you will get all parses. Using the amb combinator you can capture this, and avoid the runtime error. Then you can lateron decide which tree too take. I am afraid however that this will may lead to an exponential blowup in your case. If the set og <br>like tags is statically known they might be taken care of in a special way? by the parser? Doaitse
On 13/gen/2015, at 12.22, S. Doaitse Swierstra wrote:
On 13 Jan 2015, at 11:57 , Marco Vassena
wrote: In the uu-parsinglib the choice combinator (<|>) is symmetric and does not commit to any alternative. If the grammar being parsed is ambiguous, the corresponding parser will fail at run time on an ambiguous input.
Consider for instance this html parser.
pTag = pOpenTag <|> pCloseTag <|> pCommentTag <|> pContent
pContent = Content <$> some (satisfy (/= '<')) pHtml = some pTag
This parser will fail on "<a>123</a>", because this may be interpreted as: [ [Open a , Content "123", Close a], [Open a, Content "12", Content "3", Close a], [Open a, Content "1", Content "2", Content "3", Close a] ... ]
In other parsing libraries such as parsec the choice operator <|> is greedy and commits to the first alternative that makes any progress, so that some is greedy and Content "123" would be parsed. The operator <<|> in uu-parsinglib has the same greedy behaviour.
I would like to disambiguate the grammar so that the first result ([Open a, Content "123", Close a]) is selected (longest matching rule). I suppose that this may be done using <<|> to define a greedySome to be used in in pContent, however I am wondering whether there is another way to do this, without using such operator <<|>.
The thing to do it to use the pList combinator which is greedy (the pList_ng counterpart is non-greedy):
pContent = Content <$> pList (satisfy (/= '<'))
An even better approach would be to reflect the structure of your HTML in your parser:
pContent = tag_semantics <$> pOpenTag <*> pContent <* pCloseTag
In case you re sure about the correctness of your HTML you might want to enforce, by using a monadic construct to check for the closing tag to correspons to the opening tag (note that I am in general not in favour of using monads, since they make the abstrcat interpretation going on expensive):
pTagged = do t <- pOpentag c <- pContents pClosetag t return (Tagged t c)
(of course with appropriate definitions for pOpentag and pCloseTag.
Finally the module BasicInstances contains a combinator pMunch which you could have used:
pContent = Content <$> pMunch (/= '<')
This is far more efficient since it operates directly at the state.
Doaitse
Any help is appreciated.
All the best, Marco _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe