
On 15-01-13 08:25 AM, Marco Vassena wrote:
Unfortunately, in HTML there are also empty tags, which do not need to be closed. For instance, the line-break tag <br>:

    <h1> Line break tags are <br> not closed </h1>

The bigger picture is that I am trying to figure out which core constructs are needed to define a parser, so I want a rather abstract interface. My set of core constructs contains:

    <$>    : (a -> b) -> f a -> f b
    <*>    : f (a -> b) -> f a -> f b
    <|>    : f a -> f a -> f a    -- (symmetric choice)
    pure   : a -> f a
    fail   : f a
    pToken : f Char

Is it possible to define a parser that applies the longest-match rule using these constructs only? Or is it necessary to extend the set with another primitive, for instance greedy choice <<|> ? (Note that f is abstract and is not necessarily a uu-parsinglib parser.)
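
To make the difference between symmetric and greedy choice concrete, here is a minimal sketch using ReadP from base as a stand-in for the abstract f. This is an assumption for illustration only: ReadP's <|> happens to be symmetric, its left-biased <++ plays the role of <<|>, and satisfy is used as an extra primitive in place of pToken. The names manySym, manyGreedy, identSym and identGreedy are hypothetical.

```haskell
import Control.Applicative ((<|>))
import Data.Char (isAlpha)
import Text.ParserCombinators.ReadP (ReadP, (<++), readP_to_S, satisfy)

-- Zero or more repetitions built only from <$>, <*>, pure and the
-- symmetric <|>. Both branches are always explored.
manySym :: ReadP a -> ReadP [a]
manySym p = pure [] <|> ((:) <$> p <*> manySym p)

-- Same shape, but with ReadP's left-biased choice (<++): the "stop here"
-- branch is only tried when "one more item" fails.
manyGreedy :: ReadP a -> ReadP [a]
manyGreedy p = ((:) <$> p <*> manyGreedy p) <++ pure []

identSym, identGreedy :: ReadP String
identSym    = manySym (satisfy isAlpha)
identGreedy = manyGreedy (satisfy isAlpha)
```

With this implementation, readP_to_S identSym "ab1" returns one result per prefix ("", "a" and "ab"), while readP_to_S identGreedy "ab1" returns only ("ab","1"). This shows the behaviour for one concrete f; it does not settle the question for an abstract one.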
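
You can parse HTML with no ambiguous results if you allow monadic bind (>>=) as well:

    pTag = pElement <|> pCommentTag <|> pContent

    pElement = do elemName <- pOpenTag
                  elemContent <- pTag `manyTill` endElement elemName
                  -- Result line added so the do block is complete;
                  -- the Element constructor is an assumed name.
                  return (Element elemName elemContent)

    -- Either the matching close tag, or (without consuming it) some other
    -- close tag, which implicitly ends the current element.
    endElement elemName = string "</" *> string elemName *> string ">"
                          <|> lookahead (string "</" *> some (satisfy (/= '>')) *> string ">")

    pContent = Content <$> some (satisfy (/= '<'))

    pHtml = some pTag

Mind you, this code would not give you exactly the same parse tree as an HTML 5 browser would. That spec is a nightmare.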
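
For anyone who wants to run the idea, here is a self-contained sketch of the same approach on top of ReadP from base. It is an adaptation under stated assumptions, not the code above verbatim: the Node type, pOpenTag (no attributes), pCommentTag, lookAheadP and parseHtml are hypothetical fill-ins, and because ReadP's <|> is symmetric, the sketch uses the left-biased <++ in endElement and munch1 in place of some (satisfy ...) to keep the result unique.

```haskell
import Control.Applicative ((<|>))
import Data.Char (isAlphaNum)
import Text.ParserCombinators.ReadP
  ( ReadP, (<++), char, eof, get, look, many1, manyTill
  , munch1, pfail, readP_to_S, string )

-- Hypothetical parse tree; only Content appears in the original post.
data Node = Element String [Node] | Comment String | Content String
  deriving (Eq, Show)

pTag :: ReadP Node
pTag = pElement <|> pCommentTag <|> pContent

-- Monadic bind: the end-tag parser depends on the name just parsed.
pElement :: ReadP Node
pElement = do
  name     <- pOpenTag
  children <- pTag `manyTill` endElement name
  pure (Element name children)

-- Assumed helper: an open tag without attributes, e.g. "<h1>".
pOpenTag :: ReadP String
pOpenTag = char '<' *> munch1 isAlphaNum <* char '>'

-- Assumed helper: "<!-- ... -->".
pCommentTag :: ReadP Node
pCommentTag = Comment <$> (string "<!--" *> manyTill get (string "-->"))

-- Consume the matching close tag, or else just peek at a foreign close
-- tag, which implicitly ends the current element.
endElement :: String -> ReadP String
endElement name =
  (string "</" *> string name *> string ">")
    <++ lookAheadP (string "</" *> munch1 (/= '>') *> string ">")

-- Run a parser on the remaining input without consuming any of it.
lookAheadP :: ReadP a -> ReadP a
lookAheadP p = do
  rest <- look
  case readP_to_S p rest of
    ((a, _) : _) -> pure a
    []           -> pfail

pContent :: ReadP Node
pContent = Content <$> munch1 (/= '<')

-- Succeeds only when the whole input parses in exactly one way.
parseHtml :: String -> Maybe [Node]
parseHtml s = case readP_to_S (many1 pTag <* eof) s of
  [(nodes, "")] -> Just nodes
  _             -> Nothing
```

On the earlier example, parseHtml "<h1> Line break tags are <br> not closed </h1>" gives Just [Element "h1" [Content " Line break tags are ", Element "br" [Content " not closed "]]]. The br element swallows the following text until the h1 close tag, which is one way in which the tree differs from what a browser builds.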