Need help with learning Parsec

Dear gentle Haskellers,

I was trying to whet my Haskell by trying out Parsec today to parse some XML. Here's the code I came up with. I wanted some help with the "gettext" parser that I've written: I had to put a dummy "char ' '" in there just to satisfy the "many" used in the xml parser. I'd appreciate it very much if someone could give me some feedback.

    import Text.ParserCombinators.Parsec

    data XML = Node String [XML] | Body String deriving Show

    gettext = do
        x <- many (letter <|> digit)
        if (length x) > 0
            then return (Body x)
            else (char ' ' >> (return $ Body ""))

    xml :: Parser XML
    xml = do { name <- openTag
             ; innerXML <- many innerXML
             ; endTag name
             ; return (Node name innerXML)
             }

    innerXML = do
        x <- (try xml <|> gettext)
        return x

    openTag :: Parser String
    openTag = do
        char '<'
        content <- many (noneOf ">")
        char '>'
        return content

    endTag :: String -> Parser String
    endTag str = do
        char '<'
        char '/'
        string str
        char '>'
        return str

    h1 = parse xml "" "<a>A</a>"
    h2 = parse xml "" "<a><b>A</b></a>"
    h3 = parse xml "" "<a><b><c></c></b></a>"
    h4 = parse xml "" "<a><b></b><c></c></a>"

Regards,
Kashyap

On 19.07.2012 14:53, C K Kashyap wrote:
Dear gentle Haskellers,
I was trying to whet my Haskell by trying out Parsec today to parse some XML. Here's the code I came up with.
I wanted some help with the "gettext" parser that I've written. I had to put a dummy "char ' '" in there just to satisfy the "many" used in the xml parser. I'd appreciate it very much if someone could give me some feedback.

You don't want empty bodies! So use many1 in gettext.

    gettext = fmap Body $ many1 $ letter <|> digit

If you have spaces in your bodies, skip them or allow them with noneOf "<".

HTH Christian
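The many1 suggestion can be tried in isolation. What follows is a minimal sketch, assuming the parsec package with the Text.Parsec and Text.Parsec.String modules; the names gettextAny, demo1 and demo2 are illustrative and do not come from the thread. Parsec's many refuses, with a runtime error, a parser that succeeds without consuming input, which is why a gettext that can succeed on nothing does not fit under "many innerXML". With many1, gettext simply fails when no body character is present and the enclosing many stops.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data XML = Node String [XML] | Body String deriving (Show, Eq)

-- many1 fails, without consuming input, when no body character is present,
-- so an enclosing `many` stops cleanly in front of the next '<'.
gettext :: Parser XML
gettext = fmap Body $ many1 $ letter <|> digit

-- Illustrative variant: accept spaces and punctuation too, up to the next '<'.
gettextAny :: Parser XML
gettextAny = Body <$> many1 (noneOf "<")

demo1, demo2 :: Either ParseError [XML]
demo1 = parse (many gettext) "" "abc123"             -- Right [Body "abc123"]
demo2 = parse (many gettextAny) "" "hello world<b>"  -- Right [Body "hello world"]
```

Note that parse does not require the whole input to be consumed, so demo2 stops in front of "<b>" without an error.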
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

gettext = (many1 $ noneOf "><") >>= (return . Body)
works for your case.
-- I drink I am thunk.

On Thu, Jul 19, 2012 at 06:45:05PM +0530, Sai Hemanth K wrote:
gettext = (many1 $ noneOf "><") >>= (return . Body)
You can simplify this to:

    import Control.Applicative hiding ((<|>), many)

    gettext = Body <$> many1 (noneOf "><")

And some of your other parsers can be simplified as well:

    innerXML = xml <|> gettext

    openTag :: Parser String
    openTag = char '<' *> many (noneOf ">") <* char '>'

    endTag :: String -> Parser String
    endTag str = string "</" *> string str <* char '>'

On Thu, Jul 19, 2012 at 03:34:47PM +0200, Simon Hengel wrote:
    openTag :: Parser String
    openTag = char '<' *> many (noneOf ">") <* char '>'

    endTag :: String -> Parser String
    endTag str = string "</" *> string str <* char '>'

Well yes, modified to what Christian Maeder just suggested.

Cheers, Simon

On 19.07.2012 15:41, Simon Hengel wrote:
On Thu, Jul 19, 2012 at 03:34:47PM +0200, Simon Hengel wrote:
    openTag :: Parser String
    openTag = char '<' *> many (noneOf ">") <* char '>'

If you disallow empty tags and "/" within tags, then you can avoid the notFollowedBy construct by:

    openTag = try (char '<' *> many1 (noneOf "/>")) <* char '>'

C.
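Putting the suggestions so far together gives a small parser that handles the four test inputs from the first message. This is a sketch under assumptions, not the code behind the truncated links: it assumes the parsec package, it reads the closing-tag literal in endTag as "</" (matching the original char '<' followed by char '/'), and the names children and examples are illustrative.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

data XML = Node String [XML] | Body String deriving (Show, Eq)

-- Body text: at least one character that is neither '<' nor '>'.
gettext :: Parser XML
gettext = Body <$> many1 (noneOf "><")

-- many1 (noneOf "/>") rejects "</", and try rewinds the '<' already read,
-- so `many innerXML` can stop in front of a closing tag.
openTag :: Parser String
openTag = try (char '<' *> many1 (noneOf "/>")) <* char '>'

endTag :: String -> Parser String
endTag str = string "</" *> string str <* char '>'

-- No top-level try needed: openTag fails without consuming on anything
-- that is not an opening tag.
innerXML :: Parser XML
innerXML = xml <|> gettext

xml :: Parser XML
xml = do
  name <- openTag
  children <- many innerXML
  _ <- endTag name
  return (Node name children)

-- The four inputs from the original post.
examples :: [Either ParseError XML]
examples = map (parse xml "")
  [ "<a>A</a>"
  , "<a><b>A</b></a>"
  , "<a><b><c></c></b></a>"
  , "<a><b></b><c></c></a>"
  ]
```

The monadic xml is kept because the name returned by openTag is needed to build the endTag parser, which is the one place where the Applicative operators alone are not enough.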

Thank you so much ... I've updated my monad version here -
https://github.com/ckkashyap/LearningPrograms/blob/master/Haskell/Parsing/xm...
and the Applicative version here -
https://github.com/ckkashyap/LearningPrograms/blob/master/Haskell/Parsing/xm...
The applicative version however does not seem to work.
Is there a good tutorial that I can look up for Parsec - I am checking out
http://legacy.cs.uu.nl/daan/download/parsec/parsec.html but I am looking
for a tutorial where a complex parser would be built ground up.
Next I'd like to take care of escaped angular brackets.
Regards,
Kashyap

I've updated the parser here -
https://github.com/ckkashyap/LearningPrograms/blob/master/Haskell/Parsing/xm...
The whole thing is less than 100 lines and it can handle comments as well.
I have an outstanding question - What's the second parameter of the parse
function really for?
Regards,
Kashyap

On Sun, Jul 22, 2012 at 11:00 AM, C K Kashyap wrote:
What's the function to access it?

The function 'runParser' returns either a result or a ParseError. You can extract the error position with the 'errorPos' function, and then you can extract the name of the file from the position with 'sourceName'.

The 'Show' instance of ParseError does this.

Antoine
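A small sketch of how the source name travels from the call into the error, assuming the parsec package. It uses 'parse' rather than 'runParser' (both return Either ParseError a); the parser number, the file name "input.txt" and the names result and failedSource are made up for illustration.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A tiny parser used only to provoke an error: one or more digits.
number :: Parser String
number = many1 digit

-- The second argument of `parse` is a source name. It is not opened or read;
-- it is only recorded in the positions reported by errors.
result :: Either ParseError String
result = parse number "input.txt" "abc"

-- Recover the name from the error position, as described above.
failedSource :: Maybe SourceName
failedSource = either (Just . sourceName . errorPos) (const Nothing) result
-- Just "input.txt"
```

Printing the Left value with show (or print) includes the same name in front of the line and column, which is why passing "" is common in quick experiments.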

Thanks a lot Antoine and Simon.
Regards,
Kashyap

On 22.07.2012 17:21, C K Kashyap wrote:
I've updated the parser here - https://github.com/ckkashyap/LearningPrograms/blob/master/Haskell/Parsing/xm...
The whole thing is less than 100 lines and it can handle comments as well.

This code is still not nice: Duplicate code in openTag and withoutExplictCloseTag.

The "toplevel-try" in

    try withoutExplictCloseTag <|> withExplicitCloseTag

should be avoided by factoring out the common prefix.

Again, I would avoid notFollowedBy by using many1.

    tag <- try (char '<' >> many1 (letter <|> digit))

In quotedChar you do not only want to escape the quote but at least the backslash, too. You could allow escaping any character with a backslash using:

    quotedChar c = try (char '\\' >> anyChar) <|> noneOf [c, '\\']

Writing a separate parser stripLeadingSpaces is overkill. Just use "spaces >> parseXML" (or apply "dropWhile isSpace" to the input string).

C.
[...]
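The quotedChar and "spaces >>" suggestions can be exercised without the rest of the parser. A minimal sketch, assuming the parsec package; quoted and quotedAfterSpaces are hypothetical helpers that stand in for whatever the linked file uses for attribute values and for parseXML.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- A backslash escapes whatever character follows it; otherwise accept
-- anything except the closing quote c and a bare backslash.
quotedChar :: Char -> Parser Char
quotedChar c = try (char '\\' >> anyChar) <|> noneOf [c, '\\']

-- Hypothetical helper: a string delimited on both sides by c.
quoted :: Char -> Parser String
quoted c = between (char c) (char c) (many (quotedChar c))

-- Leading whitespace is skipped inline with `spaces`, no separate parser.
quotedAfterSpaces :: Parser String
quotedAfterSpaces = spaces >> quoted '"'

-- Input text:   "a\"b\\c" rest
-- Parsed value: a"b\c
demo :: Either ParseError String
demo = parse quotedAfterSpaces "" "  \"a\\\"b\\\\c\" rest"
```

Because the backslash itself is excluded by noneOf, a backslash can only appear in the result through the escape branch, which keeps the two alternatives from overlapping.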

Thank you so much Christian for your feedback ... I shall incorporate them.
Regards,
Kashyap
On Mon, Jul 23, 2012 at 3:17 PM, Christian Maeder wrote:
On 22.07.2012 17:21, C K Kashyap wrote:
I've updated the parser here -
https://github.com/ckkashyap/LearningPrograms/blob/master/Haskell/Parsing/xml_3.hs
[...]

On 19.07.2012 15:14, Christian Maeder wrote:
On 19.07.2012 14:53, C K Kashyap wrote:
    innerXML = do
        x <- (try xml <|> gettext)
        return x

Omit "try" (and return). xml always starts with "<" whereas gettext never does.
C.

I was wrong, you do not want to swallow an endTag as openTag.

openTag should start with:

    try $ char '<' >> notFollowedBy (char '/')

and endTag should start with:

    try $ string "</"

C.

On 19.07.2012 15:26, Christian Maeder wrote:
openTag should start with:

    try $ char '<' >> notFollowedBy (char '/')

and endTag should start with:

    try $ string "</"

Strictly, the try in endTag is not necessary (only in openTag).

C.
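For comparison with the many1 variant, here is a sketch of the notFollowedBy form of openTag, assuming the parsec package. The "</" literal in endTag is an assumption based on the original endTag (char '<' then char '/'), and the try on endTag is left out, as the last remark allows.

```haskell
import Text.Parsec
import Text.Parsec.String (Parser)

-- try covers both the '<' and the lookahead, so on "</" the parser fails
-- without consuming anything and an alternative can still be attempted.
openTag :: Parser String
openTag = do
  try (char '<' >> notFollowedBy (char '/'))
  name <- many (noneOf ">")
  _ <- char '>'
  return name

-- No try here: by the time endTag runs, a closing tag is the only option left.
endTag :: String -> Parser String
endTag str = string "</" *> string str <* char '>'

demoOpen, demoFallback :: Either ParseError String
demoOpen     = parse openTag "" "<a>"                    -- Right "a"
demoFallback = parse (openTag <|> endTag "x") "" "</x>"  -- Right "x"
```

demoFallback only succeeds because openTag gives the '<' back; without the try, the '<' would already be consumed and the endTag alternative would never be attempted.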
participants (5)
- Antoine Latter
- C K Kashyap
- Christian Maeder
- Sai Hemanth K
- Simon Hengel