
Andrey,
Thanks a lot for your effort! I have the same suspect that the lookahead in
the content parser is the problem, but I don't know how to solve it either.
At least the I learned from your code that noneOf is also a quite useful
parser in this context which I have overlooked.
Anyway, if you find a solution it would be great! In the end the task
itself doesn't look very specific, but rather general: an alternation
between strictly (the headline in my case) and loosely (the content in my
case) structured text. It shouldn't be difficult to build a parser for such
a setting.
(btw. I am aware the my test parser would (or rather should) parse only the
first section. For testing this would be sufficient.)
2013/3/4 Andrey Chudnov
Immanuel, I tried but I couldn't figure it out. Here's a gist with my attempts and results so far: https://gist.github.com/achudnov/f3af65f11d5162c73064There, 'test' uses my attempt at specifying the parser, 'test2' uses yours. Note that your attempt wouldn't parse multiple sections -- for that you need to use 'many section' instead of just 'section' in 'parse' ('parseFromFile' in the original). I think what's going on is the lookahead is wrong, but I'm not sure how exactly. I'll give it another go tomorrow if I have time.
/Andrey
On 03/03/2013 05:16 PM, Immanuel Normann wrote:
Andrey,
Thanks for your attempt, but it doesn't seem to work. The easy part is the headline, but the content makes trouble.
Let me write the code a bit more explicit, so you can copy and paste it:
------------------------------------------ {-# LANGUAGE FlexibleContexts #-}
module Main where
import Text.Parsec
data Top = Top String deriving (Show) data Content = Content String deriving (Show) data Section = Section Top Content deriving (Show)
headline :: Stream s m Char => ParsecT s u m Top headline = manyTill anyChar (char ':' >> newline) >>= return . Top
content :: Stream s m Char => ParsecT s u m Content content = manyTill anyChar (try headline) >>= return . Content
section :: Stream s m Char => ParsecT s u m Section section = do {h <- headline; c <- content; return (Section h c)} ------------------------------------------
Assume the following example text is stored in "/tmp/test.txt": --------------------------- top 1:
some text ... bla
top 2:
more text ... bla bla ---------------------------
Now I run the section parser in ghci against the above mentioned example text stored in "/tmp/test.txt":
*Main> parseFromFile section "/tmp/test.txt" Right (Section (Top "top 1") (Content ""))
I don't understand the behaviour of the content parser here. Why does it return ""? Or perhaps more generally, I don't understand the manyTill combinator (though I read the docs).
Side remark: of cause for this little task it is probably to much effort to use parsec. However, my content in fact has an internal structure which I would like to parse further, but I deliberately abstracted from these internals as they don't effect my above stated problem.
Immanuel
2013/3/3 Andrey Chudnov
Immanuel, Since a heading always starts with a new line (and ends with a colon followed by a carriage return or just a colon?), I think it might be useful to first separate the input into lines and then classify them depending on whether it's a heading or not and reassemble them into the value you need. You don't even need parsec for that.
However, if you really want to use parsec, you can write something like (warning, not tested): many $ liftM2 Section headline content where headline = anyChar `manyTill` (char ':' >> spaces >> newline) content = anyChar `manyTill` (try $ newline >> headline)
/Andrey
On 3/3/2013 10:44 AM, Immanuel Normann wrote:
I am trying to parse a semi structured text with parsec that basically should identify sections. Each section starts with a headline and has an unstructured content - that's all. For instance, consider the following example text (inside the dashed lines):
---------------------------
top 1:
some text ... bla
top 2:
more text ... bla bla
---------------------------
This should be parsed into a structure like this:
[Section (Top 1) (Content "some text ... bla"), Section (Top 1) (Content "more text ... bla")]
Say, I have a parser "headline", but the content after a headline could be anything that is different from what "headline" parses. How could the "section" parser making use of "headline" look like? My idea would be to use the "manyTill" combinator, but I don"t find an easy solution.