
Carlo,
Thanks a lot! This looks very promising (though I have to test it for my
purpose more in depth). As you mention, the key seems to be the optionMaybe
combinator. Thanks for pointing to it.
Immanuel
2013/3/5 Carlo Hamalainen
On Mon, Mar 4, 2013 at 1:44 AM, Immanuel Normann < immanuel.normann@googlemail.com> wrote:
I am trying to parse a semi structured text with parsec that basically should identify sections. Each section starts with a headline and has an unstructured content - that's all.
Here's my attempt: https://gist.github.com/carlohamalainen/5087207
{-# LANGUAGE FlexibleContexts #-}
import Text.Parsec import Control.Applicative hiding ((<|>),many)
-- Example input:
{- top 1:
some text ... bla
top 2:
more text ... bla bla
-}
data Top = Top String deriving (Show) data Content = Content [String] deriving (Show) data Section = Section Top Content deriving (Show)
headline = do t <- many1 (noneOf ":\n") char ':' newline
return $ Top t
contentLine = do x <- many (noneOf ":\n") newline return x
content = do line <- optionMaybe (try contentLine)
case line of Just x -> do xs <- content return (x:xs) _ -> return []
section = do h <- headline c <- Content <$> content return $ Section h c
main = do x <- readFile "simple.txt" print $ parse (many section) "" x
Example run using your sample data:
$ runhaskell Simple.hs Right [Section (Top "top 1") (Content ["","some text ... bla",""]),Section (Top "top 2") (Content ["","more text ... bla bla",""])]
Notes:
* I had to assume that a content line does not contain a ':', because that is the only way to distinguish a head-line (correct me if I'm wrong).
* The key was to use optionMaybe along with try; see the definition of content.
* I haven't tested this code on very large inputs.
* I slightly changed the definition of Content to have a list of Strings, one for each line. I'm sure this could be altered if you wanted to retain all whitespace.
* I am still new to Parsec, so don't take this as the definitive answer ;-)
-- Carlo Hamalainen http://carlo-hamalainen.net