On Mon, Mar 4, 2013 at 1:44 AM, Immanuel Normann <immanuel.normann@googlemail.com> wrote:
I am trying to parse a semi structured text with parsec that basically should identify sections. Each section starts with a headline and has an unstructured content - that's all.

Here's my attempt: https://gist.github.com/carlohamalainen/5087207

{-# LANGUAGE FlexibleContexts #-}

import Text.Parsec
import Control.Applicative hiding ((<|>),many)

-- Example input:

{-
top 1:

some text ... bla

top 2:

more text ... bla bla

-}

data Top = Top String deriving (Show)
data Content = Content [String] deriving (Show)
data Section = Section Top Content deriving (Show)

headline = do
    t <- many1 (noneOf ":\n")
    char ':'
    newline

    return $ Top t

contentLine = do
    x <- many (noneOf ":\n")
    newline
    return x

content = do
    line <- optionMaybe (try contentLine)

    case line of Just x -> do xs <- content
                              return (x:xs)
                 _      -> return []

section = do
    h <- headline
    c <- Content <$> content
    return $ Section h c

main = do
    x <- readFile "simple.txt"
    print $ parse (many section) "" x


Example run using your sample data:

$ runhaskell Simple.hs 
Right [Section (Top "top 1") (Content ["","some text ... bla",""]),Section (Top "top 2") (Content ["","more text ... bla bla",""])]

Notes:

* I had to assume that a content line does not contain a ':', because that is the only way to distinguish a head-line (correct me if I'm wrong).

* The key was to use optionMaybe along with try; see the definition of content.

* I haven't tested this code on very large inputs.

* I slightly changed the definition of Content to have a list of Strings, one for each line. I'm sure this could be altered if you wanted to retain all whitespace.

* I am still new to Parsec, so don't take this as the definitive answer ;-)

--
Carlo Hamalainen
http://carlo-hamalainen.net