I am trying to parse a semi structured text with parsec that basically should identify sections. Each section starts with a headline and has an unstructured content - that's all.

Here's my attempt: https://gist.github.com/carlohamalainen/5087207

{-# LANGUAGE FlexibleContexts #-}

import Text.Parsec

import Control.Applicative hiding ((<|>),many)

-- Example input:

top 1:

some text ... bla

top 2:

more text ... bla bla

data Top = Top String deriving (Show)

data Content = Content [String] deriving (Show)

data Section = Section Top Content deriving (Show)

headline = do

t <- many1 (noneOf ":\n")

char ':'

newline

return $ Top t

contentLine = do

x <- many (noneOf ":\n")

newline

return x

content = do

line <- optionMaybe (try contentLine)

case line of Just x -> do xs <- content

return (x:xs)

_ -> return []

section = do

h <- headline

c <- Content <$> content

return $ Section h c

main = do

x <- readFile "simple.txt"

print $ parse (many section) "" x

Example run using your sample data:

$ runhaskell Simple.hs

Right [Section (Top "top 1") (Content ["","some text ... bla",""]),Section (Top "top 2") (Content ["","more text ... bla bla",""])]

Notes:

* I had to assume that a content line does not contain a ':', because that is the only way to distinguish a head-line (correct me if I'm wrong).

* The key was to use optionMaybe along with try; see the definition of content.

* I haven't tested this code on very large inputs.

* I slightly changed the definition of Content to have a list of Strings, one for each line. I'm sure this could be altered if you wanted to retain all whitespace.

* I am still new to Parsec, so don't take this as the definitive answer ;-)