2013/3/5 Carlo Hamalainen <carlo.hamalainen@gmail.com>

On Mon, Mar 4, 2013 at 1:44 AM, Immanuel Normann <immanuel.normann@googlemail.com> wrote:

I am trying to parse a semi structured text with parsec that basically should identify sections. Each section starts with a headline and has an unstructured content - that's all.

Here's my attempt: https://gist.github.com/carlohamalainen/5087207

{-# LANGUAGE FlexibleContexts #-}

import Text.Parsec
import Control.Applicative hiding ((<|>),many)

-- Example input:

{-

top 1:

some text ... bla

top 2:

more text ... bla bla

-}

data Top = Top String deriving (Show)

data Content = Content [String] deriving (Show)
data Section = Section Top Content deriving (Show)

headline = do
t <- many1 (noneOf ":\n")
char ':'

newline

return $ Top t

contentLine = do
x <- many (noneOf ":\n")
newline

return x

content = do
line <- optionMaybe (try contentLine)

case line of Just x -> do xs <- content
return (x:xs)

_ -> return []

section = do

h <- headline
c <- Content <$> content
return $ Section h c

main = do
x <- readFile "simple.txt"
print $ parse (many section) "" x

Example run using your sample data:

$ runhaskell Simple.hs

Right [Section (Top "top 1") (Content ["","some text ... bla",""]),Section (Top "top 2") (Content ["","more text ... bla bla",""])]

Notes:

* I had to assume that a content line does not contain a ':', because that is the only way to distinguish a head-line (correct me if I'm wrong).

* The key was to use optionMaybe along with try; see the definition of content.

* I haven't tested this code on very large inputs.

* I slightly changed the definition of Content to have a list of Strings, one for each line. I'm sure this could be altered if you wanted to retain all whitespace.

* I am still new to Parsec, so don't take this as the definitive answer ;-)

--
Carlo Hamalainen
http://carlo-hamalainen.net