On Mon, Mar 4, 2013 at 1:44 AM, Immanuel Normann <immanuel.normann@googlemail.com> wrote:
I am trying to parse a semi structured text with parsec that basically should identify sections. Each section starts with a headline and has an unstructured content - that's all.Here's my attempt: https://gist.github.com/carlohamalainen/5087207{-# LANGUAGE FlexibleContexts #-}import Text.Parsecimport Control.Applicative hiding ((<|>),many)-- Example input:{-top 1:some text ... bla
top 2:more text ... bla bla
-}data Top = Top String deriving (Show)data Content = Content [String] deriving (Show)data Section = Section Top Content deriving (Show)
headline = dot <- many1 (noneOf ":\n")char ':'newlinereturn $ Top t
contentLine = dox <- many (noneOf ":\n")newlinereturn xcontent = doline <- optionMaybe (try contentLine)case line of Just x -> do xs <- contentreturn (x:xs)_ -> return []section = doh <- headlinec <- Content <$> contentreturn $ Section h cmain = dox <- readFile "simple.txt"print $ parse (many section) "" xExample run using your sample data:$ runhaskell Simple.hsRight [Section (Top "top 1") (Content ["","some text ... bla",""]),Section (Top "top 2") (Content ["","more text ... bla bla",""])]Notes:* I had to assume that a content line does not contain a ':', because that is the only way to distinguish a head-line (correct me if I'm wrong).* The key was to use optionMaybe along with try; see the definition of content.* I haven't tested this code on very large inputs.* I slightly changed the definition of Content to have a list of Strings, one for each line. I'm sure this could be altered if you wanted to retain all whitespace.* I am still new to Parsec, so don't take this as the definitive answer ;-)--
Carlo Hamalainen
http://carlo-hamalainen.net