
This is an early attempt at a parser for text that is XML-like but not actually XML. This is probably a disaster by Haskell standards... If someone could point me toward a better way of doing things, that would be great. I don't want to use an existing parser library, at least not at first, because I want to learn from first principles (for now).

The input looks something like:

    <entry>
      <field1> thingy </field1>
      <field2> other thingy </field2>
    </entry>
    <entry>
      ...
    </entry>
    ...

where there are any number of entries. Each entry consists of a varying number of named fields. For now, the named fields can be anything: any names, in any order. Later I'll add sanity checking to ensure the right fields are present, to provide default values, etc.

This uses Data.ByteString.Char8 for efficiency when processing large files. The output types are as follows (shown with the imports the code needs):

    import qualified Data.ByteString.Char8 as B
    import Text.Regex.Posix ( (=~) )  -- assumed backend; any regex-base package
                                      -- that provides (=~) for ByteString should do

    type Bs = B.ByteString        -- alias
    type Component = ( Bs, Bs )   -- one named field
    type Entry = [ Component ]    -- all named fields in one entry
    type Doc = [ Entry ]          -- all entries in the input document

The basic strategy is to create parsing functions, each of which takes a string (actually a ByteString) and returns an object, the remainder of the string, and an index. (The index is the position of the first character of the remainder within the original input, which is useful for giving error messages.)

The top-level function is called parseReqs ("parse requirements"; this is actually going to be used for a software requirements management project).
Here's the rest of the code:

    -- types for regular expression matching
    type Re3 = ( Bs, Bs, Bs )
    type Re4 = ( Bs, Bs, Bs, [ Bs ] )

    parseReqs :: Bs -> ( Doc, Bs, Int )
    parseReqs buf = parseReqs' buf 0

    parseReqs' :: Bs -> Int -> ( Doc, Bs, Int )
    parseReqs' buf idx
      | B.null buf = ( [], buf, idx )
      | otherwise =
          case parseEntry buf idx of
            ( Just e, rem, remIdx ) ->
              let ( doc, rem', remIdx' ) = parseReqs' rem remIdx
              in ( e : doc, rem', remIdx' )
            ( Nothing, rem, remIdx ) -> ( [], rem, remIdx )

    parseEntry :: Bs -> Int -> ( Maybe Entry, Bs, Int )
    parseEntry buf idx =
      let ( before, match, after ) = buf =~ "<entry>" :: Re3
          idx' = idx + B.length before + B.length match
      in if B.null match
           then ( Nothing, after, idx' )
           else let ( e, after', idx'' ) = parseEntryBody after idx'
                in ( Just e, after', idx'' )

    parseEntryBody :: Bs -> Int -> ( Entry, Bs, Int )
    parseEntryBody buf idx =
      let ( before, match, after ) = buf =~ "</entry>" :: Re3
          idx' = idx + B.length before + B.length match
      in if B.null match
           then error "Missing </entry>"
           else
             -- Note: the index passed to parseEntryComponents is the same as
             -- the one passed into this function, because we pass 'before' to
             -- parseEntryComponents. The index returned from this function is
             -- the one calculated above, which falls at the start of 'after'.
             ( parseEntryComponents before idx, after, idx' )

    parseEntryComponents :: Bs -> Int -> Entry
    parseEntryComponents buf idx =
      let ( before, match, after, groups ) = buf =~ B.pack "<([^>]+)>" :: Re4
          idx' = idx + B.length before + B.length match
      in if B.null match
           then []
           else let ( component, buf', idx'' ) = parseCompBody (head groups) after idx'
                    components = parseEntryComponents buf' idx''
                in component : components

    parseCompBody :: Bs -> Bs -> Int -> ( Component, Bs, Int )
    parseCompBody compName buf idx =
      -- Build the closing tag "</name>" and search for it.
      let ( before, match, after ) =
            buf =~ (B.pack "</" `mappend` compName `mappend` B.pack ">") :: Re3
          idx' = idx + B.length before + B.length match
      in if B.null match
           then error ("No ending to component " ++ B.unpack compName)
           else ( ( compName, before ), after, idx' )
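
To make the "object, remainder, index" bookkeeping concrete, here is a minimal sketch of a single parsing step written without regular expressions. It is an illustration under stated assumptions, not a replacement for the code above: `splitAtTag` and `parseField` are hypothetical names, the tag is matched as a literal string with `B.breakSubstring`, and a missing tag is reported as `Nothing` rather than through `error`.

```haskell
import qualified Data.ByteString.Char8 as B

type Bs = B.ByteString
type Component = ( Bs, Bs )

-- Hypothetical helper (not part of the code above): split 'buf' at the
-- first occurrence of the literal, non-empty 'tag'. Returns the text
-- before the tag, the remainder after it, and the index of the
-- remainder's first character. Nothing means the tag was not found.
splitAtTag :: Bs -> Bs -> Int -> Maybe ( Bs, Bs, Int )
splitAtTag tag buf idx =
  let ( before, rest ) = B.breakSubstring tag buf
  in if B.null rest
       then Nothing
       else Just ( before
                 , B.drop (B.length tag) rest
                 , idx + B.length before + B.length tag )

-- Hypothetical: read one named field, "<name> body </name>", returning
-- the component, the unconsumed input, and the updated index.
parseField :: Bs -> Bs -> Int -> Maybe ( Component, Bs, Int )
parseField name buf idx = do
  let openTag  = B.concat [ B.pack "<",  name, B.pack ">" ]
      closeTag = B.concat [ B.pack "</", name, B.pack ">" ]
  ( _,    afterOpen,  idx'  ) <- splitAtTag openTag  buf       idx
  ( body, afterClose, idx'' ) <- splitAtTag closeTag afterOpen idx'
  return ( ( name, body ), afterClose, idx'' )
```

For example, `parseField (B.pack "a") (B.pack "<a> x </a>tail") 0` evaluates to `Just (("a"," x "),"tail",10)`: the body keeps its surrounding whitespace, and the index 10 is the position of the `t` in the original input. Because each step returns `Maybe`, the steps chain in the `Maybe` monad and a missing closing tag simply yields `Nothing`.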