a better way to do this

This is an early attempt to create some kind of parser, for text that is xml-like but not actually xml. This is probably a disaster by Haskell standards... If someone could point me in the direction of a better way of doing things, that would be great. I don't want to use the existing parser library, not at first, because I want to learn more from first principles (for now). The input looks something like: <entry> <field1> thingy </field1> <field2> other thingy </field2> </entry> <entry> ... </entry> ... where there are any number of entries. Each entry consists of a varied number of named fields. For now, the named fields can be anything--any names, in any order. Later I'll do sanity checking to ensure the right fields are there, to provide default values, etc. This uses Data.Bytestring.Char8 for efficiency processing large files. The output types are as follows: type Bs = B.ByteString -- alias type Component = ( Bs, Bs ) -- one named field type Entry = [ Component ] -- all named fields in one entry type Doc = [ Entry ] -- all entries in the input document The basic strategy is to create parsing functions, which take in a string (actually ByteString), and return an object, the remainder of the string, and an index. (The index indicates the position of the first character in the remainder of the string, which is useful for giving error messages.) Top level function is called parseReqs ( "parse requirements" -- this is actually going to be used for a software requirements management project). Here's the rest of the code: -- types for regular expression matching type Re3 = ( Bs, Bs, Bs ) type Re4 = ( Bs, Bs, Bs, [ Bs ] ) parseReqs :: Bs -> ( Doc, Bs, Int ) parseReqs buf = parseReqs' buf 0 parseReqs' :: Bs -> Int -> ( Doc, Bs, Int ) parseReqs' buf idx | B.null buf = ( [], buf, idx ) | otherwise = case parseEntry buf idx of (Just e, rem, remIdx) -> let ( doc, rem', remIdx' ) = parseReqs' rem remIdx in ( e : doc, rem', remIdx' ) (Nothing, rem, remIdx ) -> ( [], rem, remIdx ) parseEntry :: Bs -> Int -> ( Maybe Entry, Bs, Int ) parseEntry buf idx = let ( before, match, after ) = buf =~ "<entry>" :: Re3 idx' = idx + B.length before + B.length match in if B.null match then ( Nothing, after, idx' ) else let ( e, after', idx'' ) = parseEntryBody after idx' in ( Just e, after', idx'' ) parseEntryBody :: Bs -> Int -> ( Entry, Bs, Int ) parseEntryBody buf idx = let ( before, match, after ) = buf =~ "</entry>" :: Re3 idx' = idx + B.length before + B.length match in if B.null match then error "Missing </entry>" else -- Note: index passed to parseEntryComponents is same as one passed -- into this function, because we pass 'before' to -- parseEntryComponents. Index returned from from this function is -- the one calculated above to occur at the start of 'after' ( parseEntryComponents before idx, after, idx' ) parseEntryComponents :: Bs -> Int -> Entry parseEntryComponents buf idx = let ( before, match, after, groups ) = buf =~ B.pack "<([^>]+)>" :: Re4 idx' = idx + B.length before + B.length match in if B.null match then [] else let ( component, buf', idx'' ) = parseCompBody (head groups) after idx' components = parseEntryComponents buf' idx'' in component : components parseCompBody :: Bs -> Bs -> Int -> ( Component, Bs, Int ) parseCompBody compName buf idx = let ( before, match, after ) = buf =~ (B.pack "" `mappend` compName `mappend` B.pack ">") :: Re3 idx' = idx + B.length before + B.length match in if B.null match then error ("No ending to component " ++ B.unpack compName) else ( ( compName, before ), after, idx' )

Michael P Mossey wrote:
This is an early attempt to create some kind of parser, for text that is xml-like but not actually xml. This is probably a disaster by Haskell standards... If someone could point me in the direction of a better way of doing things, that would be great. I don't want to use the existing parser library, not at first, because I want to learn more from first principles (for now).
To follow up my own post, I'm working through chapter 10 of "Real World Haskell" right now, and then I'll look at chapter 14 (Monads). I think this is a better way of creating parsers. Maybe I'll look at Parsec eventually too, although I want to do a lot of stuff myself for the learning exprience. Thanks, Mike

Michael P Mossey wrote:
Michael P Mossey wrote:
This is an early attempt to create some kind of parser, for text that is xml-like but not actually xml. This is probably a disaster by Haskell standards... If someone could point me in the direction of a better way of doing things, that would be great. I don't want to use the existing parser library, not at first, because I want to learn more from first principles (for now).
To follow up my own post, I'm working through chapter 10 of "Real World Haskell" right now, and then I'll look at chapter 14 (Monads). I think this is a better way of creating parsers. Maybe I'll look at Parsec eventually too, although I want to do a lot of stuff myself for the learning exprience.
I don't think that doing it "the hard way by hand" is very enlightening, especially since you are using regular expressions anyway. (By the way, the latter are a niche in Haskell; any serious and also most non-serious parsing tasks are best done with parser combinators. The popularity of regular expressions in languages like Python or Perl is mainly due to the lack of more general parsing abstractions in these languages.) A much more rewarding introduction to parsing and parser combinators is probably to understand and experiment with their implementations, for instance by starting with Graham Hutton, Erik Meijer. Monadic Parser Combinators. http://www.cs.nott.ac.uk/~gmh/monparsing.ps Regards, apfelmus -- http://apfelmus.nfshost.com
participants (2)
-
Heinrich Apfelmus
-
Michael P Mossey