a better way to do this

8 Apr 2009

      This is an early attempt to create some kind of parser, for text that is 
xml-like but not actually xml. This is probably a disaster by Haskell 
standards... If someone could point me in the direction of a better way of doing 
things, that would be great. I don't want to use the existing parser library, 
not at first, because I want to learn more from first principles (for now).

The input looks something like:

<entry>
   <field1> thingy </field1>
   <field2> other thingy </field2>
</entry>
<entry>
   ...
</entry>
...

where there are any number of entries. Each entry consists of a varied number of 
named fields. For now, the named fields can be anything--any names, in any 
order. Later I'll do sanity checking to ensure the right fields are there, to 
provide default values, etc.

This uses Data.Bytestring.Char8 for efficiency processing large files.

The output types are as follows:

type Bs = B.ByteString  -- alias
type Component = ( Bs, Bs ) -- one named field
type Entry = [ Component ] -- all named fields in one entry
type Doc = [ Entry ] -- all entries in the input document

The basic strategy is to create parsing functions, which take in a string 
(actually ByteString), and return an object, the remainder of the string, and an 
index. (The index indicates the position of the first character in the remainder 
of the string, which is useful for giving error messages.)

Top level function is called parseReqs ( "parse requirements" -- this is 
actually going to be used for a software requirements management project).

Here's the rest of the code:

-- types for regular expression matching
type Re3 = ( Bs, Bs, Bs )
type Re4 = ( Bs, Bs, Bs, [ Bs ] )

parseReqs :: Bs -> ( Doc, Bs, Int )
parseReqs buf = parseReqs' buf 0
parseReqs' :: Bs -> Int -> ( Doc, Bs, Int )
parseReqs' buf idx
   | B.null buf = ( [], buf, idx )
   | otherwise = case parseEntry buf idx  of
     (Just e, rem, remIdx)   ->
       let ( doc, rem', remIdx' ) = parseReqs' rem remIdx
       in ( e : doc, rem', remIdx' )
     (Nothing, rem, remIdx ) -> ( [], rem, remIdx )

parseEntry :: Bs -> Int -> ( Maybe Entry, Bs, Int )
parseEntry buf idx =
   let ( before, match, after ) = buf =~ "<entry>" :: Re3
       idx' = idx + B.length before + B.length match
   in if B.null match
        then ( Nothing, after, idx' )
        else let ( e, after', idx'' ) = parseEntryBody after idx'
             in ( Just e, after', idx'' )

parseEntryBody :: Bs -> Int -> ( Entry, Bs, Int )
parseEntryBody buf idx =
   let ( before, match, after ) = buf =~ "</entry>" :: Re3
       idx' = idx + B.length before + B.length match
   in if B.null match
        then error "Missing </entry>"
        else
         -- Note: index passed to parseEntryComponents is same as one passed
         -- into this function, because we pass 'before' to
         -- parseEntryComponents. Index returned from from this function is
         -- the one calculated above to occur at the start of 'after'
         ( parseEntryComponents before idx, after, idx' )

parseEntryComponents :: Bs -> Int -> Entry
parseEntryComponents buf idx =
   let ( before, match, after, groups ) = buf =~ B.pack "<([^>]+)>" :: Re4
       idx' = idx + B.length before + B.length match
   in if B.null match
        then []
        else
          let ( component, buf', idx'' ) =
                parseCompBody (head groups) after idx'
              components = parseEntryComponents buf' idx''
          in component : components

parseCompBody :: Bs -> Bs -> Int -> ( Component, Bs, Int )
parseCompBody compName buf idx =
   let ( before, match, after ) = buf =~
         (B.pack "") :: Re3
       idx' = idx + B.length before + B.length match
   in if B.null match
      then error ("No ending to component " ++ B.unpack compName)
      else ( ( compName, before ), after, idx' )

Michael P Mossey

Michael P Mossey

Heinrich Apfelmus

tags

participants (2)