Extracting structured data in XML into records

Hi!

I'm trying to extract HCards (http://microformats.org/wiki/hcard) from HTML documents. HCard is a microformat. Microformats are an attempt to add semantic information to XML documents without adding any new tags. This is done by putting the semantic information in class attributes instead (see the 'testXml' string below).

I'm trying to find a good way to extract HCards into Haskell records. To do this I need to map XML elements with certain attribute values onto record fields. Some of the elements are optional in the XML, and I represent that using Maybe fields in my record. The order of the elements is not guaranteed, only the way they are nested. This makes it more difficult to first extract the fields I want into a list of Strings and then map that onto my record, since I need to tag each string with the value it represents.

So my question is: how can I write the function 'extractElementsIntoRecords' below? Or perhaps HXT is the wrong tool for the job and I should be trying to walk the DOM tree instead?
module HCard where
import Text.XML.HXT.Arrow
data HCard = HCard
    { familyName :: String
    , givenName  :: String
    , org        :: Maybe String
    , url        :: Maybe String
    } deriving Show
parseHCards xml = runX $ parseXml xml
parseXml xml = readString [(a_parse_html, v_1)] xml >>> deep (hasClassName "vcard") >>> extractElementsIntoRecords
extractElementsIntoRecords = undefined
hasClassName s = hasAttrValue "class" (elem s . words)
testXml = "
" ++ " http://tantek.com/\">" ++ "
On Saturday 24 February 2007 21:22, Johan Tibell wrote:
So my question is: how can I write the function 'extractElementsIntoRecords' below? Or perhaps HXT is the wrong tool for the job and I should be trying to walk the DOM tree instead?
module HCard where
import Text.XML.HXT.Arrow
data HCard = HCard
    { familyName :: String
    , givenName  :: String
    , org        :: Maybe String
    , url        :: Maybe String
    } deriving Show
parseHCards xml = runX $ parseXml xml
parseXml xml = readString [(a_parse_html, v_1)] xml >>> deep (hasClassName "vcard") >>> extractElementsIntoRecords
extractElementsIntoRecords = undefined
Perhaps something like the following (which is likely to be wrong, seeing as I'm ad-libbing):

    extractElementsIntoRecords = findFName <+> findGName <+> findOrg <+> findURL
      where
        findX c   = deep (hasName "span" >>> hasAttrValue "class" (== c)) >>> getChildren >>> getText
        findFName = findX "family-name" >>> arr Just
        findGName = findX "given-name" >>> arr Just
        findOrg   = (findX "org" >>> arr Just) `withDefault` Nothing
        findURL   = (deep (hasName "a" >>> hasAttrValue "class" (== "url")) >>> getAttrValue "href" >>> arr Just) `withDefault` Nothing

and use the following at an appropriate place:

    composeHCard (Just fn:Just gn:morg:murl:xs) = HCard fn gn morg murl : composeHCard xs
    composeHCard _ = []

There are several other possibilities for dealing with bad data, and simplifications you could make, of course.

Daniel
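
One of those other possibilities, sketched here in plain Haskell without any HXT calls: tag each extracted string with the class name it came from, and build the record by looking the tags up. This makes the result independent of element order and keeps the optional fields as Maybe. The 'Fields' type and the names 'toHCard' and 'toHCards' are hypothetical, introduced only for illustration; the sketch assumes some earlier step has already produced one list of (class name, text) pairs per vcard.

```haskell
import Data.Maybe (mapMaybe)

-- Same record as in the question (Eq added so results can be compared).
data HCard = HCard
    { familyName :: String
    , givenName  :: String
    , org        :: Maybe String
    , url        :: Maybe String
    } deriving (Show, Eq)

-- Hypothetical intermediate form: (class name, extracted text) pairs
-- for a single vcard, in whatever order they appeared in the document.
type Fields = [(String, String)]

-- Build a record from the tagged fields of one vcard.
-- A missing required field makes the whole result Nothing;
-- a missing optional field just leaves that record field as Nothing.
toHCard :: Fields -> Maybe HCard
toHCard fs = do
    fn <- lookup "family-name" fs
    gn <- lookup "given-name" fs
    return (HCard fn gn (lookup "org" fs) (lookup "url" fs))

-- Convert many vcards, dropping those that lack a required field.
toHCards :: [Fields] -> [HCard]
toHCards = mapMaybe toHCard
```

For example, 'toHCard [("url", "http://tantek.com/"), ("given-name", "Jane"), ("family-name", "Doe")]' (placeholder names) yields a record even though the URL comes first and there is no org. If a class occurs more than once, 'lookup' keeps only the first occurrence, which may or may not be what you want.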
Participants (2): Daniel McAllansmith, Johan Tibell