[Haskell-cafe] Munging wiki articles with tagsoup

8 Sep 2008

Hiya Neil. Recently I've been trying to come up with an automated system to turn The Monad Reader articles, like those in http://sneezy.cs.nott.ac.uk/darcs/TMR/Issue11, into wiki-formatted articles for putting on Haskell.org. Thus far, I've had the most success with the SVN version of Pandoc.

Pandoc does a good job - you can see an example conversion at http://haskell.org/haskellwiki/?title=User:Gwern/kenn&oldid=22808. Modulo the errors which are largely due to haskell.org problems and a few limitations in Pandoc (no comments, no real support for references), it's fine.

But Pandoc's author will not support <haskell></haskell> tags inasmuch as they are an extension to MediaWiki and not universal; he prefers <pre> or <pre class="haskell"> tags. He suggested I use TagSoup to convert them into <haskell> tags. Well, alright. They're tags, TagSoup does tags - seems natural.

After an hour, I came up with a nice clean little script:

----

import Text.HTML.TagSoup.Render
import Text.HTML.TagSoup

main :: IO ()
main = interact convertPre

convertPre :: String -> String
convertPre = renderTags . map convertToHaskell . canonicalizeTags . parseTags

convertToHaskell :: Tag -> Tag
convertToHaskell x
  | isTagOpenName  "pre" x = TagOpen  "haskell" (extractAttribs x)
  | isTagCloseName "pre" x = TagClose "haskell"
  | otherwise              = x
  where
    extractAttribs :: Tag -> [Attribute]
    extractAttribs (TagOpen _ y) = y
    extractAttribs _             = error "The impossible happened."

----

As an aside, may I note that TagSoup doesn't seem to support transformations particularly well? Or if it does, I didn't notice any examples. I spent most of my time just figuring out how to convert the 'x' from a <pre>stuff to <haskell>stuff. Also, it might be nice to define an 'interact'-alike, which yields a (String -> String) and is defined, I suppose, as 'interact f = renderTags . f . canonicalizeTags . parseTags'. Extraction functions would be good as well - you'd only need 3 groups, I think: 1 for the 2 items in TagOpen, 1 for TagPosition's 2 positions, and 1 which extracts the String from the rest.
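
A rough sketch of what that 'interact'-alike and the three groups of extraction functions might look like. It assumes a more recent tagsoup release in which Tag is parameterised over the string type (so 'Tag String' rather than the bare 'Tag' used in the script above). The names interactTags, openParts, positionParts, stringPart and renameTag are made up for illustration and are not part of TagSoup.

```haskell
import Text.HTML.TagSoup

-- Hypothetical 'interact'-alike: lift a tag transformation to String -> String.
interactTags :: ([Tag String] -> [Tag String]) -> String -> String
interactTags f = renderTags . f . canonicalizeTags . parseTags

-- Hypothetical extractor for the two items carried by TagOpen.
openParts :: Tag String -> Maybe (String, [Attribute String])
openParts (TagOpen name attrs) = Just (name, attrs)
openParts _                    = Nothing

-- Hypothetical extractor for TagPosition's row and column.
positionParts :: Tag String -> Maybe (Int, Int)
positionParts (TagPosition row col) = Just (row, col)
positionParts _                     = Nothing

-- Hypothetical extractor for the single String held by the remaining constructors.
stringPart :: Tag String -> Maybe String
stringPart (TagClose s)   = Just s
stringPart (TagText s)    = Just s
stringPart (TagComment s) = Just s
stringPart (TagWarning s) = Just s
stringPart _              = Nothing

-- Rename a tag by pattern matching, so no partial extraction function is needed.
renameTag :: String -> String -> Tag String -> Tag String
renameTag from to tag = case tag of
  TagOpen name attrs | name == from -> TagOpen to attrs
  TagClose name      | name == from -> TagClose to
  _                                 -> tag

main :: IO ()
main = interact (interactTags (map (renameTag "pre" "haskell")))
```

Returning Maybe keeps the extractors total, which avoids the 'error "The impossible happened."' branch. Matching on the constructors directly, as renameTag does, may make the extractors unnecessary for simple renames.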

Anyway, so my script seems to work. I ran the wiki output through it and this is the diff: http://haskell.org/haskellwiki/?title=User%3AGwern%2Fkenn&diff=22827&oldid=22811.

Ok, good, it replaces all the tags... But wait, what's all this other stuff? It is replacing all my apostrophes with &#39;! No doubt this has something to do with XML/HTML/SGML or whatever, but it's not ideal. Even if it doesn't break the formatting (as I think it does), it's still cluttering up the source.
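
One possible direction, sketched under the assumption of a tagsoup release that exports renderTagsOptions and a RenderOptions record with an optEscape field (this is not the API the script above was written against): supply an escaping function that leaves apostrophes alone. The names escapeKeepingApostrophes and renderKeepingApostrophes are invented for this example.

```haskell
import Text.HTML.TagSoup

-- Escape only the characters that would change the markup's structure;
-- apostrophes (and everything else) pass through unchanged.
escapeKeepingApostrophes :: String -> String
escapeKeepingApostrophes = concatMap escapeChar
  where
    escapeChar '&' = "&amp;"
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '"' = "&quot;"
    escapeChar c   = [c]

-- Render tags using the custom escaping function instead of the default one.
renderKeepingApostrophes :: [Tag String] -> String
renderKeepingApostrophes =
  renderTagsOptions renderOptions { optEscape = escapeKeepingApostrophes }

main :: IO ()
main = interact (renderKeepingApostrophes . canonicalizeTags . parseTags)
```

Whether the result is acceptable as wiki source still depends on how the wiki treats the remaining entities, so this is a starting point rather than a confirmed fix.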

So, how can I fix this? Am I just barking up the wrong tree and should be writing a simple-minded search-and-replace sed script which replaces <pre> with <haskell>, </pre> with </haskell>...?
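
For comparison, the simple-minded search-and-replace can be sketched in Haskell with nothing beyond base. The names replaceAll and preToHaskell are made up here. It only matches the literal strings "<pre>" and "</pre>", so a tag such as <pre class="haskell"> or an upper-case <PRE> would slip through; the upside is that nothing is parsed or re-rendered, so apostrophes are never touched.

```haskell
import Data.List (isPrefixOf)

-- Replace every occurrence of a needle with a replacement.
-- An empty needle is treated as "nothing to replace".
replaceAll :: String -> String -> String -> String
replaceAll "" _ = id
replaceAll needle replacement = go
  where
    go [] = []
    go s@(c:cs)
      | needle `isPrefixOf` s = replacement ++ go (drop (length needle) s)
      | otherwise             = c : go cs

-- Swap the literal <pre> and </pre> tags for <haskell> and </haskell>.
preToHaskell :: String -> String
preToHaskell = replaceAll "</pre>" "</haskell>" . replaceAll "<pre>" "<haskell>"

main :: IO ()
main = interact preToHaskell
```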

--
gwern
USS Enforcers SORO Morwenstow MOD Albright MI5 AOL 701 GCHQ

Gwern Branwen