
Hiya Neil.

So recently I've been trying to come up with some automated system to turn The Monad Reader articles like those in http://sneezy.cs.nott.ac.uk/darcs/TMR/Issue11 into wiki-formatted articles for putting on Haskell.org. Thus far, I've had the most success with SVN Pandoc. Pandoc does a good job - you can see an example conversion at http://haskell.org/haskellwiki/?title=User:Gwern/kenn&oldid=22808. Modulo the errors which are largely due to haskell.org problems and a few limitations in Pandoc (no comments, no real support for references), it's fine.

But Pandoc's author will not support <haskell></haskell> tags inasmuch as they are an extension to MediaWiki and not universal; he prefers <pre> or <pre class="haskell"> tags. He suggested I use TagSoup to convert them into <haskell> tags.

Well, alright. They're tags, TagSoup does tags - seems natural. After an hour, I came up with a nice clean little script:

----
import Text.HTML.TagSoup.Render
import Text.HTML.TagSoup

main :: IO ()
main = interact convertPre

convertPre :: String -> String
convertPre = renderTags . map convertToHaskell . canonicalizeTags . parseTags

convertToHaskell :: Tag -> Tag
convertToHaskell x
    | isTagOpenName "pre" x  = TagOpen "haskell" (extractAttribs x)
    | isTagCloseName "pre" x = TagClose "haskell"
    | otherwise              = x
    where extractAttribs :: Tag -> [Attribute]
          extractAttribs (TagOpen _ y) = y
          extractAttribs _             = error "The impossible happened."
----

On an aside, may I note that TagSoup doesn't seem to support transformations particularly well? Or if it does, I didn't notice any examples. I spent most of my time just figuring out how to convert the 'x' from a <pre>stuff to <haskell>stuff. Also, it might be nice to define an 'interact'-alike, which is (String -> String), and defined, I suppose, as 'interact f = renderTags . f . canonicalizeTags . parseTags'.

Extraction functions would be good as well - you'd only need 3 groups, I think; 1 for the 2 items in TagOpen, 1 for TagPosition's 2 positions, and 1 which extracts the String from the rest.

Anyway, so my script seems to work. I ran the wiki output through it and this is the diff: http://haskell.org/haskellwiki/?title=User%3AGwern%2Fkenn&diff=22827&oldid=22811.

Ok, good, it replaces all the tags... But wait, what's all this other stuff? It is replacing all my apostrophes with '! No doubt this has something to do with XML/HTML/SGML or whatever, but it's not ideal. Even if it doesn't break the formatting (as I think it does), it's still cluttering up the source.

So, how can I fix this? Am I just barking up the wrong tree and should be writing a simple-minded search-and-replace sed script which replaces <pre> with <haskell>, </pre> with </haskell>...?

--
gwern
USS Enforcers SORO Morwenstow MOD Albright MI5 AOL 701 GCHQ

Hi Gwern,

Sorry for not noticing this sooner, my haskell-cafe@ reading is somewhat behind right now!

> After an hour, I came up with a nice clean little script:
> ----
> import Text.HTML.TagSoup.Render
> import Text.HTML.TagSoup
>
> main :: IO ()
> main = interact convertPre
>
> convertPre :: String -> String
> convertPre = renderTags . map convertToHaskell . canonicalizeTags . parseTags
>
> convertToHaskell :: Tag -> Tag
> convertToHaskell x
>     | isTagOpenName "pre" x  = TagOpen "haskell" (extractAttribs x)
>     | isTagCloseName "pre" x = TagClose "haskell"
>     | otherwise              = x
>     where extractAttribs :: Tag -> [Attribute]
>           extractAttribs (TagOpen _ y) = y
>           extractAttribs _             = error "The impossible happened."

convertToHaskell (TagOpen "pre" atts) = TagOpen "haskell" atts
convertToHaskell (TagClose "pre") = TagClose "haskell"
convertToHaskell x = x

Direct pattern matching is much easier and simpler.
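
To make the pattern-matching version easy to try without installing anything, here is a minimal sketch that runs against a stand-in `Tag` type. The `Tag` and `Attribute` definitions below are simplified assumptions for illustration only; tagsoup's real type has more constructors, and the real pipeline would still go through parseTags and a render function.

```haskell
-- Stand-in for tagsoup's Tag type (an assumption for illustration only;
-- the real library type has more constructors).
type Attribute = (String, String)

data Tag
  = TagOpen String [Attribute]
  | TagClose String
  | TagText String
  deriving (Eq, Show)

-- Rename <pre> to <haskell>, keeping the attributes; leave all other tags alone.
convertToHaskell :: Tag -> Tag
convertToHaskell (TagOpen "pre" atts) = TagOpen "haskell" atts
convertToHaskell (TagClose "pre")     = TagClose "haskell"
convertToHaskell x                    = x

main :: IO ()
main = mapM_ (print . convertToHaskell)
  [ TagOpen "pre" [("class", "haskell")]
  , TagText "main = return ()"
  , TagClose "pre"
  , TagOpen "p" []
  ]
```

Because the attribute list is bound directly in the pattern, no separate extraction helper (and no "impossible" error case) is needed.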

> Anyway, so my script seems to work. I ran the wiki output through it and this is the diff: http://haskell.org/haskellwiki/?title=User%3AGwern%2Fkenn&diff=22827&oldid=22811.
>
> Ok, good, it replaces all the tags... But wait, what's all this other stuff? It is replacing all my apostrophes with '! No doubt this has something to do with XML/HTML/SGML or whatever, but it's not ideal. Even if it doesn't break the formatting (as I think it does), it's still cluttering up the source.

The escaping of ' is caused by renderTags, so instead call:

renderTagsOptions (renderOptions{optEscape = (:[])})

For no escaping of any characters, or more likely do something like <, > and & conversions. See the docs: http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-T...
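
As a rough sketch of what "something like <, > and & conversions" could look like, the function below escapes only those three characters and passes apostrophes through. It assumes optEscape takes a per-character function of type Char -> String, which is what the (:[]) in the call above suggests; `escapeMinimal` and `escapeNone` are hypothetical names, and plugging one in as renderOptions{optEscape = escapeMinimal} has not been checked against the 0.6 documentation.

```haskell
-- Escape only the three characters that matter for markup; apostrophes and
-- everything else pass through unchanged. (Hypothetical helper; assumes the
-- escape hook has type Char -> String.)
escapeMinimal :: Char -> String
escapeMinimal '<' = "&lt;"
escapeMinimal '>' = "&gt;"
escapeMinimal '&' = "&amp;"
escapeMinimal c   = [c]

-- The no-escaping variant from the message: (:[]) wraps each character in a
-- singleton string, so nothing is rewritten.
escapeNone :: Char -> String
escapeNone = (:[])

main :: IO ()
main = do
  putStrLn (concatMap escapeMinimal "it's a < b && c")
  putStrLn (concatMap escapeNone "it's a < b && c")
```

The first line printed has the angle bracket and ampersands escaped while the apostrophe is left alone; the second is the input unchanged.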

> Am I just barking up the wrong tree and should be writing a simple-minded search-and-replace sed script which replaces <pre> with <haskell>, </pre> with </haskell>...?

Not necessarily. If you literally just want to replace "<pre>" with "<haskell>" then sed is probably the easy choice. However, it's quite likely you'll want to make more fixes, and tagsoup gives you the flexibility to extend in that direction.

Thanks

Neil
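
For comparison, the literal search-and-replace route can also be sketched in a few lines of plain Haskell using only base. `replaceAll` and `preToHaskell` are hypothetical names introduced here, not part of tagsoup or any library mentioned in the thread.

```haskell
import Data.List (isPrefixOf)

-- Replace every occurrence of `old` with `new`, scanning left to right.
-- An empty `old` is treated as "nothing to replace".
replaceAll :: String -> String -> String -> String
replaceAll [] _ = id
replaceAll old new = go
  where
    go [] = []
    go s@(c:cs)
      | old `isPrefixOf` s = new ++ go (drop (length old) s)
      | otherwise          = c : go cs

-- Literal tag rename, equivalent in spirit to a two-expression sed script.
preToHaskell :: String -> String
preToHaskell = replaceAll "</pre>" "</haskell>" . replaceAll "<pre>" "<haskell>"

main :: IO ()
main = interact preToHaskell
```

This only matches the exact strings: an opening tag written as <pre class="haskell"> is left untouched while its closing tag is still renamed, which is the kind of case where working on parsed tags is the more flexible choice.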

On 2008.09.09 19:49:49 +0100, Neil Mitchell
> Hi Gwern,
>
> Sorry for not noticing this sooner, my haskell-cafe@ reading is somewhat behind right now!

NP. I'm in no hurry; this TMR thing is a side project of mine, and I still haven't figured out how to get references/pandoc/citeproc-hs to work together, and I want to get them to work before I actually start uploading any converted articles.

> convertToHaskell (TagOpen "pre" atts) = TagOpen "haskell" atts
> convertToHaskell (TagClose "pre") = TagClose "haskell"
> convertToHaskell x = x
>
> Direct pattern matching is much easier and simpler.

That is very nice! Now the whole thing is like 5 lines of actual code. Once again, TagSoup wins.

> The escaping of ' is caused by renderTags, so instead call:
>
> renderTagsOptions (renderOptions{optEscape = (:[])})

Thanks.

> For no escaping of any characters, or more likely do something like <, > and & conversions. See the docs: http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-T...

Well, I did look at that Haddock page, as well as the others. But honestly, just a bare line like 'renderTagsOptions :: RenderOptions -> [Tag] -> String' doesn't help me - it doesn't tell me that 'that's default behavior, but you can override it thusly'.

>> Am I just barking up the wrong tree and should be writing a simple-minded search-and-replace sed script which replaces <pre> with <haskell>, </pre> with </haskell>...?
>
> Not necessarily. If you literally just want to replace "<pre>" with "<haskell>" then sed is probably the easy choice. However, it's quite likely you'll want to make more fixes, and tagsoup gives you the flexibility to extend in that direction.
>
> Thanks
>
> Neil

Hm hm. I see; the TagSoup way is more powerful in the long run.

--
gwern
blackjack NAVSVS Koancho Counter Merlin JICS 510 fuses JICC y

Hi

>> For no escaping of any characters, or more likely do something like <, > and & conversions. See the docs: http://hackage.haskell.org/packages/archive/tagsoup/0.6/doc/html/Text-HTML-T...
>
> Well, I did look at that Haddock page, as well as the others. But honestly, just a bare line like 'renderTagsOptions :: RenderOptions -> [Tag] -> String' doesn't help me - it doesn't tell me that 'that's default behavior, but you can override it thusly'.

The lack of sufficient documentation is a bug. I've filed it at: http://code.google.com/p/ndmitchell/issues/detail?id=91

If someone wants to write the documentation and submit a patch, that would be great. Otherwise, I'll fix it at some unknown point in the future.

Thanks

Neil

participants (2): Gwern Branwen, Neil Mitchell