how to get started: a text application

Hi all,

I'm going to try to implement a version of the Markdown tool [1] in Haskell. The application is rather simple: take a text file with some (simple) mark-up embedded in it and turn it into another text file, this time with XHTML markup.

I need some guidelines on how to get started. I'll have to struggle with both the language itself (as I am a newcomer) and with the libraries that I may need.

If I were to do this in some convenient language, like Python, I'd use a lexer to parse the input, a state machine to build some internal representation and some serializer to write this into XHTML.

Which approaches could you suggest for that kind of application in Haskell? I'd be especially glad to hear about some "idiomatic" or "native" ways to solve the problem.

I've made a quick search and found tools like Parsec, HaXml, Happy and WASH/HTML. Not sure which ones I'll need.

[1] - http://daringfireball.net/projects/markdown/

TIA, max.

At 16:53 22/06/04 +0300, Max Ischenko wrote:
Hi all,
I'm going to try to implement a version of Markdown tool[1] in Haskell. The application is rather simple: take a text file with some (simple) mark-up embedded in it and turn it into another text file, this time with XHTML markup.
Cool project!
I need some guidelines on how to get started. I'll have to struggle with both the language itself (as I am a newcomer) plus with libraries that I may need.
Ah, how to get started? I find myself wrestling with this each time I start a new Haskell project, or phase of a project.
If I were to do this in some convenient language, like Python, I'd use a lexer to parse the input, a state machine to build some internal representation and some serializer to write this into XHTML.
These choices are all available in Haskell. The way you put them together may be a little different (see below). It's been a while since I peered at Markdown, and I forget how the language looks, except that I recall that, by design, it's kept very simple and has very little "superfluous" syntactic structure.

I think the first choice is whether to go for a separately identifiable lexing phase, rather than working directly from the raw text. Either might work, I think. The HaXml XML parser has a separate lexer, but it turns out that it's not always easy to get the tokenization right without having contextual information (e.g. from the syntax analyzer). (XML is rather messy in that way.)

In Haskell, it's often reasonably efficient to construct a program as a composition of "filters", rather like a Unix command pipeline; lazy evaluation often means that data can "stream" through the filters and never exist in its entirety in an intermediate form. This immediately allows the program structure to be resolved into a number of smaller, independent pieces; e.g.

  tokenize    :: String   -> [Token]
  parse       :: [Token]  -> DocModel
  createXHTML :: DocModel -> Document   -- (cf. HaXml)

Then HaXml provides functions that can generate textual XML. Thus the overall conversion function might look like:

  markdownToXHTML :: String -> String
  markdownToXHTML = show . document . createXHTML . parse . tokenize

(where "document" is from the HaXml module Text.XML.HaXml.Pretty).

For parsing of any complexity, I recommend Parsec: it has the advantage of being very well documented, and it helps to show how monads can be used to handle state information.

The outline sketched above has at least one weakness: it doesn't provide any way to handle errors. This could be overcome by using Either as an error monad (see Control.Monad and Control.Monad.Error in the standard hierarchical libraries), and then using >>= in place of function composition (noting the reversal of component order):

  -- Get definition of (Either String) as an instance of Monad:
  import Control.Monad()
  import Control.Monad.Error()

  tokenize    :: String   -> Either String [Token]
  parse       :: [Token]  -> Either String DocModel
  createXHTML :: DocModel -> Either String Document

  markdownToXHTML :: String -> Either String String
  markdownToXHTML s =
      tokenize s >>= parse >>= createXHTML >>= (Right . show . document)

Or, using the do-notation:

  markdownToXHTML s = do
      { ts <- tokenize s
      ; dm <- parse ts
      ; xd <- createXHTML dm
      ; return $ show (document xd)
      }
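For completeness, the top-level driver need not be much more than the following. This is just a sketch, assuming markdownToXHTML has the Either-returning type above; the file handling and argument convention are whatever you choose:

  import System.Environment (getArgs)
  import System.IO (hPutStrLn, stderr)

  main :: IO ()
  main = do
      [inFile, outFile] <- getArgs   -- e.g. ./markdown in.txt out.html
      s <- readFile inFile           -- lazy, so input need not be held whole
      case markdownToXHTML s of
          Left err    -> hPutStrLn stderr ("markdown: " ++ err)
          Right xhtml -> writeFile outFile xhtml

With the pure String -> String version of the pipeline, readFile's laziness would let the input stream through the filters as the output is written.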
Which approaches could you suggest for that kind of application in Haskell? I'd be especially glad to hear about some "idiomatic" or "native" ways to solve the problem.
I've made a quick search and found tools like Parsec, HaXML, Happy and WASH/HTML. Not sure which ones I'll need.
I recommend Parsec (if it fits your needs, of course). It's not clear to me whether you want to generate text directly from the document data, or via something like HaXml. HaXml does have a pretty printing function (though I have noticed some minor problems with it, probably nothing you'd encounter with this application).

...

Just to check out my use of >>= and do-notation, I constructed a trivial complete program using both.

[[
-- spike-composition.hs
--
-- Composition of functions, using Either as an error monad.

-- Get definition of (Either String) as an instance of Monad:
import Control.Monad()
import Control.Monad.Error()

data Token    = T { unT :: Char }
data DocModel = D String

instance Show DocModel where
    show (D s) = s

tokenize :: String -> Either String [Token]
tokenize = Right . map T

parse :: [Token] -> Either String DocModel
parse = Right . D . reverse . map unT

docToString1 s =
    tokenize s >>= parse >>= (Right . ("docToString1: "++) . show)

docToString2 s = do
    { ts <- tokenize s
    ; dm <- parse ts
    ; return $ (("docToString2: "++) . show) dm
    }

t1 = docToString1 "abc"   -- == Right "docToString1: cba"
t2 = docToString2 "def"   -- == Right "docToString2: fed"
]]

#g
--
[1] - http://daringfireball.net/projects/markdown/
TIA, max.
------------ Graham Klyne For email: http://www.ninebynine.org/#Contact

Graham Klyne wrote:
I think the first choice is whether to go for a separately identifiable lexing phase, rather than working directly from the raw text. Either might work, I think.
The first option (tokenization) is more appealing to me.
The HaXml XML parser has a separate lexer, but it turns out that it's not always easy to get the tokenization right without having contextual information (e.g. from the syntax analyzer). (XML is rather messy in that way.)
Well, yes. In Markdown, like in most other "rich-text" formats, symbols are overloaded a lot. After all, it has to constrain itself to "plain text".

I'm going to try a "two-stage tokenization" (not sure how to name this correctly). Basically, first I'd split the raw text into "symbols" (like space, char, digit, left-bracket) and then turn these symbols into tokens (like paragraph, reference, start bold text, end bold text, etc.).
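A first cut might look something like this (just a sketch to fix the idea; the Symbol and Token types and the particular constructors are only guesses at this point, and references, headers etc. are not handled):

  -- Stage 1: classify each character as a "symbol".
  data Symbol = Space | Newline | Digit Char | LBracket | RBracket
              | Star | Plain Char
      deriving Show

  -- Stage 2: group symbols into higher-level tokens.
  data Token = Para | BoldOn | BoldOff | Text String
      deriving Show

  classify :: Char -> Symbol
  classify ' '  = Space
  classify '\n' = Newline
  classify '['  = LBracket
  classify ']'  = RBracket
  classify '*'  = Star
  classify c
      | c >= '0' && c <= '9' = Digit c
      | otherwise            = Plain c

  symbols :: String -> [Symbol]
  symbols = map classify

  -- A blank line becomes a paragraph break, '*' toggles bold, and
  -- runs of anything else are collected into a single Text token.
  tokens :: [Symbol] -> [Token]
  tokens = go False
    where
      go _ []                       = []
      go b (Newline : Newline : ss) = Para : go b ss
      go b (Newline : ss)           = go b (Space : ss)
      go b (Star : ss)              = (if b then BoldOff else BoldOn) : go (not b) ss
      go b ss                       = Text (map unSym run) : go b rest
        where (run, rest) = span plainish ss
      plainish s = case s of { Newline -> False; Star -> False; _ -> True }
      unSym s = case s of
          Space    -> ' '
          Digit c  -> c
          LBracket -> '['
          RBracket -> ']'
          Plain c  -> c
          _        -> '?'

  -- tokens (symbols "some *bold* text")
  --   == [Text "some ",BoldOn,Text "bold",BoldOff,Text " text"]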
In Haskell, it's often reasonably efficient to construct a program as a composition of "filters", rather like a Unix command pipeline; lazy evaluation often means that data can "stream" through the filters and never exist in its entirety in an intermediate form. This immediately allows the program structure to be resolved into a number of smaller, independent pieces; e.g.
tokenize :: String -> [Token] parse :: [Token] -> DocModel createXHTML :: DocModel -> Document -- (cf. HaXml)
Yes, I have seen this pattern in tutorial materials and am inclined to use it.
Then HaXml provides function that can generate textual XML. Thus the overall conversion function might look like:
markdownToXHTML :: String -> String markdownToXHTML = show . document . createXHTML . parse . tokenize
That would be a good start, thanks!
(where "document" is from the HaXml module Text.XML.HaXml.Pretty).
For parsing of any complexity, I recommend Parsec: it has the advantage of being very well documented, and it helps to show how monads can be used to handle state information.
OK.
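To get a feel for it, a tiny Parsec fragment (using Text.ParserCombinators.Parsec; the Markdown-specific rules here are only a rough guess, not a worked-out grammar):

  import Text.ParserCombinators.Parsec

  data Inline = Plain String | Emph String
      deriving Show

  -- "*emphasised*" spans; everything else is plain text.
  inline :: Parser Inline
  inline = try emph <|> plainText

  emph :: Parser Inline
  emph = do
      _ <- char '*'
      s <- many1 (noneOf "*\n")
      _ <- char '*'
      return (Emph s)

  plainText :: Parser Inline
  plainText = do
      s <- many1 (noneOf "*\n")
      return (Plain s)

  inlines :: Parser [Inline]
  inlines = many inline

  -- parse inlines "" "some *bold* words"
  --   == Right [Plain "some ",Emph "bold",Plain " words"]

The 'try' is there so that an unmatched '*' just makes emph fail without committing, rather than taking the whole parse down.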
The outline sketched above has at least one weakness, it doesn't provide any way to handle errors. This could be overcome by using Either as an error monad (see Control.Monad and Control.Monad.Error in the standard hierarchical libraries), and then using >>= in place of function composition (noting the reversal of component order):
Uhm. Looks like error handling is very different from what imperative languages do. I think I'll try to get a basic version working without it first.

On a related note, how can I debug my program along the way? I suspect I can't even use a print inside a function.

[ error handling cut ]
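(I did find Debug.Trace in the libraries, which apparently lets you sneak a message out of a pure function, if I understand it right; a tiny test:)

  import Debug.Trace (trace)

  -- trace writes its first argument to stderr when the second is evaluated,
  -- then returns the second unchanged; handy for peeking at intermediate values.
  lineLength :: String -> Int
  lineLength s = trace ("lineLength got: " ++ show s) (length s)

  main :: IO ()
  main = print (lineLength "hello")   -- emits the trace message, then 5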
Just to check out my use of >>= and do-notation, I constructed a trivial complete program using both.
Phew. Not sure how much time it will take to comprehend this "triviality". I can't yet grasp monads and their applications.

[ example skipped ]

Thank you for your time and explanations. They were surely very helpful, esp. considering the fact that I had exactly one reply to my post. ;-)

At 15:17 24/06/04 +0300, Max Ischenko wrote: [...]
The HaXml XML parser has a separate lexer, but it turns out that it's not always easy to get the tokenization right without having contextual information (e.g. from the syntax analyzer). (XML is rather messy in that way.)
Well, yes. In Markdown, like in most other "rich-text" formats, symbols are overloaded a lot. After all, it has to constrain itself to "plain text".
I'm going to try a "two-stage tokenization" (not sure how to name this correctly). Basically, first I'd split the raw text into "symbols" (like space, char, digit, left-bracket) and then turn these symbols into tokens (like paragraph, reference, start bold text, end bold text, etc.)
Maybe that can work. But in raw text, as opposed to markup, some characters get treated differently. The XML lexer I mentioned aims to distinguish "free text" from markup and other formal structure, and returns arbitrarily long sections of free text as a single token. [...]
The outline sketched above has at least one weakness, it doesn't provide any way to handle errors. This could be overcome by using Either as an error monad (see Control.Monad and Control.Monad.Error in the standard hierarchical libraries), and then using >>= in place of function composition (noting the reversal of component order):
Uhm. Looks like error handling is very different from what imperative languages do. I think I'll try to get a basic version working without it first. On a related note, how can I debug my program along the way? I suspect I can't even use a print inside a function.
[ error handling cut ]
Just to check out my use of >>= and do-notation, I constructed a trivial complete program using both.
Phew. Not sure how much time it will take to comprehend this "triviality". I can't yet grasp monads and their applications.
Using it is easy enough. In this case, when any component returns a (Left x) value (an error), that is propagated down the pipeline without further processing. But the (y) in a (Right y) is passed to the next component for processing.

Fully understanding how monads work takes a little longer. It's an abstract idea that serves a surprising number of practical requirements. The issue you come up against in a functional language is that there's no way to "bail out" if you see an error, other than by aborting the program altogether by returning (error x), or by treating the error condition as part of the return value and testing it at each stage. The monad machinery allows all that result-testing to be tucked away in the bind (>>=) operator.

(There is a third way that involves catching exceptions, but that only works in the IO monad. There's a recent thread about that on the [Haskell] mailing list, subject "IO, exceptions and error handling".)

#g
--
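P.S. To make that concrete, here is roughly what the bind operator amounts to for (Either String), written out by hand; the real instance lives in the libraries, and this is only an illustration:

  import Data.Char (toUpper)

  -- A Left result short-circuits the rest of the pipeline;
  -- a Right result is handed on to the next stage.
  bindE :: Either String a -> (a -> Either String b) -> Either String b
  bindE (Left e)  _ = Left e
  bindE (Right x) f = f x

  checkNonEmpty :: String -> Either String String
  checkNonEmpty "" = Left "empty input"
  checkNonEmpty s  = Right s

  shout :: String -> Either String String
  shout = Right . map toUpper

  t3 = checkNonEmpty "hi" `bindE` shout   -- == Right "HI"
  t4 = checkNonEmpty ""   `bindE` shout   -- == Left "empty input" (shout never runs)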
[ example skipped ]
Thank you for your time and explanations. They were surely very helpful, esp. considering the fact that I had exactly one reply to my post. ;-)
------------ Graham Klyne For email: http://www.ninebynine.org/#Contact
participants (2)
- Graham Klyne
- Max Ischenko