RE: [Haskell-cafe] how to get started: a text application

From: Max Ischenko [mailto:max@ucmg.com.ua]
Well, yes. In Markdown, as in most other "rich-text" formats, symbols are heavily overloaded. After all, it has to confine itself to plain text.
I'm going to try a "two-stage tokenization" (not sure of the correct name for this). Basically, I'd first split the raw text into "symbols" (such as space, char, digit, left-bracket) and then turn those symbols into tokens (such as paragraph, reference, start bold text, end bold text, etc.).
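The two-stage idea could be sketched roughly as below. All names here (`Symbol`, `Token`, `symbolize`, `tokenize`) and the particular symbol/token sets are invented for illustration, not taken from any real Markdown implementation:

```haskell
import Data.Char (isDigit, isSpace)

-- Stage 1: classify raw characters into low-level symbols.
data Symbol = SSpace | SNewline | SStar | SBracket Char | SChar Char | SDigit Char
  deriving (Show, Eq)

symbolize :: String -> [Symbol]
symbolize = map classify
  where
    classify '\n' = SNewline
    classify '*'  = SStar
    classify c
      | c `elem` "[]" = SBracket c
      | isSpace c     = SSpace
      | isDigit c     = SDigit c
      | otherwise     = SChar c

-- Stage 2: group symbols into higher-level tokens.
data Token = TWord String | TEmph | TBracket Char | TSpace | TParaBreak
  deriving (Show, Eq)

tokenize :: [Symbol] -> [Token]
tokenize []                         = []
tokenize (SStar : ss)               = TEmph : tokenize ss
tokenize (SBracket c : ss)          = TBracket c : tokenize ss
tokenize (SNewline : SNewline : ss) = TParaBreak : tokenize ss
tokenize (SNewline : ss)            = TSpace : tokenize ss
tokenize (SSpace : ss)              = TSpace : tokenize ss
tokenize ss =
  let (word, rest) = span isWordSym ss
  in TWord (map symChar word) : tokenize rest
  where
    isWordSym (SChar _)  = True
    isWordSym (SDigit _) = True
    isWordSym _          = False
    symChar (SChar c)  = c
    symChar (SDigit c) = c
    symChar _          = error "unreachable"
```

For example, `tokenize (symbolize "hi *there*")` yields `[TWord "hi", TSpace, TEmph, TWord "there", TEmph]`, leaving the later parsing stage to pair up the emphasis markers.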
Markdown looks a lot like Wiki source to me, i.e. the text source for a Wiki page. It seems to serve the same purpose: well-formatted plain text intended for conversion to HTML. Many (most?) Wiki engines use straightforward regex substitution to convert the text source into HTML, rather than implement a lexer/parser/pretty-printer combination. Obviously this makes for a fairly simple implementation. Mind you, some of the regexes are quite complex...

See, for example:

Source for MoinMoin, which runs the Haskell wiki:
http://cvs.sf.net/viewcvs.py/moin/MoinMoin/parser/wiki.py?view=markup

Original c2.com wiki (actual source a bit hard to find):
http://www.c2.com/cgi/wiki?TextFormattingRegularExpressions

... which leads to:
http://www.c2.com/cgi/wiki?TextFormattingRegularExpressionsDiscussion
http://www.c2.com/cgi/wiki?AlternativesToRegularExpressions
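The substitution style described above could be mimicked in Haskell without any regex library: each line is matched against a few surface patterns and rewritten directly to HTML, with no token or parse-tree stage. The rules below are invented for the example, not taken from any real wiki engine:

```haskell
import Data.List (stripPrefix)

-- Rewrite one line of "wiki-ish" source directly to HTML.
-- Pattern guards play the role the regexes play in a real engine.
lineToHtml :: String -> String
lineToHtml l
  | Just t <- stripPrefix "== " l = "<h2>" ++ t ++ "</h2>"
  | Just t <- stripPrefix "= "  l = "<h1>" ++ t ++ "</h1>"
  | Just t <- stripPrefix "* "  l = "<li>" ++ t ++ "</li>"
  | null l                        = ""
  | otherwise                     = "<p>" ++ l ++ "</p>"

-- The whole conversion is just a per-line map; there is no global state,
-- which is both the appeal and the limitation of this approach.
textToHtml :: String -> String
textToHtml = unlines . map lineToHtml . lines
```

The limitation shows up as soon as constructs span lines (nested lists, multi-line emphasis), which is where a real tokenizer/parser starts to pay off.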

Bayley, Alistair wrote:
Markdown looks a lot like Wiki source to me i.e. it looks like the text source for a Wiki page. It seems to serve the same purpose i.e. well-formatted plain text intended for conversion to HTML.
Exactly.
Many (most?) Wiki engines use straightforward regex substitution to convert the text source into HTML, rather than implement a lexer/parser/pretty-printer combination. Obviously this makes for a fairly simple implementation. Mind you, some of the regex's are quite complex...
You're right; even Markdown itself is implemented that way (in Perl). Of course, building unnecessary layers of abstraction is no good, but my goals are more educational than practical. Plus, sometimes you need to complicate things to simplify them. ;-)

At 16:14 25/06/04 +0300, Max Ischenko wrote:
Many (most?) Wiki engines use straightforward regex substitution to convert the text source into HTML, rather than implement a lexer/parser/pretty-printer combination. Obviously this makes for a fairly simple implementation. Mind you, some of the regex's are quite complex...
You're right; even Markdown itself is implemented that way (in Perl). Of course, building unnecessary layers of abstraction is no good, but my goals are more educational than practical. Plus, sometimes you need to complicate things to simplify them. ;-)
On reflection, I think there's a strong case for doing it this way (i.e. with a separate tokenizer) in Haskell, even if the tokenization is very simple, because it helps to separate some of the character-level issues from the remaining program logic. Any spurious detail that can be separated from the core logic makes the core easier to understand. Divide and rule!

And, if I understand correctly, Haskell's lazy evaluation should mean that there's little or no penalty for doing this, even though it looks as if you're generating substantial intermediate data.

...

BTW, for a project like this, be very aware of the cost of using ++ to append to a sequence. Look at the ShowS type in the standard Prelude (PreludeText). I also made some notes about this [1].

#g
--
[1] http://www.ninebynine.org/Software/Learning-Haskell-Notes.html#UsingShowS

------------
Graham Klyne
For email: http://www.ninebynine.org/#Contact
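The ShowS point can be illustrated with a small sketch (the `renderPlain`/`renderS` names are made up for this example). `ShowS` is just `String -> String` in the Prelude: instead of appending with `++`, which re-walks the accumulated left operand on every append, each piece becomes a function `(s ++)` and the pieces are chained with `(.)`, which is cheap; the final string is materialised once at the end:

```haskell
-- Left-nested ++: each append copies the prefix built so far,
-- so building n pieces costs O(n^2) in the total length.
renderPlain :: [String] -> String
renderPlain = foldl (\acc s -> acc ++ s) ""

-- ShowS style: showString s = (s ++); composition just builds a
-- pipeline of prepend functions, applied once to "" at the end.
renderS :: [String] -> String
renderS pieces = foldr (\s k -> showString s . k) id pieces ""
```

Both produce the same string; only the cost differs, and the difference matters exactly in pretty-printer-shaped programs that emit many small fragments.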

Graham Klyne wrote:
On reflection, I think there's a strong case for doing it this way (i.e. with a separate tokenizer) in Haskell, even if the tokenization is very simple, because it helps to separate some of the character-level issues from the remaining program logic. Any spurious detail that can be separated from the core logic makes the core easier to understand.
Yep.
Divide and rule! And, if I understand correctly, Haskell's lazy evaluation should mean that there's little or no penalty for doing this, even though it looks as if you're generating substantial intermediate data.
Fine. Though I'm not concerned with performance, at least not until I get a first working version. ;)
BTW, for a project like this, be very aware of the cost of using ++ to append to a sequence. Look at the ShowS type in the standard Prelude (PreludeText). I also made some notes about this [1].
[1] http://www.ninebynine.org/Software/Learning-Haskell-Notes.html#UsingShowS
I skimmed through, and the whole page looks very instructive, thanks!
participants (3)
- Bayley, Alistair
- Graham Klyne
- Max Ischenko