ANN: cheapskate 0.1, markdown parser

I've released a new markdown library on Hackage:
http://hackage.haskell.org/package/cheapskate

This library is designed to be used in web applications. It is small,
accurate, and fast, in pure Haskell with few dependencies. All output
is sanitized through a whitelist by default. It is designed to have
performance that varies linearly with the input size, even with
garbage input. To illustrate:

% head -c 100000 /dev/urandom | iconv -f latin1 -t utf-8 | time cheapskate >/dev/null
cheapskate > /dev/null  0.15s user 0.01s system 82% cpu 0.199 total
% head -c 1000000 /dev/urandom | iconv -f latin1 -t utf-8 | time cheapskate >/dev/null
cheapskate > /dev/null  1.53s user 0.03s system 88% cpu 1.770 total
% head -c 10000000 /dev/urandom | iconv -f latin1 -t utf-8 | time cheapskate >/dev/null
cheapskate > /dev/null  15.50s user 0.20s system 89% cpu 17.632 total

This is a test that many markdown parsers fail (including my own
pandoc and the markdown package on Hackage).

Performance is about seven times faster than pandoc (using about a
fifth as much memory), and about 25% faster than the markdown package
on Hackage.

Generic functions are provided that allow transformations of the AST
prior to rendering (e.g., promotion of headers, insertion of syntax
highlighting, or the conversion of specially marked code blocks into
diagrams).

Several markdown extensions are supported, and sensible decisions have
been made about several aspects of markdown syntax that are left vague
by John Gruber's specification. For details, see the README at
https://github.com/jgm/cheapskate.
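To give a rough idea of what such a transformation looks like, here is
a sketch that promotes every header by one level before rendering.
Treat the names it uses (markdown, def, walk, and the Header
constructor) as approximations of the actual API and check the
haddocks for the exact exports and types:

    {-# LANGUAGE OverloadedStrings #-}
    -- Sketch only: markdown, def, walk, and Header are assumed to be
    -- exported as described in the README; verify against the haddocks.
    import Cheapskate

    -- Promote every header by one level (## becomes #, and so on).
    promoteHeaders :: Doc -> Doc
    promoteHeaders = walk promote
      where
        promote :: Block -> Block
        promote (Header level ils) = Header (max 1 (level - 1)) ils
        promote block              = block

    main :: IO ()
    main =
      -- Assumes Doc has a Show instance; in a web application you would
      -- render the transformed Doc to HTML instead of printing it.
      print (promoteHeaders (markdown def "## Hello\n\nSome *markdown* text."))

The same pattern covers the syntax-highlighting and diagram examples:
match on the code-block constructor instead of the header one.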

Looks like an excellent library! How did you manage to maintain linear
growth in running time? Seems like quite an achievement.

+++ Justin Bailey [Jan 06 14 13:07 ]:
Looks like an excellent library! How did you manage to maintain linear growth in running time? Seems like quite an achievement.
I tried very hard to avoid any backtracking. Parsing occurs in two
phases.

In the first phase, we process each line of the input in turn (this
could even be done with streaming input; we never need to "rewind").
Each line transforms a "container stack," which represents the nested
containers (e.g., list items or blockquotes) that are open at that
point in the document. Depending on the contents of the line, we might
add new containers to the stack and/or close some of the open
containers, adding them as children to the elements below them. The
end of the line, after the bits that indicate container structure,
will generally be added as a leaf element (text line or blank line) to
the top container.

I had to make a couple of small changes in markdown syntax to make
this line-by-line strategy feasible. If an opening code fence isn't
matched by a closing code fence, the whole remainder of the document
gets parsed as code. (Otherwise we'd have to backtrack an indefinite
number of lines.) There is also a change in how raw HTML works
(documented in the README). A raw HTML block begins with an opening
block-level HTML tag and ends, not with a matching closing tag, but
with the next blank line. An advantage of this system is that we can
easily include markdown content in HTML tags (just surround it with
blank lines), yet we can still have literal HTML blocks (just make
sure they don't contain blank lines). So I think this is superior to
Gruber's original specification for raw HTML blocks, not just from a
parsing point of view but from an expressive point of view.

After we've read all the input, we close the container stack. We now
have a Document element that represents the container structure and
the text lines under each container. We also have a reference map with
the link references. You can inspect this structure using
'cheapskate --debug'; here's an example:

    (Document
       ListItem {markerColumn = 1, padding = 3, listType = Numbered PeriodFollowing 1}
         "Foo"
         ListItem {markerColumn = 3, padding = 2, listType = Bullet '-'}
           "bar"
           BlankLine ""
           "baz"
           BlankLine ""
       ListItem {markerColumn = 1, padding = 3, listType = Numbered PeriodFollowing 2}
         "Bim"
    ,fromList [])

In the second phase, we walk this container structure and produce a
proper AST. This involves assembling list items into lists and
assembling lists of text lines into paragraphs, parsing their contents
as markdown inline elements and using the reference map to resolve
reference links.

Inline parsing uses parser combinators. I avoid backtracking by using
fallbacks. For example, if we start parsing an emphasized section
beginning with '*' and don't find the closing '*', we don't backtrack;
instead, we just emit a '*' and the content we've parsed so far.

I'd love to make the library faster. It is the fastest pure Haskell
implementation I'm aware of, but C libraries like discount are still
quite a bit faster. (Note: they often skimp on proper unicode support,
which accounts for some of this.) Profiling does not show me any
particular bottleneck, but I'd welcome suggestions.

John
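P.S. To make the first phase concrete, here is a deliberately tiny
model of the container-stack idea, handling nothing but blockquotes
('>' prefixes), text lines, and blank lines. It is not the actual
cheapskate code, just an illustration of the close-and-attach
discipline described above:

    -- Toy model only (not the actual cheapskate parser): a container
    -- stack for a grammar with just blockquotes, text lines, and
    -- blank lines.
    import Data.List (stripPrefix)

    data Leaf      = TextLine String | BlankLine
      deriving Show
    data Container = Container { children :: [Either Container Leaf] }
      deriving Show

    -- Head of the list is the innermost open container; the document
    -- container sits at the bottom and stays open until end of input.
    type Stack = [Container]

    -- How many "> " markers open this line, and what remains after them.
    quoteDepth :: String -> (Int, String)
    quoteDepth s = case stripPrefix "> " s of
      Just rest -> let (n, r) = quoteDepth rest in (n + 1, r)
      Nothing   -> (0, s)

    -- Close the innermost container, attaching it as a child of its parent.
    closeOne :: Stack -> Stack
    closeOne (c : p : rest) = p { children = children p ++ [Left c] } : rest
    closeOne stack          = stack

    -- One line: close containers that are no longer open, open any new
    -- ones, then add the rest of the line as a leaf of the innermost.
    processLine :: Stack -> String -> Stack
    processLine stack ln =
      case replicate (depth + 1 - length closed) (Container []) ++ closed of
        top : below -> top { children = children top ++ [Right leaf] } : below
        []          -> []  -- cannot happen: the document container stays open
      where
        (depth, rest) = quoteDepth ln
        closed        = until (\s -> length s <= depth + 1) closeOne stack
        leaf          = if null rest then BlankLine else TextLine rest

    parseDoc :: [String] -> Container
    parseDoc ls =
      head (until ((== 1) . length) closeOne
                  (foldl processLine [Container []] ls))

    main :: IO ()
    main = print (parseDoc ["hello", "> quoted", "> > deeper", "back at top"])

Each line closes whatever containers it no longer belongs to
(attaching them to their parents), opens any new ones, and then lands
as a leaf in the innermost container, so no line is ever examined
twice.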
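The inline "fallback" strategy can be illustrated in the same toy
spirit. The sketch below is written against attoparsec rather than
cheapskate's own combinators and is not the real code; the point is
only that the characters scanned while looking for a closing '*' are
never re-read when the closer turns out to be missing:

    {-# LANGUAGE OverloadedStrings #-}
    -- Toy illustration of the fallback idea, written with attoparsec;
    -- not cheapskate's actual inline parser.
    module Main where

    import Control.Applicative (many, (<|>))
    import Data.Attoparsec.Text
    import qualified Data.Text as T

    data Inline = Str T.Text | Emph [Inline]
      deriving Show

    -- A run of ordinary characters (anything but '*').
    pStr :: Parser Inline
    pStr = Str <$> takeWhile1 (/= '*')

    -- "*...*" scanned in a single pass.  The branch point of <|> sits
    -- after the contents have been consumed, so the fallback never
    -- re-reads them; it just turns the opener into a literal "*".
    pEmphOrStar :: Parser [Inline]
    pEmphOrStar = do
      _   <- char '*'
      ils <- many pStr
      (char '*' *> pure [Emph ils]) <|> pure (Str "*" : ils)

    pInlines :: Parser [Inline]
    pInlines = concat <$> many (pEmphOrStar <|> ((: []) <$> pStr))

    main :: IO ()
    main = do
      print (parseOnly pInlines "a *b* c")  -- closing '*' found: emphasis
      print (parseOnly pInlines "a *b c")   -- no closer: literal '*' fallback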

Cool!
By the way, I've noticed that you've rolled your own parser combinator
library. May I ask what the reason for that is?
Are other parser libraries not fast enough for your needs?
Thanks
--
Sincerely yours,
Daniil

+++ Daniil Frumin [Jan 08 14 05:45 ]:
Cool!
By the way, I've noticed that you've rolled your own parser combinator library. May I ask what the reason for that is? Are other parser libraries not fast enough for your needs?
I started out using attoparsec, but I really needed the ability to query source position, which attoparsec doesn't provide. I also added a way to peek at the character *before* the current position (peekLastChar), which simplified some of my parsers. Otherwise it's very similar to attoparsec, with similar performance.
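To show why those two features are cheap to support once you own the
combinator library: the parser state can simply carry the current
position and the last character consumed, so peekLastChar and
getPosition become constant-time lookups. A toy parser type along
those lines (not cheapskate's real internals) might look like this:

    -- Toy sketch, not cheapskate's actual internals: parser state that
    -- carries the position and the previously consumed character.
    module ParserSketch where

    import qualified Data.Text as T

    data PState = PState
      { remaining :: T.Text      -- input still to be consumed
      , line      :: !Int        -- current source position
      , column    :: !Int
      , lastChar  :: Maybe Char  -- character just before the current position
      }

    newtype Parser a = Parser { runParser :: PState -> Either String (a, PState) }

    -- Look at the character before the current position; consumes nothing.
    peekLastChar :: Parser (Maybe Char)
    peekLastChar = Parser $ \st -> Right (lastChar st, st)

    getPosition :: Parser (Int, Int)
    getPosition = Parser $ \st -> Right ((line st, column st), st)

    -- Consuming a character keeps the position and lastChar up to date.
    anyChar :: Parser Char
    anyChar = Parser $ \st ->
      case T.uncons (remaining st) of
        Nothing        -> Left "end of input"
        Just (c, rest) ->
          Right (c, st { remaining = rest
                       , lastChar  = Just c
                       , line      = if c == '\n' then line st + 1 else line st
                       , column    = if c == '\n' then 1 else column st + 1 })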