Re: [Haskell-cafe] Google Summer of Code: BlazeHTML RFC

On Thu, May 27, 2010 at 11:40 AM, Ivan Miljenovic wrote:

On 27 May 2010 18:33, Michael Snoyman wrote:

I don't do any string concatenation (look closely), I was very careful to
avoid it. I tried with lazy text as well: it was slower. This isn't
surprising, since lazy text, under the surface, is just a list of strict
text. And the benchmark itself already has a lazy list of strict text.
Using lazy text would just be adding a layer of wrapping.

I don't know what you mean by "explicitly using Text values"; do you mean
calling pack manually? That's really all that OverloadedStrings does.
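
To illustrate the point about OverloadedStrings, here is a minimal sketch. The names viaLiteral and viaPack are made up for this example; it assumes only that the text package is available. With the extension on, a string literal is elaborated to a call to fromString, and the IsString instance for Text amounts to pack, so the two definitions should denote the same value.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

-- With OverloadedStrings the literal below is elaborated to
-- `fromString "<p>Hello</p>"`; for Text, fromString behaves like T.pack.
-- (viaLiteral is a hypothetical name used only in this sketch.)
viaLiteral :: T.Text
viaLiteral = "<p>Hello</p>"

-- The same value built by calling pack by hand on a String.
viaPack :: T.Text
viaPack = T.pack ("<p>Hello</p>" :: String)
```

Loaded in GHCi, `viaLiteral == viaPack` can be evaluated to compare the two. Whether GHC optimises the two forms identically is a separate question that this sketch does not answer.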
You can try out lots of different variants on that benchmark. I did that
already, and found this to be the fastest version.

Fair enough. Now that I think about it, I recall once trying to have
pretty generate Text values rather than String for graphviz (by using
fullRender, so it was still using String under the hood until it came
time to render) and it too was much slower than String (unfortunately,
I didn't record a patch with these changes so I can't just go back and
play with it anymore as I reverted them all :s).

Maybe Bryan can chime in with some best-practices for using Text?

Here's my guess at an explanation for what's happening in my benchmark: text
will clearly beat String in memory usage, that's what it's designed
for. However, the compiler is still generating String values which are being
encoded to Text at runtime.
Now, this is the same process for bytestrings. However, bytestrings never
have to be converted again: the IO routines simply read the byte buffer. In
the case of text, however, the Text value must be encoded again (as UTF-8)
into a bytestring before it can be written.
In other words, here's what I think the three different benchmarks are
really doing:
* String: generates a list of Strings, passes each String to a relatively
inefficient IO routine.
* ByteString: encodes Strings one by one into ByteStrings, generates a list
of these ByteStrings, and passes each ByteString to a very efficient IO
routine.
* Text: encodes Strings one by one into Texts, generates a list of these
Texts, calls a UTF-8 encoding function to encode each Text into a
ByteString, and passes each resulting ByteString to a very efficient IO
routine.
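
The three paths described above can be sketched roughly as follows. This is not the actual benchmark code: the fragments list and all function names are hypothetical stand-ins, and the sketch only shows which conversions each path performs, not how fast they are.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical stand-in for the pieces of output the benchmark produces.
fragments :: [String]
fragments = ["<html>", "<body>", "Hello", "</body>", "</html>"]

-- String path: each String goes straight to the String-based IO routine.
writeStrings :: [String] -> IO ()
writeStrings = mapM_ putStr

-- ByteString path: pack each String once, then write the bytes as they are.
toByteStrings :: [String] -> [B.ByteString]
toByteStrings = map BC.pack

writeByteStrings :: [String] -> IO ()
writeByteStrings = mapM_ B.putStr . toByteStrings

-- Text path: pack each String into a Text, then encode every Text to a
-- UTF-8 ByteString before it can be written.
textToByteStrings :: [String] -> [B.ByteString]
textToByteStrings = map (TE.encodeUtf8 . T.pack)

writeTexts :: [String] -> IO ()
writeTexts = mapM_ B.putStr . textToByteStrings
```

For ASCII-only input such as the fragments above, toByteStrings and textToByteStrings produce the same bytes; the Text path simply takes one more conversion step to get there.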
In the case of ASCII data to be output as UTF-8, using the
Data.ByteString.Char8.pack function will most likely always be the most
efficient choice, and thus it seems like something BlazeHtml should support.
I'm considering releasing a Hamlet 0.3 based entirely on UTF-8 encoded
ByteStrings, but I'd also like to hear from Bryan about this.
Michael

On Thu, May 27, 2010 at 10:53 AM, Michael Snoyman wrote:

In other words, here's what I think the three different benchmarks are
really doing:
* String: generates a list of Strings, passes each String to a relatively
inefficient IO routine.
* ByteString: encodes Strings one by one into ByteStrings, generates a list
of these ByteStrings, and passes each ByteString to a very efficient IO
routine.
* Text: encodes Strings one by one into Texts, generates a list of these
Texts, calls a UTF-8 encoding function to encode each Text into a
ByteString, and passes each resulting ByteString to a very efficient IO
routine.

If Text used UTF-8 internally rather than UTF-16 we could create Texts from
string literals much more efficiently, in the same manner as done in
Char8.pack for bytestrings:

    {-# RULES
    "FPS pack/packAddress" forall s .
        pack (unpackCString# s) = inlinePerformIO (B.unsafePackAddress s)
      #-}

This rule skips the creation of an intermediate String when packing a string
literal by having the created ByteString point directly to the memory GHC
allocates (outside the heap) for the string literal. This rule could be added
directly to a builder monoid for lazy Texts so that no copying is done at all.
In addition, if Text was internally represented using UTF-8, encodeUtf8 would
be free.

Johan
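
What the rewrite rule achieves can be imitated by hand with a primitive string literal. The sketch below is only an illustration, with hypothetical names literalDirect and literalViaString: it uses unsafePackAddress from Data.ByteString.Unsafe on an Addr# literal (enabled by MagicHash), so the ByteString points at the static, NUL-terminated bytes GHC emits for the literal instead of being built from an intermediate String.

```haskell
{-# LANGUAGE MagicHash #-}
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.ByteString.Unsafe (unsafePackAddress)

-- "..."# is a primitive Addr# literal: NUL-terminated bytes in static memory.
-- unsafePackAddress wraps that memory without copying it; the length is
-- found by scanning for the terminating NUL, so the literal must not
-- contain embedded NUL bytes.
literalDirect :: IO B.ByteString
literalDirect = unsafePackAddress "<p>Hello</p>"#

-- The ordinary route: conceptually a String that is then packed.
literalViaString :: B.ByteString
literalViaString = BC.pack "<p>Hello</p>"
```

Both values hold the same bytes; the difference is only in how they are constructed. The result of unsafePackAddress must be treated as read-only, since it aliases the program's static data.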