On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman <michael@snoyman.com> wrote:
On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <johan.tibell@gmail.com> wrote:
 
Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:

http://www.snoyman.com/blog/entry/bigtable-benchmarks/
http://www.snoyman.com/blog/entry/optimizing-hamlet/

Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately to blaze-html/blaze-builder. It could be that these were correctable flaws in text that have nothing to do with UTF-16; however, it will be difficult to produce a benchmark that isolates the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact, since they wouldn't benefit from Bryan's fusion logic.

Those are great. As Bryan mentioned, we've already improved performance, and I think I know how to improve it further.

I appreciate that it's difficult to isolate the UTF-8/UTF-16 divide. The approach we're taking at the moment is to look at benchmarks, improve performance, and repeat until we can't improve any further. We might eventually hit a benchmark where the performance difference between bytestring and text can't be explained or fixed by anything other than the internal encoding. That would be strong evidence that we should try switching the internal encoding. We haven't seen any such benchmark yet.

As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find anywhere that input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.
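For what it's worth, a check like that is cheap to express with text itself. A minimal sketch, assuming a text version that exports decodeUtf8' from Data.Text.Encoding:

    import qualified Data.ByteString as B
    import qualified Data.Text.Encoding as TE

    -- Accept a ByteString only if it is well-formed UTF-8.
    -- decodeUtf8' returns Left instead of throwing on bad input.
    validUtf8 :: B.ByteString -> Bool
    validUtf8 bs = case TE.decodeUtf8' bs of
      Left _  -> False
      Right _ -> True

Of course, validating every input would cost some of the throughput blaze is presumably chasing, which may be exactly why it's skipped.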
 
I don't see any reason why Bryan wouldn't accept a UTF-8 patch if it were faster on some set of benchmarks that we agree on (starting with the ones already in the library).

I think that's the main issue, and one that Duncan nailed on the head: we have to think about which benchmarks are important. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about, say, the speed of splitting texts. My (probably uneducated) guess is that UTF-16 tends to be faster for many in-memory operations, since almost all characters are represented in 16 bits, while the big wins for UTF-8 are in reading UTF-8 data, rendering UTF-8 data, and decreased memory usage. But as I said, that's an (uneducated) guess.
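One way to put numbers on that guess would be something like the following criterion sketch (untested; "corpus.txt" is a placeholder, and the operations measured are just examples):

    import Criterion.Main (bench, bgroup, defaultMain, nf)
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      -- Placeholder corpus; it should be valid UTF-8, ideally
      -- with plenty of non-ASCII characters.
      bs <- B.readFile "corpus.txt"
      let t = TE.decodeUtf8 bs
      defaultMain
        [ bgroup "in-memory"
            [ bench "Text.toUpper" (nf T.toUpper t) ]
        , bgroup "encode-decode"
            [ bench "decodeUtf8" (nf TE.decodeUtf8 bs)
            , bench "encodeUtf8" (nf TE.encodeUtf8 t)
            ]
        ]

Comparing the in-memory group against the encode/decode group, on both ASCII-heavy and non-ASCII corpora, would show where each representation wins.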

I agree. Let's create some more benchmarks.

For example, lately I've been working on a benchmark, inspired by a real-world problem, where I iterate over the lines of a ~500 MB UTF-8-encoded file, insert each line into a Data.Map, and do a bunch of further processing on it (such as splitting the strings into words). This tests text I/O throughput, memory overhead, performance of string comparison, etc.
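In outline it looks something like this (a lazy I/O sketch; the file name is a placeholder, and word counting stands in for the further processing):

    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Map as M
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text.Lazy.Encoding as TLE

    main :: IO ()
    main = do
      -- Read lazily and decode explicitly as UTF-8, so the
      -- benchmark doesn't depend on the locale encoding.
      bytes <- BL.readFile "big-utf8-file.txt"
      let ls     = TL.lines (TLE.decodeUtf8 bytes)
          -- Insert every line into a Map, counting duplicates.
          freq   = M.fromListWith (+) [ (l, 1 :: Int) | l <- ls ]
          -- Further processing: split each distinct line into words.
          nWords = sum [ length (TL.words l) | l <- M.keys freq ]
      print (M.size freq, nWords)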

We already have benchmarks for reading files (in UTF-8) in several different ways (lazy I/O and iteratee-style folds).

Boil down the things you care about into a self-contained benchmark and send it to this list, or put it somewhere where we can retrieve it.

Cheers,
Johan