Hi Ketil,

On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde <ketil@malde.org> wrote:
> Johan Tibell <johan.tibell@gmail.com> writes:
>
>> It's not clear to me that using UTF-16 internally does make Data.Text
>> noticeably slower.
>
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> 3Gbyte file (the human genome, say) into a computer with 4Gbytes of
> RAM, UTF-16 will be slower than UTF-8.  Many applications will get away
> with streaming over data, retaining only a small part, but some won't.

I'm not sure this is a great example, as genome data is probably much better stored in a vector, using a few bits per "letter". I agree that whenever one data structure fits in the available RAM and another doesn't, the smaller one will win. I just don't know whether this particular case is worth weeks' worth of optimization work. That's why I'd like to see benchmarks for more idiomatic use cases.
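As an aside, a bit-packed representation of that kind is easy to sketch. This is a hypothetical 2-bit encoding (A=0, C=1, G=2, T=3), four bases per byte, not anything Data.Text or any existing genome library does:

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8)

-- Hypothetical 2-bit code for the four nucleotides.
code :: Char -> Word8
code 'A' = 0
code 'C' = 1
code 'G' = 2
code 'T' = 3
code c   = error ("not a nucleotide: " ++ [c])

decode :: Word8 -> Char
decode 0 = 'A'
decode 1 = 'C'
decode 2 = 'G'
decode 3 = 'T'
decode _ = error "impossible: only 2-bit values are produced"

-- Pack four bases into each byte, low bits first.  The last byte
-- may be partial; a real implementation would also store the
-- sequence length (here the caller passes it to 'unpack').
pack :: String -> [Word8]
pack [] = []
pack s  =
  let (chunk, rest) = splitAt 4 s
  in foldl (\acc (i, c) -> acc .|. (code c `shiftL` (2 * i)))
           0 (zip [0 ..] chunk)
     : pack rest

unpack :: Int -> [Word8] -> String
unpack n ws = take n [ decode ((w `shiftR` (2 * i)) .&. 3)
                     | w <- ws, i <- [0 .. 3] ]
```

At 2 bits per base this stores a 3-Gbase sequence in roughly 750 Mbytes, versus 3 Gbytes for any byte-per-character text representation, which is the point about picking the right data structure rather than the right text encoding.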
 
> In other cases (e.g. processing CJK text, and perhaps also
> non-Latin1 text), I'm sure it'll be faster - but my (still
> unsubstantiated) guess is that the difference will be much smaller, and
> it'll be a case of winning some and losing some - and I'd also
> conjecture that having 3Gb of "real" text (i.e. natural language, as
> opposed to text-formatted data) is rare.

I would like to verify this guess. In my experience, it's really hard to predict which changes will lead to a noticeable performance improvement; I'm probably wrong more often than I'm right.
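The size part of the guess, at least, is easy to check from the encoding rules alone. A minimal sketch, computing per-code-point byte counts from the UTF-8 ranges (RFC 3629) and the UTF-16 surrogate scheme, rather than from Data.Text itself:

```haskell
import Data.Char (ord)

-- Bytes one code point occupies in UTF-8.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1  -- ASCII
  | n < 0x800   = 2  -- most Latin/Greek/Cyrillic, etc.
  | n < 0x10000 = 3  -- rest of the BMP, including CJK
  | otherwise   = 4  -- astral planes
  where n = ord c

-- Bytes one code point occupies in UTF-16.
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2  -- single code unit
  | otherwise       = 4  -- surrogate pair

encodedSize :: (Char -> Int) -> String -> Int
encodedSize f = sum . map f
```

For ASCII-heavy text UTF-8 wins 1:2 per character, while for CJK text in the BMP UTF-16 wins 2:3, which matches the "winning some and losing some" expectation; speed, of course, still needs real benchmarks.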

Cheers,
Johan