
Hi Ketil,

On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde wrote:

> Johan Tibell writes:
>
>> It's not clear to me that using UTF-16 internally does make
>> Data.Text noticeably slower.
>
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit
> a 3 Gbyte file (the Human genome, say¹) into a computer with 4 Gbytes
> of RAM, UTF-16 will be slower than UTF-8. Many applications will get
> away with streaming over data, retaining only a small part, but some
> won't.

I'm not sure this is a great example, as genome data is probably much
better stored in a vector (using a few bits per "letter"). I agree
that whenever one data structure fits in the available RAM and another
doesn't, the smaller one will win. I just don't know whether this case
is worth spending weeks' worth of work optimizing for. That's why I'd
like to see benchmarks for more idiomatic use cases.
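
Something like this rough sketch is what I have in mind (just an
illustration; it assumes a plain A/C/G/T alphabet, and the names and
the four-bases-per-byte layout are made up):

    -- Illustrative only: a hand-rolled 2-bit packing, not from any
    -- library.  Pack nucleotides two bits each, four per Word8, into
    -- an unboxed vector.
    import Data.Bits (shiftL, shiftR, (.&.), (.|.))
    import Data.Word (Word8)
    import qualified Data.Vector.Unboxed as U

    -- 2-bit code for each base (assumes the input really is A/C/G/T).
    code :: Char -> Word8
    code 'A' = 0
    code 'C' = 1
    code 'G' = 2
    code 'T' = 3
    code c   = error ("unexpected base: " ++ [c])

    -- Pack a nucleotide string, four bases per byte.
    packBases :: String -> U.Vector Word8
    packBases = U.fromList . go
      where
        go [] = []
        go cs = let (chunk, rest) = splitAt 4 cs
                in foldl (\acc (i, c) -> acc .|. (code c `shiftL` (2 * i)))
                         0 (zip [0 ..] chunk) : go rest

    -- Look up the base at a given position.
    baseAt :: U.Vector Word8 -> Int -> Char
    baseAt v i =
      "ACGT" !! fromIntegral
        ((v U.! (i `div` 4) `shiftR` (2 * (i `mod` 4))) .&. 3)

That gets you down to roughly 2 bits per base, instead of 8 bits per
character in UTF-8 or 16 in UTF-16.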

> In other cases (e.g. processing CJK text, and perhaps also
> non-Latin-1 text), I'm sure it'll be faster - but my (still
> unsubstantiated) guess is that the difference will be much smaller,
> and it'll be a case of winning some and losing some - and I'd also
> conjecture that having 3 GB of "real" text (i.e. natural language, as
> opposed to text-formatted data) is rare.

I would like to verify this guess. In my personal experience it's
really hard to predict which changes will lead to a noticeable
performance improvement; I'm probably wrong more often than I'm right.
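
As a first, throwaway data point (this only measures encoded size, not
speed, and the sample strings are made up), comparing the two
encodings is easy:

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as E

    -- Bytes needed to hold a Text value under each encoding.
    sizes :: T.Text -> (Int, Int)
    sizes t = (B.length (E.encodeUtf8 t), B.length (E.encodeUtf16LE t))

    main :: IO ()
    main = do
      -- Mostly-ASCII text: UTF-8 is half the size of UTF-16.
      print (sizes "The quick brown fox jumps over the lazy dog")
      -- CJK text: UTF-16 is smaller (2 vs. 3 bytes per character).
      print (sizes "日本語のテキストの例です")

The speed side of the question is the part I'd want to measure with
real benchmarks rather than guess at.

Cheers,

Johan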