
Hi Ketil,

On Tue, Aug 17, 2010 at 12:09 PM, Ketil Malde wrote:

> Johan Tibell writes:
>
>> It's not clear to me that using UTF-16 internally does make
>> Data.Text noticeably slower.
>
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit
> a 3 Gbyte file (the Human genome, say¹) into a computer with 4 Gbytes
> of RAM, UTF-16 will be slower than UTF-8. Many applications will get
> away with streaming over data, retaining only a small part, but some
> won't.

I'm not sure this is a great example, as genome data is probably much
better stored in a vector (using a few bits per "letter"). I agree
that whenever one data structure fits in the available RAM and another
doesn't, the smaller one will win. I just don't know whether this case
is worth spending weeks' worth of work optimizing for. That's why I'd
like to see benchmarks for more idiomatic use cases.
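
Something like this rough sketch is what I have in mind (just an
illustration; it assumes a plain A/C/G/T alphabet, and the names and
the four-bases-per-byte layout are made up):

    -- Illustrative only: a hand-rolled 2-bit packing, not from any
    -- library.  Pack nucleotides two bits each, four per Word8, into
    -- an unboxed vector.
    import Data.Bits (shiftL, shiftR, (.&.), (.|.))
    import Data.Word (Word8)
    import qualified Data.Vector.Unboxed as U

    -- 2-bit code for each base (assumes the input really is A/C/G/T).
    code :: Char -> Word8
    code 'A' = 0
    code 'C' = 1
    code 'G' = 2
    code 'T' = 3
    code c   = error ("unexpected base: " ++ [c])

    -- Pack a nucleotide string, four bases per byte.
    packBases :: String -> U.Vector Word8
    packBases = U.fromList . go
      where
        go [] = []
        go cs = let (chunk, rest) = splitAt 4 cs
                in foldl (\acc (i, c) -> acc .|. (code c `shiftL` (2 * i)))
                         0 (zip [0 ..] chunk) : go rest

    -- Look up the base at a given position.
    baseAt :: U.Vector Word8 -> Int -> Char
    baseAt v i =
      "ACGT" !! fromIntegral
        ((v U.! (i `div` 4) `shiftR` (2 * (i `mod` 4))) .&. 3)

That gets you down to roughly 2 bits per base, instead of 8 bits per
character in UTF-8 or 16 in UTF-16.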

> In other cases (e.g. processing CJK text, and perhaps also
> non-Latin-1 text), I'm sure it'll be faster - but my (still
> unsubstantiated) guess is that the difference will be much smaller,
> and it'll be a case of winning some and losing some - and I'd also
> conjecture that having 3 GB of "real" text (i.e. natural language, as
> opposed to text-formatted data) is rare.

I would like to verify this guess. In my personal experience it's
really hard to predict which changes will lead to a noticeable
performance improvement; I'm probably wrong more often than I'm right.
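
As a first, throwaway data point (this only measures encoded size, not
speed, and the sample strings are made up), comparing the two
encodings is easy:

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as E

    -- Bytes needed to hold a Text value under each encoding.
    sizes :: T.Text -> (Int, Int)
    sizes t = (B.length (E.encodeUtf8 t), B.length (E.encodeUtf16LE t))

    main :: IO ()
    main = do
      -- Mostly-ASCII text: UTF-8 is half the size of UTF-16.
      print (sizes "The quick brown fox jumps over the lazy dog")
      -- CJK text: UTF-16 is smaller (2 vs. 3 bytes per character).
      print (sizes "日本語のテキストの例です")

The speed side of the question is the part I'd want to measure with
real benchmarks rather than guess at.

Cheers,

Johan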