
Johan Tibell writes:

> It's not clear to me that using UTF-16 internally does make Data.Text noticeably slower.
I haven't benchmarked it, but I'm fairly sure that if you try to fit a 3 Gbyte file (the human genome, say¹) into a computer with 4 Gbytes of RAM, UTF-16 will be slower than UTF-8. Many applications will get away with streaming over the data, retaining only a small part, but some won't.

In other cases (e.g. processing CJK text, and perhaps also non-Latin-1 text), I'm sure UTF-16 will be faster, but my (still unsubstantiated) guess is that the difference will be much smaller; it will be a case of winning some and losing some. I'd also conjecture that having 3 Gbytes of "real" text (i.e. natural language, as opposed to text-formatted data) is rare.

I think that *IF* we are aiming for a single, grand, unified text library to Rule Them All, it needs to use UTF-8. Alternatively, we can have different libraries with different representations for different purposes, where you get another few percent of juice by switching to the most appropriate one. Currently the latter approach looks to be in favor, so if we can't have one single library, let us at least aim for a set of libraries with consistent interfaces and optimal performance.

Data.Text is great for UTF-16, and I'd like to have something similar for UTF-8. That's all I'm trying to say.

-k
--
If I haven't seen further, it is by standing in the footprints of giants
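[As a rough illustration of the size argument above (not part of the original post), the sketch below uses encodeUtf8 and encodeUtf16LE from Data.Text.Encoding to compare how many bytes the same Text occupies in each encoding. For ASCII-heavy data such as a genome file, UTF-16 roughly doubles the footprint; for CJK text, UTF-16 comes out smaller.]

{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString    as B
import qualified Data.Text          as T
import qualified Data.Text.Encoding as TE

-- Byte counts of a Text value when serialised as (UTF-8, UTF-16LE).
sizes :: T.Text -> (Int, Int)
sizes t = (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  let ascii = T.replicate 1000 "ACGT"      -- genome-like ASCII data
      cjk   = T.replicate 1000 "漢字かな"  -- CJK sample text
  print (sizes ascii)  -- roughly (4000, 8000): UTF-16 doubles ASCII data
  print (sizes cjk)    -- roughly (12000, 8000): UTF-16 wins for CJK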