
On Tue, Aug 17, 2010 at 9:08 AM, Ketil Malde wrote:
Benedikt Huber writes:
Despite all this, I think the performance of the text package is very promising, and I hope it will improve further!
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
It's not clear to me that using UTF-16 internally makes Data.Text noticeably slower. If we could get conclusive evidence that using UTF-16 hurts performance, we could look into changing the internal representation (a major undertaking). What Bryan and I need are benchmarks showing where Data.Text performs poorly compared to String or ByteString, so we can investigate the cause(s).

Hypotheses are a good starting point for performance improvements, but they're not enough. We need benchmarks, and people looking at profiling and compiler output, to really understand what's going on. For example, how many know that the Handle implementation copies the input first into a mutable buffer and then into a Text value, for reads smaller than the buffer size (8k if I remember correctly)? One of these copies could be avoided. How do we know that it's UTF-16 that's our current performance bottleneck and not this extra copy? We need to benchmark, change the code, and then benchmark again.

Perhaps the outcome of all the benchmarking and investigation is indeed that UTF-16 is a problem; then we can change the internal encoding. But there are other possibilities, like poorly laid out branches in the generated code. We need to understand what's going on if we are to make progress.

A large fraction - probably most - of textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).
For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, "real" text only making up a few percent of this. The combined (all languages) Wikipedia is 2G words, probably less than 20GB.
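To make the size side of this argument concrete, here is a minimal sketch that compares the encoded size of the same text under UTF-8 and UTF-16. It only measures bytes, not speed, so it says nothing by itself about whether Data.Text is slower. The helper name `encodedSizes` and both sample strings are made up for illustration; `encodeUtf8` and `encodeUtf16LE` are the functions from Data.Text.Encoding in the text package.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Encoded size in bytes as (UTF-8, UTF-16 little endian without BOM).
-- Hypothetical helper, not part of the text package.
encodedSizes :: T.Text -> (Int, Int)
encodedSizes t =
  (B.length (TE.encodeUtf8 t), B.length (TE.encodeUtf16LE t))

main :: IO ()
main = do
  -- Hypothetical ASCII-only record, as found in tab-separated data files:
  -- 15 characters, one byte each in UTF-8, two bytes each in UTF-16.
  print (encodedSizes "chr1\t12345\tACGT")  -- prints (15,30)
  -- Hypothetical non-ASCII sample: 3 characters in the Basic Multilingual
  -- Plane, three bytes each in UTF-8, two bytes each in UTF-16.
  print (encodedSizes "日本語")              -- prints (9,6)
```

For ASCII-only data the UTF-16 form is exactly twice the size of the UTF-8 form, which is the memory cost being pointed at here; whether that translates into a measurable slowdown is a separate question that needs benchmarks.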
I think this is an important observation.

Cheers,
Johan
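As a starting point for the kind of benchmark asked for above, the following is a rough sketch that runs the same operation (counting lines) over String, Text and ByteString. It assumes a recent version of the criterion package (`defaultMain`, `env`, `bgroup`, `bench`, `nf`); the input `sampleString` and the benchmark names are invented for illustration. It only exercises pure, in-memory operations on ASCII data, so it would not reveal the extra Handle copy mentioned above; that would need a separate I/O benchmark.

```haskell
module Main (main) where

import Criterion.Main (bench, bgroup, defaultMain, env, nf)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- Hypothetical ASCII-only input: 10000 short tab-separated lines.
sampleString :: String
sampleString = concat (replicate 10000 "id\tname\tvalue\n")

main :: IO ()
main = defaultMain
  [ -- 'env' fully evaluates all three inputs before timing starts, so
    -- conversion costs are not charged to any of the benchmarks.
    env (pure (sampleString, T.pack sampleString, BC.pack sampleString)) $
      \ ~(s, t, b) ->
        bgroup "count-lines"
          [ bench "String"     $ nf (length . lines) s
          , bench "Text"       $ nf (length . T.lines) t
          , bench "ByteString" $ nf (length . BC.lines) b
          ]
  ]
```

A single operation on a single input proves little. The useful version of this would cover the operations people actually report as slow, on realistic data, and be rerun after each change to the library.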