
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
In the first iteration of the Text package, UTF-16 was chosen because it struck a nice balance of arithmetic overhead and space. The arithmetic for UTF-8 had a serious performance impact in situations where the entire document was outside ASCII (e.g. a Russian or Arabic document), whereas UTF-16 was still relatively compact compared to both the UTF-32 and String alternatives. This obviously does not represent your use case, though. I don't know if your use case is the more common one (it seems likely).

The underlying principles of Text should work fine with UTF-8. The library has changed a lot since its original writing (thanks to some excellent tuning and maintenance by bos), including some more efficient binary arithmetic, so the situation may have changed with respect to the performance limitations of UTF-8, or there may be room for both a UTF-8 and a UTF-16 version. Any takers for implementing a UTF-8 version and comparing the two?
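To make the space side of that trade-off concrete, here's a minimal sketch (assuming the text and bytestring packages, using encodeUtf8/encodeUtf16LE/encodeUtf32LE from Data.Text.Encoding) comparing the encoded sizes of an ASCII string and a Cyrillic one:

    import qualified Data.ByteString    as B
    import qualified Data.Text          as T
    import qualified Data.Text.Encoding as TE

    -- Encoded byte counts under UTF-8, UTF-16 and UTF-32 respectively.
    sizes :: T.Text -> (Int, Int, Int)
    sizes t = ( B.length (TE.encodeUtf8    t)
              , B.length (TE.encodeUtf16LE t)
              , B.length (TE.encodeUtf32LE t) )

    main :: IO ()
    main = do
      print (sizes (T.pack "hello, world"))  -- (12,24,48): ASCII doubles in size under UTF-16
      print (sizes (T.pack "привет, мир"))   -- (20,22,44): Cyrillic is ~2 bytes/char in both UTF-8 and UTF-16

For BMP-heavy scripts like Russian, UTF-8 and UTF-16 end up roughly the same size, while pure-ASCII data pays a 2x penalty under UTF-16.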
A large fraction - probably most - of textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).
For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, with "real" text making up only a few percent of it. The combined (all-languages) Wikipedia is about 2G words, probably less than 20GB.
Being agnostic about string encoding - viz. treating it as bytes - works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such.
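A rough sketch of that "bytes for the bulk, Unicode only where it matters" approach might look like the following (the tab-separated record layout and field names are made up for illustration; it assumes bytestring and text, with decodeUtf8With/lenientDecode for tolerant decoding of the one real text field):

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8    as B
    import qualified Data.Text                as T
    import qualified Data.Text.Encoding       as TE
    import           Data.Text.Encoding.Error (lenientDecode)

    -- A made-up tab-separated record: two ASCII-only fields kept as raw
    -- bytes, plus one label field that may contain real Unicode.
    data Record = Record
      { recId    :: B.ByteString
      , recScore :: B.ByteString
      , recLabel :: T.Text
      }

    parseRecord :: B.ByteString -> Maybe Record
    parseRecord line =
      case B.split '\t' line of
        [i, s, lbl] -> Just (Record i s (TE.decodeUtf8With lenientDecode lbl))
        _           -> Nothing

    main :: IO ()
    main = do
      let line = B.intercalate "\t" ["42", "3.14", TE.encodeUtf8 "naïve label"]
      print (recLabel <$> parseRecord line)

The bulk of the parsing stays on cheap bytestrings, and only the fields that are actually text pay for Unicode decoding.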
Is your point that ASCII characters take up the same amount of space (i.e. 16 bits) as higher code points? Do you have any comparisons that quantify how much this affects your ability to process text in real terms? Does it make it too slow? Infeasible memory-wise?