
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16 makes it inefficient for many purposes.
In the first iteration of the Text package, UTF-16 was chosen because it struck a nice balance of arithmetic overhead and space. The arithmetic for UTF-8 had a serious performance impact in situations where the entire document was outside ASCII (e.g. a Russian or Arabic document), whereas UTF-16 was still relatively compact compared to both the UTF-32 and String alternatives. This obviously does not represent your use case, though. I don't know if your use case is the more common one (it seems likely).

The underlying principles of Text should work fine with UTF-8. The library has changed a lot since its original writing (thanks to some excellent tuning and maintenance by bos), including some more efficient binary arithmetic, so the situation may have changed with respect to the performance limitations of UTF-8, or there may be room for both a UTF-8 and a UTF-16 version. Any takers for implementing a UTF-8 version and comparing the two?
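To make the space side of that trade-off concrete, here's a minimal sketch (assuming the text and bytestring packages, using encodeUtf8/encodeUtf16LE/encodeUtf32LE from Data.Text.Encoding) comparing the encoded sizes of an ASCII string and a Cyrillic one:

    import qualified Data.ByteString    as B
    import qualified Data.Text          as T
    import qualified Data.Text.Encoding as TE

    -- Encoded byte counts under UTF-8, UTF-16 and UTF-32 respectively.
    sizes :: T.Text -> (Int, Int, Int)
    sizes t = ( B.length (TE.encodeUtf8    t)
              , B.length (TE.encodeUtf16LE t)
              , B.length (TE.encodeUtf32LE t) )

    main :: IO ()
    main = do
      print (sizes (T.pack "hello, world"))  -- (12,24,48): ASCII doubles in size under UTF-16
      print (sizes (T.pack "привет, мир"))   -- (20,22,44): Cyrillic is ~2 bytes/char in both UTF-8 and UTF-16

For BMP-heavy scripts like Russian, UTF-8 and UTF-16 end up roughly the same size, while pure-ASCII data pays a 2x penalty under UTF-16.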
A large fraction - probably most - of textual data isn't natural language text, but data formatted in textual form, and these formats are typically restricted to ASCII (except for a few text fields).
For instance, a typical project for me might be 10-100GB of data, mostly in various text formats, with "real" text making up only a few percent of it. The combined (all-languages) Wikipedia is about 2G words, probably less than 20GB.
Being agnostic about string encoding - viz. treating it as bytes - works okay, but it would be nice to allow Unicode in the bits that actually are text, like string fields and labels and such.
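A rough sketch of that "bytes for the bulk, Unicode only where it matters" approach might look like the following (the tab-separated record layout and field names are made up for illustration; it assumes bytestring and text, with decodeUtf8With/lenientDecode for tolerant decoding of the one real text field):

    {-# LANGUAGE OverloadedStrings #-}
    import qualified Data.ByteString.Char8    as B
    import qualified Data.Text                as T
    import qualified Data.Text.Encoding       as TE
    import           Data.Text.Encoding.Error (lenientDecode)

    -- A made-up tab-separated record: two ASCII-only fields kept as raw
    -- bytes, plus one label field that may contain real Unicode.
    data Record = Record
      { recId    :: B.ByteString
      , recScore :: B.ByteString
      , recLabel :: T.Text
      }

    parseRecord :: B.ByteString -> Maybe Record
    parseRecord line =
      case B.split '\t' line of
        [i, s, lbl] -> Just (Record i s (TE.decodeUtf8With lenientDecode lbl))
        _           -> Nothing

    main :: IO ()
    main = do
      let line = B.intercalate "\t" ["42", "3.14", TE.encodeUtf8 "naïve label"]
      print (recLabel <$> parseRecord line)

The bulk of the parsing stays on cheap bytestrings, and only the fields that are actually text pay for Unicode decoding.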
Is your point that ASCII characters take up the same amount of space (i.e. 16 bits) as higher code points? Do you have any comparisons that quantify how much this affects your ability to process text in real terms? Does it make it too slow? Infeasible memory-wise?