On Tue, Aug 17, 2010 at 9:30 PM, Donn Cave <donn@avvanta.com> wrote:
Quoth John Millikin <jmillikin@gmail.com>,

> Ruby, which has an enormous Japanese userbase, solved the problem by
> essentially defining Text = (Encoding, ByteString), and then
> re-implementing text logic for each encoding. This allows very
> efficient operation with every possible encoding, at the cost of
> increased complexity (caching decoded characters, multi-byte handling,
> etc.).
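
(Just to make the shape of that design concrete: a rough Haskell sketch
of such a tagged representation might look like the following. The
Encoding constructors and the functions are purely illustrative, not any
existing library's API.)

    import qualified Data.ByteString as B
    import Data.Bits ((.&.))

    data Encoding = Latin1 | UTF8 | UTF16LE
      deriving (Eq, Show)

    -- A Ruby-style string: raw bytes tagged with their encoding.
    data TaggedText = TaggedText
      { ttEncoding :: Encoding
      , ttBytes    :: B.ByteString
      }

    -- Every operation dispatches on the encoding; the per-encoding
    -- complexity (multi-byte handling, cached offsets, ...) lives here.
    charLength :: TaggedText -> Int
    charLength (TaggedText Latin1  bs) = B.length bs          -- one byte per character
    charLength (TaggedText UTF8    bs) = utf8Length bs
    charLength (TaggedText UTF16LE bs) = B.length bs `div` 2  -- ignoring surrogate pairs

    -- Count UTF-8 code points by counting non-continuation bytes.
    utf8Length :: B.ByteString -> Int
    utf8Length = B.foldl' (\n w -> if w .&. 0xC0 /= 0x80 then n + 1 else n) 0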

Ruby actually comes from the CJK world in a way, doesn't it?

Even if efficient per-encoding manipulation is a tough nut to crack,
it at least avoids the fixed cost of bulk decoding, so an application
designer doesn't need to weigh the pay-off of a correct text approach
against 'binary'/ASCII, and the language/library designer doesn't need
to think about whether genome data is a representative case, etc.

Remember that the cost of decoding is O(n) no matter what encoding is
used internally, as you always have to validate when going from
ByteString to Text. If the external and internal encodings don't match,
then you also have to copy the bytes into a new buffer, but that is
only one allocation (a pointer increment with a semi-space collector),
and the copy is cheap since the data is already in the cache.
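
A minimal sketch of that ByteString-to-Text step, using decodeUtf8' from
Data.Text.Encoding (assumed available in the installed text version; it
reports invalid input instead of throwing):

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import Data.Text.Encoding (decodeUtf8')

    -- Decoding walks every byte once (the O(n) validation), and the
    -- result lives in a freshly allocated Text buffer.
    toText :: B.ByteString -> Either String T.Text
    toText bs = case decodeUtf8' bs of
      Left err -> Left (show err)  -- invalid UTF-8 is reported, not passed through
      Right t  -> Right t

Even for input that is already known to be ASCII or valid UTF-8, this is
still an O(n) pass; that is the fixed cost being discussed, and it can't
be skipped safely.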

-- Johan