
On Tue, Aug 17, 2010 at 9:30 PM, Donn Cave wrote:

> Quoth John Millikin,
>> Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc.).
>
> Ruby actually comes from the CJK world in a way, doesn't it?
>
> Even if efficient per-encoding manipulation is a tough nut to crack, it at least avoids the fixed cost of bulk decoding, so an application designer doesn't need to think about the pay-off for a correct text approach vs. `binary'/ASCII, and the language/library designer doesn't need to think about whether genome data is a representative case, etc.
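(For readers following along in Haskell, a minimal sketch of the tagged-bytes representation John describes might look like the following. The type and function names are purely illustrative, not anything proposed in the thread.)

import Data.Bits ((.&.))
import qualified Data.ByteString as B

-- Sketch of the Ruby-style approach: keep the raw bytes together with a
-- tag saying how they are encoded, and dispatch on that tag in every
-- text operation.  (Names here are illustrative only.)
data Encoding = ASCII | UTF8 | UTF16LE | ShiftJIS | EUCJP
  deriving (Eq, Show)

data TaggedText = TaggedText
  { textEncoding :: Encoding      -- how textBytes is encoded
  , textBytes    :: B.ByteString  -- the undecoded payload
  } deriving (Eq, Show)

-- Each operation needs per-encoding logic, e.g. counting characters:
charCount :: TaggedText -> Int
charCount (TaggedText ASCII bs) = B.length bs
charCount (TaggedText UTF8  bs) =
  -- count the bytes that are not UTF-8 continuation bytes (10xxxxxx)
  B.length (B.filter (\w -> w .&. 0xC0 /= 0x80) bs)
charCount (TaggedText enc _) =
  error ("charCount: logic for " ++ show enc ++ " not written yet")

Every function grows one clause per supported encoding, which is where the complexity John mentions comes from.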
Remember that the cost of decoding is O(n) no matter what encoding is used internally, as you always have to validate when going from ByteString to Text. If the external and internal encodings don't match, then you also have to copy the bytes into a new buffer, but that is only one allocation (a pointer increment with a semi-space collector), and the copy is cheap since the data is already in cache.

-- Johan
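(To make the validation point concrete: a minimal sketch using the text package's Data.Text.Encoding. The example byte values and printed results in the comments are only illustrative.)

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Converting ByteString to Text is O(n) regardless of the internal
-- encoding: every byte has to be inspected to check that the input is
-- well-formed, and the decoded result is copied into a fresh buffer.
decodeStrict :: B.ByteString -> Either UnicodeException T.Text
decodeStrict = TE.decodeUtf8'   -- total: returns Left on malformed input

main :: IO ()
main = do
  let good = B.pack [0xE6, 0x97, 0xA5]   -- valid UTF-8 (U+65E5)
      bad  = B.pack [0xFF, 0xFE]         -- not valid UTF-8
  print (decodeStrict good)              -- Right "\26085"
  print (decodeStrict bad)               -- Left <decode error>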