
On Tue, Aug 17, 2010 at 9:30 PM, Donn Cave wrote:

> Quoth John Millikin,
>> Ruby, which has an enormous Japanese userbase, solved the problem by essentially defining Text = (Encoding, ByteString), and then re-implementing text logic for each encoding. This allows very efficient operation with every possible encoding, at the cost of increased complexity (caching decoded characters, multi-byte handling, etc.).
>
> Ruby actually comes from the CJK world in a way, doesn't it?
>
> Even if efficient per-encoding manipulation is a tough nut to crack, it at least avoids the fixed cost of bulk decoding, so an application designer doesn't need to think about the pay-off for a correct text approach vs. `binary'/ASCII, and the language/library designer doesn't need to think about whether genome data is a representative case, etc.
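(For readers following along in Haskell, a minimal sketch of the tagged-bytes representation John describes might look like the following. The type and function names are purely illustrative, not anything proposed in the thread.)

import Data.Bits ((.&.))
import qualified Data.ByteString as B

-- Sketch of the Ruby-style approach: keep the raw bytes together with a
-- tag saying how they are encoded, and dispatch on that tag in every
-- text operation.  (Names here are illustrative only.)
data Encoding = ASCII | UTF8 | UTF16LE | ShiftJIS | EUCJP
  deriving (Eq, Show)

data TaggedText = TaggedText
  { textEncoding :: Encoding      -- how textBytes is encoded
  , textBytes    :: B.ByteString  -- the undecoded payload
  } deriving (Eq, Show)

-- Each operation needs per-encoding logic, e.g. counting characters:
charCount :: TaggedText -> Int
charCount (TaggedText ASCII bs) = B.length bs
charCount (TaggedText UTF8  bs) =
  -- count the bytes that are not UTF-8 continuation bytes (10xxxxxx)
  B.length (B.filter (\w -> w .&. 0xC0 /= 0x80) bs)
charCount (TaggedText enc _) =
  error ("charCount: logic for " ++ show enc ++ " not written yet")

Every function grows one clause per supported encoding, which is where the complexity John mentions comes from.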
Remember that the cost of decoding is O(n) no matter what encoding is used internally, as you always have to validate when going from ByteString to Text. If the external and internal encodings don't match, then you also have to copy the bytes into a new buffer, but that is only one allocation (a pointer increment with a semi-space collector), and the copy is cheap since the data is already in cache.

-- Johan
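(To make the validation point concrete: a minimal sketch using the text package's Data.Text.Encoding. The example byte values and printed results in the comments are only illustrative.)

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Converting ByteString to Text is O(n) regardless of the internal
-- encoding: every byte has to be inspected to check that the input is
-- well-formed, and the decoded result is copied into a fresh buffer.
decodeStrict :: B.ByteString -> Either UnicodeException T.Text
decodeStrict = TE.decodeUtf8'   -- total: returns Left on malformed input

main :: IO ()
main = do
  let good = B.pack [0xE6, 0x97, 0xA5]   -- valid UTF-8 (U+65E5)
      bad  = B.pack [0xFF, 0xFE]         -- not valid UTF-8
  print (decodeStrict good)              -- Right "\26085"
  print (decodeStrict bad)               -- Left <decode error>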