
On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman wrote:

> When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.
Bear in mind that much of the data you're working with can't be readily trusted. UTF-8 coming from the filesystem, the network, and often the database may not be valid. The cost of validating it isn't all that different from the cost of converting it to UTF-16.

And of course the internals of Data.Text are all fusion-based, so much of the time you're not going to be allocating UTF-16 arrays at all, but instead creating a pipeline of characters that are manipulated in a tight loop. This eliminates a lot of the additional copying that bytestring has to do, for instance.

To give you an idea of how competitive Data.Text can be with C code, this is the system's wc command counting UTF-8 characters in a modestly large file:

    $ time wc -m huge.txt
    32443330
    real    0.728s

This is Data.Text performing the same task:

    $ time ./FileRead text huge.txt
    32443330
    real    0.697s
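
For concreteness, a minimal sketch of what such a character-counting program could look like with the lazy Data.Text API follows; it's an illustration rather than the actual FileRead benchmark, and the choice of lazy readFile and a single path argument is mine:

    import System.Environment (getArgs)
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text.Lazy.IO as TLIO

    -- Count characters (code points) in a file, like `wc -m`.
    main :: IO ()
    main = do
      [path] <- getArgs
      -- readFile decodes using the locale encoding (UTF-8 on the system
      -- described above); length then counts Chars rather than bytes.
      contents <- TLIO.readFile path
      print (TL.length contents)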