
On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman wrote:

> When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.
Bear in mind that much of the data you're working with can't be readily trusted. UTF-8 coming from the filesystem, the network, and often the database may not be valid. The cost of validating it isn't all that different from the cost of converting it to UTF-16.

And of course the internals of Data.Text are all fusion-based, so much of the time you're not going to be allocating UTF-16 arrays at all, but instead creating a pipeline of characters that are manipulated in a tight loop. This eliminates a lot of the additional copying that bytestring has to do, for instance.

To give you an idea of how competitive Data.Text can be with C code, this is the system's wc command counting UTF-8 characters in a modestly large file:

    $ time wc -m huge.txt
    32443330
    real    0.728s

This is Data.Text performing the same task:

    $ time ./FileRead text huge.txt
    32443330
    real    0.697s
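
For concreteness, a minimal sketch of what such a character-counting program could look like with the lazy Data.Text API follows; it's an illustration rather than the actual FileRead benchmark, and the choice of lazy readFile and a single path argument is mine:

    import System.Environment (getArgs)
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text.Lazy.IO as TLIO

    -- Count characters (code points) in a file, like `wc -m`.
    main :: IO ()
    main = do
      [path] <- getArgs
      -- readFile decodes using the locale encoding (UTF-8 on the system
      -- described above); length then counts Chars rather than bytes.
      contents <- TLIO.readFile path
      print (TL.length contents)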