
On Wed, Aug 18, 2010 at 2:39 PM, Johan Tibell wrote:
On Wed, Aug 18, 2010 at 2:12 AM, John Meacham wrote:
<ranty thing to follow> That said, there is never a reason to use UTF-16. It is a vestigial remnant from the brief period when it was thought 16 bits would be enough for the Unicode standard; any defense of it nowadays is after-the-fact justification for having accidentally standardized on it back in the day.
This is false. Text uses UTF-16 internally because early benchmarks indicated that it was faster. See Tom Harper's response to the other thread that Ketil spawned off this one.
Text continues to be UTF-16 today because
* no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
* no one has written a patch that converts Text to use UTF-8 internally.
I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.
Here's my response to the two points:
* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (which I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype, text will get fixed," which is quite underwhelming.
* Since the prevailing attitude has been such disregard for any facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meantime, Jasper has released blaze-builder, which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.

Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.

Michael