
On Tue, 2009-02-03 at 11:03 -0600, John Goerzen wrote:
Will there also be something to handle the UTF-16 BOM marker? I'm not sure what the best API for that is, since it may or may not be present, but it should be considered -- and could perhaps help autodetect encoding.
I think someone else mentioned this already, but utf16 (as opposed to utf16be/le) will use the BOM if it's present. I'm not quite sure what happens when you switch encoding; presumably it'll accept and consider a BOM at that point.
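To make the BOM case concrete, here is a minimal sketch. It assumes `hSetEncoding` and `utf16` are exported from `System.IO` as proposed in this thread; `readUtf16File` is a made-up helper name, not part of any library.

```haskell
import System.IO

-- Hypothetical helper: read a whole file as UTF-16, letting the decoder
-- use a leading BOM (if there is one) to pick the byte order.
readUtf16File :: FilePath -> IO String
readUtf16File path = do
  h <- openFile path ReadMode
  hSetEncoding h utf16          -- plain utf16, not utf16le/utf16be
  s <- hGetContents h
  length s `seq` hClose h       -- force the lazy read before closing
  return s
```

With `utf16le` or `utf16be` the byte order is fixed by the encoding name, so this sketch only applies to the plain `utf16` case.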
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
No. You only pay that penalty if you switch encoding. The standard case has no extra cost.
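The mid-stream switch discussed above might look roughly like this. It is only a sketch: it assumes `hSetEncoding`, `latin1` and `utf8` from `System.IO`, and `readHeaderThenBody` is a hypothetical helper invented for illustration.

```haskell
import System.IO

-- Hypothetical helper: read one header line as Latin-1, then switch the
-- same buffered Handle to UTF-8 for the rest of the stream.
readHeaderThenBody :: FilePath -> IO (String, String)
readHeaderThenBody path = do
  h <- openFile path ReadMode
  hSetEncoding h latin1
  header <- hGetLine h          -- buffering stays on
  hSetEncoding h utf8           -- the switch is where the re-decoding cost is paid
  body <- hGetContents h
  length body `seq` hClose h    -- force the lazy read before closing
  return (header, body)
```

A Handle that never calls `hSetEncoding` a second time takes none of the re-decoding path, which is the "standard case" referred to above.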
Performance is about 30% slower on "hGetContents >>= putStr" than before. I've profiled it, and about 25% of this is in doing the actual encoding/decoding; the rest is accounted for by the fact that we're shuffling around 32-bit chars rather than bytes in the Handle buffer, so there's not much we can do to improve this.
Does this mean that if we set the encoding to latin1, that we should see performance 5% worse than present?
No, I think that's 30% for latin1. The cost is not really the character conversion but the copying from a byte buffer via iconv to a char buffer.
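For anyone wanting to try the latin1 case themselves, the String-based copy being measured can be sketched as below. It assumes `hSetEncoding` and the `TextEncoding` type from `System.IO`; `copyAsString` is a hypothetical name, and the sketch measures nothing by itself.

```haskell
import System.IO

-- Hypothetical helper: the "hGetContents >>= putStr" style copy, with an
-- explicit encoding set on both ends. Every character goes through the
-- Handle's char buffer, whichever encoding is chosen.
copyAsString :: TextEncoding -> Handle -> Handle -> IO ()
copyAsString enc inH outH = do
  hSetEncoding inH enc
  hSetEncoding outH enc
  hGetContents inH >>= hPutStr outH
```

Calling `copyAsString latin1 stdin stdout` gives the latin1 variant asked about above.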
30% slower is a big deal, especially since we're not all that speedy now.
Bear in mind that's talking about the [Char] interface, and nobody using that is expecting great performance. We already have an API for getting big chunks of bytes out of a Handle; with the new Handle we'll also want something equivalent for a packed text representation. Hopefully we can get something nice with the new text package.

Duncan