
On Tue, 2009-02-03 at 17:39 -0600, John Goerzen wrote:
On Tue, Feb 03, 2009 at 10:56:13PM +0000, Duncan Coutts wrote:
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
No. You only pay that penalty if you switch encoding. The standard case has no extra cost.
I'm confused. I thought the standard case was conversion to the system's local encoding? How is that different than selecting the same encoding manually?
Sorry, I think we've been talking at cross purposes.
There always has to be *some* conversion from a 32-bit Char to the system's selection, right?
Yes. In text mode there is always some conversion going on. Internally there is a byte buffer and a char buffer (i.e. UTF-32).
What exactly do we have to do to avoid the penalty?
The penalty we're talking about here is not the cost of converting bytes to characters, it's the cost of switching which encoding the Handle is using. For example, you might read some HTTP headers in ASCII and then switch the Handle encoding to UTF-8 to read some XML. Switching the Handle encoding has a penalty: we have to discard the characters that we pre-decoded and re-decode the byte buffer in the new encoding. It's actually slightly more complicated, because we do not track exactly how the byte and character buffers relate to each other (it'd be too expensive in the normal cases), so to work out the relationship when switching encoding we have to re-decode all the way from the beginning of the current byte buffer. The point is that, as far as the performance of the ordinary code path goes, we get the ability to switch Handle encoding more or less for free. It has a cost in terms of code complexity. The simpler alternative design was that you would not be able to switch encoding on a read Handle that used any buffering at the character level without losing bytes. The performance penalty when switching encoding is the downside to the ordinary code path being fast.
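To make the headers-then-body case concrete, here is a minimal sketch of switching encoding on a buffered read Handle. It assumes the hSetEncoding, latin1 and utf8 names exported by System.IO in GHC; latin1 stands in for ASCII here (ASCII text decodes identically under it), and readHeaderThenBody is a hypothetical helper name, not part of any library.

```haskell
import System.IO

-- | Hypothetical helper: read the first line as Latin-1, then switch the
-- same buffered Handle to UTF-8 and read the remainder.
readHeaderThenBody :: FilePath -> IO (String, String)
readHeaderThenBody path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1   -- header: one byte per character
    header <- hGetLine h
    hSetEncoding h utf8     -- the switch: buffered input is re-decoded here
    body <- hGetContents h
    -- hGetContents is lazy, so force the body before withFile closes h
    length body `seq` return (header, body)
```

Only the second hSetEncoding call should incur the re-decoding cost described above; a program that sets the encoding once and never changes it stays on the ordinary path.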
No, I think that's 30% for latin1. The cost is not really the character conversion but the copying from a byte buffer via iconv to a char buffer.
Don't we already have to copy between a byte buffer and a char buffer, since read() and write() use a byte buffer?
In the existing Handle mechanism we read() into a byte buffer and then, when doing say getLine or getContents, we allocate [Char]s in a loop reading bytes directly from the byte buffer. There is no separate character buffer.
Duncan
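As a rough illustration of that older style (not the actual GHC code), the sketch below turns each byte of a buffer directly into a Char, with no intermediate character buffer. It assumes the bytestring package that ships with GHC, and bytesToChars is a hypothetical name; treating each byte as a Latin-1 code point is also an assumption made for the example.

```haskell
import qualified Data.ByteString as B
import Data.Char (chr)

-- | Hypothetical helper: build a [Char] straight from the bytes,
-- mapping byte value n to the character with code point n.
bytesToChars :: B.ByteString -> String
bytesToChars = map (chr . fromIntegral) . B.unpack
```

With an encoding-aware Handle this single step becomes two: bytes are decoded into a character buffer first, and getLine or getContents then read from that buffer, which is the extra copy discussed above.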