
Simon Marlow wrote:
I've been working on adding proper Unicode support to Handle I/O in GHC, and I finally have something that's ready for testing. I've put a patchset here:
Yay! Comments below.
Comments/discussion please!
Do you expect Hugs will be able to pick up all of this?
The only change to the existing behaviour is that by default, text IO is done in the prevailing encoding of the system. Handles created by openBinaryFile use the Latin-1 encoding, as do Handles placed in binary mode using hSetBinaryMode.
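To make the binary-mode rule concrete, here is a minimal sketch, assuming a GHC whose System.IO exports hSetEncoding and utf8 as described in this thread. The helper name utf8ThenBinary is made up for illustration, and System.Directory comes from the directory package that ships with GHC. It writes a string as UTF-8 text, then reads the same file back through a binary Handle, where each byte should come back as one Char in the range 0..255.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: write a string to a fresh temporary file as UTF-8
-- text, then read the same file back through a binary (Latin-1) Handle.
utf8ThenBinary :: String -> IO String
utf8ThenBinary s = do
  tmpDir <- getTemporaryDirectory
  (path, h) <- openTempFile tmpDir "enc-demo.txt"
  hSetEncoding h utf8            -- override the locale default for this Handle
  hPutStr h s
  hClose h
  raw <- withBinaryFile path ReadMode $ \hb -> do
    str <- hGetContents hb
    -- force the lazy read before withBinaryFile closes the Handle
    length str `seq` return str
  removeFile path
  return raw

main :: IO ()
main = utf8ThenBinary "\233" >>= print   -- U+00E9 is the bytes C3 A9 in UTF-8
```

The point of the round trip is that the binary Handle does no decoding beyond byte-to-Char, so the two UTF-8 bytes show up as two separate characters.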
Sounds very good and reasonable.
We provide a way to change the encoding for an existing Handle:
hSetEncoding :: Handle -> TextEncoding -> IO ()
and various encodings:
latin1, utf8, utf16, utf16le, utf16be, utf32, utf32le, utf32be, localeEncoding,
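As a rough illustration of hSetEncoding with a few of these encodings, the sketch below writes the same two-character string under different encodings and reports the resulting file size. The helper name encodedSize is an assumption for this example, not part of the patch, and only the encodings with a fixed byte order are shown so that no byte-order mark enters the count.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: write a string with the given encoding and return
-- the size of the resulting file in bytes.
encodedSize :: TextEncoding -> String -> IO Integer
encodedSize enc s = do
  tmpDir <- getTemporaryDirectory
  (path, h) <- openTempFile tmpDir "enc-size.txt"
  hSetEncoding h enc
  hPutStr h s                    -- no newlines, so no newline translation
  hClose h
  size <- withBinaryFile path ReadMode hFileSize
  removeFile path
  return size

main :: IO ()
main = do
  encodedSize latin1  "hi" >>= print   -- 1 byte per character
  encodedSize utf16le "hi" >>= print   -- 2 bytes per character
  encodedSize utf32be "hi" >>= print   -- 4 bytes per character
```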
Will there also be something to handle the UTF-16 BOM? I'm not sure what the best API for that is, since it may or may not be present, but it should be considered -- and could perhaps help autodetect the encoding.
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
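A sketch of the mixed-encoding case described above, assuming hSetEncoding behaves on a buffered read Handle as stated. The file layout (one Latin-1 header line followed by a UTF-8 body line) and the name readMixed are invented for the example; a real HTTP or email parser would look quite different.

```haskell
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical demo: a header line read as Latin-1, then a body line read
-- as UTF-8, from the same block-buffered Handle.
readMixed :: IO (String, String)
readMixed = do
  tmpDir <- getTemporaryDirectory
  (path, hw) <- openTempFile tmpDir "mixed.txt"
  hSetBinaryMode hw True                     -- write raw bytes
  hPutStr hw "charset=utf-8\n\195\169\n"     -- body is U+00E9 encoded as UTF-8
  hClose hw
  h <- openFile path ReadMode
  hSetBuffering h (BlockBuffering Nothing)   -- keep buffering on
  hSetEncoding h latin1
  header <- hGetLine h
  hSetEncoding h utf8                        -- switch with data still buffered
  body <- hGetLine h
  hClose h
  removeFile path
  return (header, body)

main :: IO ()
main = readMixed >>= print
```

The interesting step is the second hSetEncoding: the whole file is already in the buffer by then, so this is exactly where the re-decoding penalty mentioned above would be paid.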
Sounds useful, but is this the bit that causes the 30% performance hit?
Performance is about 30% slower on "hGetContents >>= putStr" than before. I've profiled it, and about 25% of this is in doing the actual encoding/decoding; the rest is accounted for by the fact that we're shuffling around 32-bit chars rather than bytes in the Handle buffer, so there's not much we can do to improve this.
Does this mean that if we set the encoding to latin1, that we should see performance 5% worse than present? 30% slower is a big deal, especially since we're not all that speedy now.
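One way to check this would be a small timing harness along the following lines. This is only a sketch: copyWith and timeCopy are made-up names, the input is synthetic ASCII, and a single getCPUTime measurement is noisy, so any numbers it prints are indicative at best.

```haskell
import System.CPUTime (getCPUTime)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: copy src to dst through text Handles that both use
-- the given encoding, i.e. "hGetContents >>= hPutStr" with a chosen codec.
copyWith :: TextEncoding -> FilePath -> FilePath -> IO ()
copyWith enc src dst =
  withFile src ReadMode $ \hin ->
  withFile dst WriteMode $ \hout -> do
    hSetEncoding hin enc
    hSetEncoding hout enc
    hGetContents hin >>= hPutStr hout

-- CPU time for one copy, in picoseconds (the unit getCPUTime reports).
timeCopy :: TextEncoding -> FilePath -> FilePath -> IO Integer
timeCopy enc src dst = do
  t0 <- getCPUTime
  copyWith enc src dst
  t1 <- getCPUTime
  return (t1 - t0)

report :: FilePath -> FilePath -> (String, TextEncoding) -> IO ()
report src dst (name, enc) = do
  ps <- timeCopy enc src dst
  putStrLn (name ++ ": " ++ show (ps `div` 1000000) ++ " microseconds of CPU")

main :: IO ()
main = do
  tmpDir <- getTemporaryDirectory
  (src, h) <- openTempFile tmpDir "bench-in.txt"
  hPutStr h (concat (replicate 100000 "plain ASCII line\n"))
  hClose h
  let dst = src ++ ".out"
  mapM_ (report src dst) [("latin1", latin1), ("utf8", utf8)]
  removeFile src
  removeFile dst
```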
IO library restructuring ~~~~~~~~~~~~~~~~~~~~~~~~
The major change here is that the implementation of the Handle operations is separated from the underlying IO device, using type classes. File descriptors are just one IO provider; I have also implemented memory-mapped files (good for random-access read/write) and a Handle that pipes output to a Chan (useful for testing code that writes to a Handle). New kinds of Handle can be implemented outside the base package, for instance someone could write bytestringToHandle. A Handle is made using mkFileHandle:
Very nice. That means I can eliminate all the HVIO stuff I have in MissingH, which does roughly the same thing.
with making new kinds of Handle. We could split up the layers further later.
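To illustrate the general idea of separating Handle operations from the underlying device with a type class, here is a toy sketch. It is emphatically not the interface from the patch: the class Device, the type MemDevice and the function putStrLatin1 are all invented here, and the real classes will have more operations (reading, seeking, buffering and so on).

```haskell
import Data.Char (ord)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef)
import Data.Word (Word8)

-- Hypothetical device interface: NOT the class from the patch, just the
-- smallest thing that lets the text layer ignore where the bytes end up.
class Device d where
  devWrite :: d -> [Word8] -> IO ()

-- An in-memory device that accumulates every byte written to it, in the
-- spirit of a Handle used for testing code that writes output.
newtype MemDevice = MemDevice (IORef [Word8])

instance Device MemDevice where
  devWrite (MemDevice ref) bytes = modifyIORef ref (++ bytes)

newMemDevice :: IO MemDevice
newMemDevice = fmap MemDevice (newIORef [])

memContents :: MemDevice -> IO [Word8]
memContents (MemDevice ref) = readIORef ref

-- The "Handle" layer: encode a String as Latin-1 and pass the bytes to any
-- device. Code points above 255 are truncated by fromIntegral, so this is a
-- sketch rather than a real encoder.
putStrLatin1 :: Device d => d -> String -> IO ()
putStrLatin1 dev = devWrite dev . map (fromIntegral . ord)

main :: IO ()
main = do
  dev <- newMemDevice
  putStrLatin1 dev "hi"
  memContents dev >>= print   -- [104,105]
```

A file-descriptor or socket device would just be another instance, with the text layer left unchanged.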
Would it now be possible to make the Socket an instance of this typeclass, so we can work with it directly rather than having to convert it to a Handle first?

Thanks,

-- John