Ready for testing: Unicode support for Handle I/O

I've been working on adding proper Unicode support to Handle I/O in GHC, and I finally have something that's ready for testing. I've put a patchset here:

   http://www.haskell.org/~simonmar/base-unicode.tar.gz

That is a set of patches against a GHC repo tree: unpack the tarball, and say 'sh apply /path/to/ghc/repo' to apply all the patches. Then clean your tree and build it from scratch (or if you're using the new GHC build system, just say 'make' ;-). It should validate, bar one or two minor failures.

Oh, it doesn't work on Windows yet. That's the major thing left to do. If anyone else felt like tackling this I'd be delighted: all you have to do is implement a Win32 equivalent of the module GHC.IO.Encoding.Iconv (see below); everything else should work unchanged.

Depending on whether any further changes are required, I may amend-record some of these patches, so treat them as temporary patches for testing only.

Below is what will be the patch description in the patch for libraries/base. Comments/discussion please!

Cheers,
	Simon

Unicode-aware Handles
~~~~~~~~~~~~~~~~~~~~~

This is a significant restructuring of the Handle implementation, with the primary goal of supporting Unicode character encodings.

The only change to the existing behaviour is that by default, text IO is done in the prevailing encoding of the system. Handles created by openBinaryFile use the Latin-1 encoding, as do Handles placed in binary mode using hSetBinaryMode.

We provide a way to change the encoding for an existing Handle:

   hSetEncoding :: Handle -> TextEncoding -> IO ()

and various encodings:

   latin1, utf8, utf16, utf16le, utf16be, utf32, utf32le, utf32be, localeEncoding

and a way to look up other encodings:

   mkTextEncoding :: String -> IO TextEncoding

(it's system-dependent whether the requested encoding will be available).

Currently hSetEncoding is available from GHC.IO.Handle, and the encodings are available from GHC.IO.Encoding. We may want to export these from somewhere more permanent; that's something for a library proposal.
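As a minimal sketch of how this API fits together (the helper names and the "KOI8-R" encoding are only examples; the imports assume the module locations given above):

   import System.IO (withFile, IOMode(..), hGetContents)
   import GHC.IO.Handle (hSetEncoding)
   import GHC.IO.Encoding (utf8, mkTextEncoding)

   -- Read a file as UTF-8, regardless of the locale encoding.
   readUtf8File :: FilePath -> IO String
   readUtf8File path =
     withFile path ReadMode $ \h -> do
       hSetEncoding h utf8
       s <- hGetContents h
       length s `seq` return s  -- force the contents before withFile closes h

   -- Look up an encoding by name at runtime; this may fail if the
   -- system does not provide the requested encoding.
   readKoi8RFile :: FilePath -> IO String
   readKoi8RFile path =
     withFile path ReadMode $ \h -> do
       enc <- mkTextEncoding "KOI8-R"
       hSetEncoding h enc
       s <- hGetContents h
       length s `seq` return s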
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).

If there is a decoding error, it is reported when an attempt is made to read the offending character from the Handle, as you would expect.

Performance is about 30% slower on "hGetContents >>= putStr" than before. I've profiled it: about 25% of this is in doing the actual encoding/decoding; the rest is accounted for by the fact that we're shuffling around 32-bit chars rather than bytes in the Handle buffer, so there's not much we can do to improve this.

IO library restructuring
~~~~~~~~~~~~~~~~~~~~~~~~

The major change here is that the implementation of the Handle operations is separated from the underlying IO device, using type classes. File descriptors are just one IO provider; I have also implemented memory-mapped files (good for random-access read/write) and a Handle that pipes output to a Chan (useful for testing code that writes to a Handle). New kinds of Handle can be implemented outside the base package, for instance someone could write bytestringToHandle.

A Handle is made using mkFileHandle:

   -- | makes a new 'Handle'
   mkFileHandle :: (IODevice dev, BufferedIO dev, Typeable dev)
                => dev      -- ^ the underlying IO device, which must support
                            --   'IODevice', 'BufferedIO' and 'Typeable'
                -> FilePath -- ^ a string describing the 'Handle', e.g. the
                            --   file path for a file. Used in error messages.
                -> IOMode   -- ^ the mode in which the 'Handle' is to be used
                -> Maybe TextEncoding -- ^ text encoding to use, if any
                -> IO Handle

This also means that someone can write a completely new IO implementation on Windows based on native Win32 HANDLEs, and distribute it as a separate package (I really hope somebody does this!).

This restructuring isn't as radical as previous designs. I haven't made any attempt to make a separate binary I/O layer, for example (although hGetBuf/hPutBuf do bypass the text encoding). The main goal here was to get Unicode support in, and to allow others to experiment with making new kinds of Handle. We could split up the layers further later.

API changes and Module structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

NB. GHC.IOBase and GHC.Handle are now DEPRECATED (they are still present, but are just re-exporting things from other modules now). For 6.12 we'll want to bump base to version 5 and add a base4-compat. For now I'm using #if __GLASGOW_HASKELL__ >= 611 to avoid deprecation warnings.

I split modules into smaller parts in many places. For example, we now have GHC.IORef, GHC.MVar and GHC.IOArray containing the implementations of IORef, MVar and IOArray respectively. This was necessary for untangling dependencies, but it also makes things easier to follow.

The new module structure for the IO-related parts of the base package is:

   GHC.IO                   Implementation of the IO monad; unsafe*; throw/catch
   GHC.IO.IOMode            The IOMode type
   GHC.IO.Buffer            Buffers and operations on them
   GHC.IO.Device            The IODevice and RawIO classes
   GHC.IO.BufferedIO        The BufferedIO class
   GHC.IO.FD                The FD type, with instances of IODevice, RawIO
                            and BufferedIO
   GHC.IO.Exception         IO-related Exceptions
   GHC.IO.Encoding          The TextEncoding type; built-in TextEncodings;
                            mkTextEncoding
   GHC.IO.Encoding.Types
   GHC.IO.Encoding.Iconv    Implementation internals for GHC.IO.Encoding
   GHC.IO.Handle            The main API for GHC's Handle implementation;
                            provides all the Handle operations + mkFileHandle
                            + hSetEncoding
   GHC.IO.Handle.Types
   GHC.IO.Handle.Internals
   GHC.IO.Handle.Text       Implementation of Handles and operations
   GHC.IO.Handle.FD         Parts of the Handle API implemented by
                            file descriptors: openFile, stdin, stdout,
                            stderr, fdToHandle etc.
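As a rough sketch of what a new kind of Handle could look like, here is a toy write-only "null device" that discards its output. This is only an illustration against the mkFileHandle signature quoted above: the NullDevice type is made up, only a minimal subset of the IODevice and BufferedIO methods is implemented (the rest are assumed to have usable defaults), and the exact class methods may differ from what's in the patch.

   {-# LANGUAGE DeriveDataTypeable #-}
   import Data.Typeable (Typeable)
   import System.IO (Handle, IOMode(..))
   import GHC.IO.Handle (mkFileHandle)
   import GHC.IO.Encoding (localeEncoding)
   import GHC.IO.Device (IODevice(..), IODeviceType(..))
   import GHC.IO.BufferedIO (BufferedIO(..))
   import GHC.IO.Buffer (Buffer(..), newByteBuffer)

   -- A device that accepts and discards all writes, and is at EOF for reads.
   data NullDevice = NullDevice deriving Typeable

   instance IODevice NullDevice where
     ready _ _ _ = return True        -- always ready for I/O
     close _     = return ()
     devType _   = return Stream

   instance BufferedIO NullDevice where
     newBuffer _ state       = newByteBuffer 4096 state
     fillReadBuffer _ buf    = return (0, buf)       -- reads see immediate EOF
     fillReadBuffer0 _ buf   = return (Just 0, buf)
     flushWriteBuffer _ buf  = return buf { bufL = 0, bufR = 0 }  -- discard
     flushWriteBuffer0 _ buf =
       return (bufR buf - bufL buf, buf { bufL = 0, bufR = 0 })

   -- Wrap the device in a Handle using the locale encoding.
   nullHandle :: IO Handle
   nullHandle = mkFileHandle NullDevice "<null>" WriteMode (Just localeEncoding)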

Simon Marlow wrote:
I've been working on adding proper Unicode support to Handle I/O in GHC, and I finally have something that's ready for testing. I've put a patchset here:
Yay! Comments below.
Comments/discussion please!
Do you expect Hugs will be able to pick up all of this?
The only change to the existing behaviour is that by default, text IO is done in the prevailing encoding of the system. Handles created by openBinaryFile use the Latin-1 encoding, as do Handles placed in binary mode using hSetBinaryMode.
Sounds very good and reasonable.
We provide a way to change the encoding for an existing Handle:
hSetEncoding :: Handle -> TextEncoding -> IO ()
and various encodings:
latin1, utf8, utf16, utf16le, utf16be, utf32, utf32le, utf32be, localeEncoding,
Will there also be something to handle the UTF-16 BOM marker? I'm not sure what the best API for that is, since it may or may not be present, but it should be considered -- and could perhaps help autodetect encoding.
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
Performance is about 30% slower on "hGetContents >>= putStr" than before. I've profiled it, and about 25% of this is in doing the actual encoding/decoding, the rest is accounted for by the fact that we're shuffling around 32-bit chars rather than bytes in the Handle buffer, so there's not much we can do to improve this.
Does this mean that if we set the encoding to latin1, we should see performance 5% worse than present? 30% slower is a big deal, especially since we're not all that speedy now.
IO library restructuring
~~~~~~~~~~~~~~~~~~~~~~~~
The major change here is that the implementation of the Handle operations is separated from the underlying IO device, using type classes. File descriptors are just one IO provider; I have also implemented memory-mapped files (good for random-access read/write) and a Handle that pipes output to a Chan (useful for testing code that writes to a Handle). New kinds of Handle can be implemented outside the base package, for instance someone could write bytestringToHandle. A Handle is made using mkFileHandle:
Very nice. That means I can eliminate all the HVIO stuff I have in MissingH, which does roughly the same thing.
with making new kinds of Handle. We could split up the layers further later.
Would it now be possible to make the Socket an instance of this typeclass, so we can work with it directly rather than having to convert it to a Handle first? Thanks, -- John

On Tue, 2009-02-03 at 11:03 -0600, John Goerzen wrote:
Will there also be something to handle the UTF-16 BOM marker? I'm not sure what the best API for that is, since it may or may not be present, but it should be considered -- and could perhaps help autodetect encoding.
I think someone else mentioned this already, but utf16 (as opposed to utf16be/le) will use the BOM if it's present. I'm not quite sure what happens when you switch encoding; presumably it'll accept and consider a BOM at that point.
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
No. You only pay that penalty if you switch encoding. The standard case has no extra cost.
Performance is about 30% slower on "hGetContents >>= putStr" than before. I've profiled it, and about 25% of this is in doing the actual encoding/decoding, the rest is accounted for by the fact that we're shuffling around 32-bit chars rather than bytes in the Handle buffer, so there's not much we can do to improve this.
Does this mean that if we set the encoding to latin1, we should see performance 5% worse than present?
No, I think that's 30% for latin1. The cost is not really the character conversion but the copying from a byte buffer via iconv to a char buffer.
30% slower is a big deal, especially since we're not all that speedy now.
Bear in mind that's talking about the [Char] interface, and nobody using that is expecting great performance. We already have an API for getting big chunks of bytes out of a Handle, with the new Handle we'll also want something equivalent for a packed text representation. Hopefully we can get something nice with the new text package. Duncan
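For instance, with the bytestring package one can already fold over a Handle in large byte chunks and never touch [Char] (a sketch; the helper name foldChunks is made up):

   import qualified Data.ByteString as B
   import System.IO (IOMode(..), openBinaryFile, hClose)

   -- Process a file in 64k byte chunks, avoiding the [Char] interface.
   foldChunks :: (a -> B.ByteString -> a) -> a -> FilePath -> IO a
   foldChunks f z path = do
     h <- openBinaryFile path ReadMode
     let go acc = do
           chunk <- B.hGet h 65536
           if B.null chunk
             then hClose h >> return acc
             else go (f acc chunk)
     go z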

On Tue, Feb 03, 2009 at 10:56:13PM +0000, Duncan Coutts wrote:
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
No. You only pay that penalty if you switch encoding. The standard case has no extra cost.
I'm confused. I thought the standard case was conversion to the system's local encoding? How is that different than selecting the same encoding manually? There always has to be *some* conversion from a 32-bit Char to the system's selection, right? What exactly do we have to do to avoid the penalty?
No, I think that's 30% for latin1. The cost is not really the character conversion but the copying from a byte buffer via iconv to a char buffer.
Don't we already have to copy between a byte buffer and a char buffer, since read() and write() use a byte buffer?
30% slower is a big deal, especially since we're not all that speedy now.
Bear in mind that's talking about the [Char] interface, and nobody using that is expecting great performance. We already have an API for getting
Yes, I know, but it's still the most convenient interface, and making it suck more isn't cool -- though there are certainly big wins here. -- John

On Tue, 2009-02-03 at 17:39 -0600, John Goerzen wrote:
On Tue, Feb 03, 2009 at 10:56:13PM +0000, Duncan Coutts wrote:
Thanks to suggestions from Duncan Coutts, it's possible to call hSetEncoding even on buffered read Handles, and the right thing happens. So we can read from text streams that include multiple encodings, such as an HTTP response or email message, without having to turn buffering off (though there is a penalty for switching encodings on a buffered Handle, as the IO system has to do some re-decoding to figure out where it should start reading from again).
Sounds useful, but is this the bit that causes the 30% performance hit?
No. You only pay that penalty if you switch encoding. The standard case has no extra cost.
I'm confused. I thought the standard case was conversion to the system's local encoding? How is that different than selecting the same encoding manually?
Sorry, I think we've been talking at cross purposes.
There always has to be *some* conversion from a 32-bit Char to the system's selection, right?
Yes. In text mode there is always some conversion going on. Internally there is a byte buffer and a char buffer (ie UTF32).
What exactly do we have to do to avoid the penalty?
The penalty we're talking about here is not the cost of converting bytes to characters, it's in switching which encoding the Handle is using. For example you might read some HTTP headers in ASCII and then switch the Handle encoding to UTF-8 to read some XML.

Switching the Handle encoding has a penalty. We have to discard the characters that we pre-decoded and re-decode the byte buffer in the new encoding. It's actually slightly more complicated, because we do not track exactly how the byte and character buffers relate to each other (it'd be too expensive in the normal cases), so to work out the relationship when switching encoding we have to re-decode all the way from the beginning of the current byte buffer.

The point is, in terms of performance we get the ability to switch handle encoding more or less for free. It has a cost in terms of code complexity: the simpler alternative design was that you would not be able to switch encoding on a read handle that used any buffering at the character level without losing bytes. The performance penalty when switching encoding is the downside to the ordinary code path being fast.
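In code, the scenario Duncan describes looks roughly like this (a hypothetical sketch; the function name and the header-parsing details are made up):

   import System.IO (Handle, hGetLine, hGetContents)
   import GHC.IO.Handle (hSetEncoding)
   import GHC.IO.Encoding (latin1, utf8)

   -- Read ASCII headers up to a blank line, then switch the (still
   -- buffered) Handle to UTF-8 for the body.
   readHeadersThenBody :: Handle -> IO ([String], String)
   readHeadersThenBody h = do
       hSetEncoding h latin1
       headers <- go []
       hSetEncoding h utf8   -- the re-decoding penalty is paid here, once
       body <- hGetContents h
       return (headers, body)
     where
       go acc = do
         line <- hGetLine h
         if null line || line == "\r"
           then return (reverse acc)
           else go (line : acc)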
No, I think that's 30% for latin1. The cost is not really the character conversion but the copying from a byte buffer via iconv to a char buffer.
Don't we already have to copy between a byte buffer and a char buffer, since read() and write() use a byte buffer?
In the existing Handle mechanism we read() into a byte buffer and then when doing say getLine or getContents we allocate [Char]'s in a loop reading bytes directly from the byte buffer. There is no separate character buffer. Duncan

Duncan Coutts wrote:
Sorry, I think we've been talking at cross purposes.
I think so.
There always has to be *some* conversion from a 32-bit Char to the system's selection, right?
Yes. In text mode there is always some conversion going on. Internally there is a byte buffer and a char buffer (ie UTF32).
What exactly do we have to do to avoid the penalty?
The penalty we're talking about here is not the cost of converting bytes to characters, it's in switching which encoding the Handle is using. For example you might read some HTTP headers in ASCII and then switch the Handle encoding to UTF8 to read some XML.
Simon referenced a 30% penalty. Are you saying that if we read from a Handle using the same encoding that we used when we first opened it, we won't see any slowdown vs. the system in 6.10?
Switching the Handle encoding has a penalty. We have to discard the characters that we pre-decoded and re-decode the byte buffer in the new encoding. It's actually slightly more complicated because we do not
Got it. That makes sense, as does the decision to optimize for the more common (not switching the encoding) case. -- John

Duncan Coutts wrote:
On Tue, 2009-02-03 at 11:03 -0600, John Goerzen wrote:
Will there also be something to handle the UTF-16 BOM marker? I'm not sure what the best API for that is, since it may or may not be present, but it should be considered -- and could perhaps help autodetect encoding.
I think someone else mentioned this already, but utf16 (as opposed to utf16be/le) will use the BOM if it's present.
I'm not quite sure what happens when you switch encoding; presumably it'll accept and consider a BOM at that point.
Yes; the utf16 and utf32 encodings accept a BOM (and generate a BOM in write mode). This caused interesting bugs when doing re-decoding after switching encodings, because the BOM constitutes state in the decoder, which means that decoding is not necessarily repeatable unless you save the state (which iconv doesn't provide a way to do).

Are there other encodings that have this kind of state? If so, then they might be restricted to NoBuffering, at least when switching encodings.

Cheers,
	Simon
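The BOM-on-write behaviour is easy to observe: write through the utf16 encoding and inspect the raw bytes, which should start with FE FF or FF FE depending on the platform's byte order (a sketch; the file name is arbitrary):

   import System.IO (withFile, IOMode(..), hPutStr)
   import GHC.IO.Handle (hSetEncoding)
   import GHC.IO.Encoding (utf16)
   import qualified Data.ByteString as B

   main :: IO ()
   main = do
     withFile "bom-test.txt" WriteMode $ \h -> do
       hSetEncoding h utf16
       hPutStr h "hi"
     bytes <- B.readFile "bom-test.txt"
     print (B.unpack (B.take 2 bytes))   -- e.g. [255,254] on little-endian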

On Wed, 2009-02-04 at 13:31 +0000, Simon Marlow wrote:
Duncan Coutts wrote:
On Tue, 2009-02-03 at 11:03 -0600, John Goerzen wrote:
Will there also be something to handle the UTF-16 BOM marker? I'm not sure what the best API for that is, since it may or may not be present, but it should be considered -- and could perhaps help autodetect encoding.
I think someone else mentioned this already, but utf16 (as opposed to utf16be/le) will use the BOM if it's present.
I'm not quite sure what happens when you switch encoding; presumably it'll accept and consider a BOM at that point.
Yes; the utf16 and utf32 encodings accept a BOM (and generate a BOM in write mode). This caused interesting bugs when doing re-decoding after switching encodings, because the BOM constitutes state in the decoder, which means that decoding is not necessarily repeatable unless you save the state (which iconv doesn't provide a way to do).
Are there other encodings that have this kind of state? If so, then they might be restricted to NoBuffering at least when switching encodings.
Yes, I believe there are some Asian encodings that are stateful. Duncan

Simon Marlow wrote:
The only change to the existing behaviour is that by default, text IO is done in the prevailing encoding of the system. Handles created by openBinaryFile use the Latin-1 encoding, as do Handles placed in binary mode using hSetBinaryMode.
Wouldn't it be semantically correct for a "binary handle" to "return" [Word8]?

Also, switching from text to binary (hSetBinaryMode) doesn't seem "natural". I understand that this has "heavy" consequences...

Pao

Paolo Losi writes:
Wouldn't it be semantically correct for a "binary handle" to "return" [Word8]?

Wouldn't it be more correct to separate binary IO, which returns [Word8] (or ByteString), from text IO, which returns [Char] and deals with text encoding? IIRC that was done in Bulat Ziganshin's streams library.
-- WBR, Max Vasin.

Max Vasin wrote:
Wouldn't it be more correct to separate binary IO, which returns [Word8] (or ByteString), from text IO, which returns [Char] and deals with text encoding? IIRC that was done in Bulat Ziganshin's streams library.
That's exactly what I meant. Text IO could then be implemented on top of binary IO. Would it be possible to envision a solution that enables composing low-level IO strategies with binary <-> text conversion strategies?

Pao

Paolo Losi wrote:
Simon Marlow wrote:
The only change to the existing behaviour is that by default, text IO is done in the prevailing encoding of the system. Handles created by openBinaryFile use the Latin-1 encoding, as do Handles placed in binary mode using hSetBinaryMode.
Wouldn't it be semantically correct for a "binary handle" to "return" [Word8]?
Also, switching from text to binary (hSetBinaryMode) doesn't seem "natural".
Yes, and as I said in the original message, I haven't done the binary/text separation (yet). I agree it's something that should be done, and the current API leaves a lot to be desired, but the main goal was to get Unicode text I/O working without breaking any existing code (or at least without breaking any code that isn't already morally broken :-). As a side-effect I managed to do some useful refactoring which should make further separation of layers much easier. So you should think of this as a step in the right direction, with further steps to come in the future.

A while back there was a lot of activity on developing new IO library designs. There are a bunch of these: Bulat's streams library, a variant of Bulat's done by Takano Akio, John Goerzen's HVIO, and I had a prototype streams library too. The problem is, it's a lot of work to make a complete IO library implementation, agree on the API, and migrate over from the old one. And while we're on the subject of redesigning IO libraries, it's not at all clear that the imperative approach is the right one either.

Cheers,
	Simon
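In the meantime, byte-level access is available through hGetBuf/hPutBuf, which (as the announcement notes) bypass the text encoding. A small sketch (the helper name hGetBytes is made up):

   import Data.Word (Word8)
   import Foreign.Marshal.Alloc (allocaBytes)
   import Foreign.Marshal.Array (peekArray)
   import System.IO (Handle, hGetBuf)

   -- Read up to n raw bytes from a Handle, bypassing its text encoding.
   hGetBytes :: Handle -> Int -> IO [Word8]
   hGetBytes h n =
     allocaBytes n $ \ptr -> do
       got <- hGetBuf h ptr n     -- number of bytes actually read
       peekArray got ptr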

On Tuesday 03 February 2009 19:42:44 Simon Marlow wrote:
I've been working on adding proper Unicode support to Handle I/O in GHC, and I finally have something that's ready for testing. I've put a patchset here:
http://www.haskell.org/~simonmar/base-unicode.tar.gz
... skipped ...
Comments/discussion please!
How do you plan to handle filenames? Currently FilePath is simply a string. Would it be decoded/encoded automatically? If so, there is a nasty catch: not all valid filenames can be represented as strings. On Linux (and, I suspect, all unices) a file name is a sequence of bytes. For example, consider a file named {0xff} on a computer with a UTF-8 locale. It's a valid name and everything, but it cannot be converted to a String: the byte 0xff cannot appear in a UTF-8 string.

-- Khudyakov Alexey

Hello Khudyakov, Saturday, February 7, 2009, 4:01:57 PM, you wrote:
How do you plan to handle filenames? Currently FilePath is simply a string.
I think this patch does nothing for Unicode filename support.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Khudyakov,
Saturday, February 7, 2009, 4:01:57 PM, you wrote:
How do you plan to handle filenames? Currently FilePath is simply a string.
I think this patch does nothing for Unicode filename support.
Correct - I'm aware that there's a problem with filenames, but it hasn't been tackled yet. There probably isn't anything sensible that we can do without changing FilePath into an ADT. Cheers, Simon

Hello Simon, Thursday, February 19, 2009, 3:21:03 PM, you wrote:
Correct - I'm aware that there's a problem with filenames, but it hasn't been tackled yet. There probably isn't anything sensible that we can do without changing FilePath into an ADT.
I think FilePath = String is OK; we just need to use UTF-8 string encoding on Unix/Macs and UTF-16 encoding with the *W functions on Windows. I have implemented such support inside my own application, and I'd be happy to mentor an appropriate GSoC project.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com
participants (7):

- Bulat Ziganshin
- Duncan Coutts
- John Goerzen
- Khudyakov Alexey
- Max Vasin
- Paolo Losi
- Simon Marlow