ByteString-backed Handles, and another couple of questions

Hi, Simon -
I just added support to Data.Text for your new Unicode-based Handle implementation, and I'd like to write some tests. The natural way to do this would be to create Handles that will write to, and read from, ByteStrings. Does any such code exist at the moment? I don't see it in base or bytestring, though all the necessary abstractions appear to be present.
Also, the place I hooked into the new I/O machinery was at the next level up from CharBuffer. Because the implementation of CharBuffer isn't abstract, I had no opportunity to put a text array in there, so there's an extra amount of copying that happens when going from byte buffer to char buffer to Text. It's a bit of a shame, but I don't see a way around it at the moment. Would you be interested in trying to remove that extra copy, or is the current interface set in stone?
Many thanks for your great work on this,
Bryan.
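If no ByteString-backed Handle is to hand, one possible stopgap for this kind of test is to round-trip through a temporary file: write the Text through an ordinary Handle with a chosen encoding and read the raw bytes back as a ByteString, or the reverse for reading. The sketch below is only an illustration of that idea, not the Data.Text test suite; the helper names writeViaHandle, readViaHandle and roundTrips are made up here, and it assumes the text, bytestring and directory packages are installed.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Hypothetical helper: push a Text through a real Handle using the given
-- encoding, then return the bytes that ended up on disk.
writeViaHandle :: TextEncoding -> T.Text -> IO B.ByteString
writeViaHandle enc t = do
  dir <- getTemporaryDirectory
  (path, h) <- openTempFile dir "text-io.tmp"
  hSetEncoding h enc
  hSetNewlineMode h noNewlineTranslation  -- keep '\n' as a single byte
  TIO.hPutStr h t
  hClose h
  bytes <- B.readFile path
  removeFile path
  return bytes

-- Hypothetical helper: store raw bytes in a file, then decode them by
-- reading Text back through a Handle with the given encoding.
readViaHandle :: TextEncoding -> B.ByteString -> IO T.Text
readViaHandle enc bytes = do
  dir <- getTemporaryDirectory
  (path, h) <- openTempFile dir "text-io.tmp"
  B.hPut h bytes
  hClose h
  t <- withFile path ReadMode $ \rh -> do
    hSetEncoding rh enc
    hSetNewlineMode rh noNewlineTranslation
    TIO.hGetContents rh  -- strict, so safe to use inside withFile
  removeFile path
  return t

-- Encode through a Handle, decode through a Handle, compare with the input.
roundTrips :: TextEncoding -> T.Text -> IO Bool
roundTrips enc t = do
  bytes <- writeViaHandle enc t
  t' <- readViaHandle enc bytes
  return (t' == t)
```

This exercises the same encoding path as a ByteString-backed Handle would, at the cost of touching the file system on every test case.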

On 15/12/09 06:09, Bryan O'Sullivan wrote:
I just added support to Data.Text for your new Unicode-based Handle implementation, and I'd like to write some tests. The natural way to do this would be to create Handles that will write to, and read from, ByteStrings. Does any such code exist at the moment? I don't see it in base or bytestring, though all the necessary abstractions appear to be present.
I haven't implemented a bytestring-backed Handle, but as you say all the abstractions should be present. It would be a great thing to have on Hackage. A good starting point would be the mmap-backed Handle code that I wrote for my talk at the Haskell Implementors Workshop last year. I'd intended to polish this up and upload to Hackage, but never got around to it. I've put the code here for now: http://www.haskell.org/~simonmar/mmap-handle.tar.gz
Also, the place I hooked into the new I/O machinery was at the next level up from CharBuffer. Because the implementation of CharBuffer isn't abstract, I had no opportunity to put a text array in there, so there's an extra amount of copying that happens when going from byte buffer to char buffer to Text. It's a bit of a shame, but I don't see a way around it at the moment. Would you be interested in trying to remove that extra copy, or is the current interface set in stone?
Yes, you may remember we talked about this in Edinburgh (the conversation would probably make more sense to you now than it did then :-).
One thing I experimented with is making CharBuffers use UTF-16. You'll see some instances of #ifdef CHARBUF_UTF16 in the code - it partially works, I believe the main missing piece is support in the built-in codecs. I don't think it would be too hard to fix them, they just need to be more abstract about offsets in the CharBuffer; writeCharBuffer/readCharBuffer already handle the UTF-16 encoding/decoding.
So one possibility is to get this working and then avoid the extra copy by just taking out the ByteArray# inside a CharBuffer and turning it into a text buffer. I'm not sure of the details here, but I imagine something along those lines would work. We would then have to allocate a new CharBuffer for the Handle.
Another possibility is (as you suggested) to make Handles independent of the representation of the CharBuffer, making it completely abstract. I haven't put much thought into that, it might well be a better approach. It would presumably involve a new existential class constraint in the Handle for the CharBuffer operations, and we'd have to be careful about performance: currently I think the CharBuffer operations get inlined nicely.
Cheers,
Simon
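To illustrate why the codecs would need to be more abstract about offsets under a UTF-16 CharBuffer: a Char then occupies either one or two 16-bit code units, so an index into the buffer is no longer a character count. The following is a minimal stand-alone sketch of UTF-16 encoding and decoding over plain lists. It is not the code in GHC's writeCharBuffer/readCharBuffer, and the names encodeUtf16 and decodeUtf16 are invented for this example.

```haskell
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Char (chr, ord)
import Data.Word (Word16)

-- Encode one Char as one or two UTF-16 code units.
-- Note: a Char that is itself a surrogate (U+D800..U+DFFF) is passed
-- through as a lone code unit; a real codec would have to reject it.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    m  = n - 0x10000
    hi = fromIntegral (0xD800 + (m `shiftR` 10))  -- high (lead) surrogate
    lo = fromIntegral (0xDC00 + (m .&. 0x3FF))    -- low (trail) surrogate

-- Decode the Char at the front of a code-unit sequence, returning it
-- together with the number of code units consumed (1 or 2).
-- Returns Nothing on empty input or an unpaired surrogate.
decodeUtf16 :: [Word16] -> Maybe (Char, Int)
decodeUtf16 (w1 : rest)
  | w1 < 0xD800 || w1 > 0xDFFF = Just (chr (fromIntegral w1), 1)
  | w1 <= 0xDBFF
  , (w2 : _) <- rest
  , w2 >= 0xDC00 && w2 <= 0xDFFF =
      Just ( chr (0x10000
                  + ((fromIntegral w1 - 0xD800) `shiftL` 10)
                  + (fromIntegral w2 - 0xDC00))
           , 2 )
decodeUtf16 _ = Nothing
```

The second component of decodeUtf16's result is the point: anything that advances through the buffer has to step by that amount rather than by one.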

On Tue, Dec 15, 2009 at 1:39 AM, Simon Marlow wrote:
I haven't implemented a bytestring-backed Handle, but as you say all the abstractions should be present. It would be a great thing to have on Hackage.
A good starting point would be the mmap-backed Handle code that I wrote for my talk at the Haskell Implementors Workshop last year. I'd intended to polish this up and upload to Hackage, but never got around to it. I've put the code here for now:
Ooh, thanks! I'll take a look-see.
Yes, you may remember we talked about this in Edinburgh (the conversation would probably make more sense to you now than it did then :-).
I do indeed remember :-)
One thing I experimented with is making CharBuffers use UTF-16. You'll see some instances of #ifdef CHARBUF_UTF16 in the code - it partially works, I believe the main missing piece is support in the built-in codecs. I don't think it would be too hard to fix them, they just need to be more abstract about offsets in the CharBuffer; writeCharBuffer/readCharBuffer already handle the UTF-16 encoding/decoding.
So one possibility is to get this working and then avoid the extra copy by just taking out the ByteArray# inside a CharBuffer and turning it into a text buffer. I'm not sure of the details here, but I imagine something along those lines would work. We would then have to allocate a new CharBuffer for the Handle.
Yes, that would amount to double-buffering, and would work nicely, only the current buffers go through foreign pointers while text uses an unpinned array. I can see why this is (so iconv can actually work), but it does introduce a fly into the ointment :-)
Another possibility is (as you suggested) to make Handles independent of the representation of the CharBuffer, making it completely abstract. I haven't put much thought into that, it might well be a better approach. It would presumably involve a new existential class constraint in the Handle for the CharBuffer operations, and we'd have to be careful about performance: currently I think the CharBuffer operations get inlined nicely.
Aye. I think this would have the same problem with foreign transcoding code that wants a reliable pointer.

On Tue, 2009-12-15 at 12:48 -0800, Bryan O'Sullivan wrote:
Yes, that would amount to double-buffering, and would work nicely, only the current buffers go through foreign pointers while text uses an unpinned array. I can see why this is (so iconv can actually work), but it does introduce a fly into the ointment :-)
It should not be strictly necessary to use a ForeignPtr in this case. If the IO buffers use pinned ByteArray#s then they can still be passed to iconv for it to write into. It should also be possible for Text to be constructed from a pinned ByteArray#.
Duncan
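As a small illustration of the point about pinned memory: GHC.ForeignPtr's mallocPlainForeignPtrBytes allocates a pinned byte array on the Haskell heap, and the resulting pointer can be handed to foreign code that writes into it. The sketch below uses C's memset as a stand-in for iconv, purely to keep the example self-contained; the helper name fillPinned is hypothetical.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Word (Word8)
import Foreign.C.Types (CInt(..), CSize(..))
import Foreign.ForeignPtr (withForeignPtr)
import Foreign.Marshal.Array (peekArray)
import Foreign.Ptr (Ptr)
import GHC.ForeignPtr (mallocPlainForeignPtrBytes)

-- memset stands in for a foreign transcoder such as iconv: it is simply
-- some C code that writes through a raw pointer.
foreign import ccall unsafe "string.h memset"
  c_memset :: Ptr Word8 -> CInt -> CSize -> IO (Ptr Word8)

-- Hypothetical helper: allocate n pinned bytes, let C fill them with the
-- given byte value, then read them back from Haskell.
fillPinned :: Int -> Word8 -> IO [Word8]
fillPinned n byte = do
  fp <- mallocPlainForeignPtrBytes n  -- pinned, so the GC will not move it
  withForeignPtr fp $ \p -> do
    _ <- c_memset p (fromIntegral byte) (fromIntegral n)
    peekArray n p
```

Because the array is pinned, the address stays valid for the duration of the foreign call, which is the property a foreign transcoder relies on.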

On 16/12/09 03:02, Duncan Coutts wrote:
On Tue, 2009-12-15 at 12:48 -0800, Bryan O'Sullivan wrote:
Yes, that would amount to double-buffering, and would work nicely, only the current buffers go through foreign pointers while text uses an unpinned array. I can see why this is (so iconv can actually work), but it does introduce a fly into the ointment :-)
It should not be strictly necessary to use a ForeignPtr in this case. If the IO buffers use pinned ByteArray#s then they can still be passed to iconv for it to write into.
It should also be possible for Text to be constructed from a pinned ByteArray#.
I don't think there's any real difficulty here. The IO buffers are ForeignPtrs because we want the flexibility of being able to use mmap'd memory - the mmap-backed Handle implementation uses that. But in normal operation these ForeignPtrs have pinned ByteArray#s inside. We could provide a function of type ForeignPtr -> IO (Maybe ByteArray), and the text package can turn that into a chunk of Text. It would even be able to support reading Text from mmap'd memory, because only the byte buffer needs to be mmap'd, not the Char buffer.
Cheers,
Simon
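A rough sketch of what such a function might look like, written against the constructors that GHC.ForeignPtr exports. This is an assumption-laden illustration rather than an API that base provides: the names PinnedBytes, backingArray, hasBackingArray and demo are invented here, the function is pure rather than in IO, and it returns the mutable array without accounting for any offset of the pointer into that array.

```haskell
{-# LANGUAGE MagicHash #-}
import Data.Word (Word8)
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (nullPtr)
import GHC.Exts (MutableByteArray#, RealWorld)
import GHC.ForeignPtr
  (ForeignPtr(..), ForeignPtrContents(..), mallocPlainForeignPtrBytes)

-- Hypothetical boxed wrapper around the pinned array inside a ForeignPtr.
data PinnedBytes = PinnedBytes (MutableByteArray# RealWorld)

-- Return the heap-allocated array behind a ForeignPtr, if there is one.
-- A pointer to genuinely foreign memory (for example an mmap'd region)
-- has no such array and yields Nothing.
-- Caveat: the ForeignPtr's address may point into the middle of the
-- array; a real implementation would also have to return that offset.
backingArray :: ForeignPtr a -> Maybe PinnedBytes
backingArray (ForeignPtr _ contents) =
  case contents of
    PlainPtr mba    -> Just (PinnedBytes mba)
    MallocPtr mba _ -> Just (PinnedBytes mba)
    _               -> Nothing

hasBackingArray :: ForeignPtr a -> Bool
hasBackingArray fp =
  case backingArray fp of
    Just _  -> True
    Nothing -> False

-- Compare a heap-allocated buffer with a pointer to foreign memory.
demo :: IO (Bool, Bool)
demo = do
  heapBuf    <- mallocPlainForeignPtrBytes 16 :: IO (ForeignPtr Word8)
  foreignBuf <- newForeignPtr_ nullPtr        :: IO (ForeignPtr Word8)
  return (hasBackingArray heapBuf, hasBackingArray foreignBuf)
```

Turning the result into a Text would still need the array frozen and the text package to accept a pinned array, which is the part the messages above leave open.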
participants (3): Bryan O'Sullivan, Duncan Coutts, Simon Marlow