
On Wed, 2007-09-05 at 21:30 +0200, Peter Simons wrote:
> As far as I can tell, the only reason why a function like 'unsafeUseAsCStringLen' has to be dubbed unsafe is because 'index' makes it unsafe. The limitation that ByteString has to be immutable is a consequence of the choice to provide 'index' as a pure function.
Well, it's not just 'index': all the functions that get data out of a ByteString, like head/tail/uncons and so on, are pure. That is the whole point of the design of ByteString: to provide pure, immutable, high-performance strings. What you want is just fine, but it's a mutable interface, not a pure one. We cannot provide any operations that mutate an existing ByteString without breaking the semantics of all the pure operations. It's very much like the difference between the MArray and IArray classes for mutable and immutable arrays: one provides indexing in a monad, the other purely.
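To make the distinction concrete, here's a rough sketch (untested; 'indexIO' is a made-up name for illustration, and the module names are as in recent bytestring releases). The pure 'index' is only sound because the buffer underneath never changes; the monadic version dereferences the buffer in IO, the way an MArray-style interface has to:

    import qualified Data.ByteString as B
    import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
    import Foreign.Storable (peekByteOff)
    import Data.Word (Word8)

    -- Pure interface: an ordinary function, sound only because the
    -- buffer behind a ByteString is never mutated.
    pureIndex :: B.ByteString -> Int -> Word8
    pureIndex = B.index

    -- Monadic counterpart: dereference the underlying buffer in IO,
    -- which would remain sound even if the buffer were mutable.
    indexIO :: B.ByteString -> Int -> IO Word8
    indexIO bs i =
      unsafeUseAsCStringLen bs $ \(ptr, len) ->
        if i < 0 || i >= len
          then ioError (userError "indexIO: index out of range")
          else peekByteOff ptr i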
> Personally, I won't use 'index' in my code. I'll happily dereference the pointer in the IO monad, because I've found that to be no effort whatsoever. I love monads. For my purposes, 'unsafeUseAsCStringLen' is a perfectly safe function. The efficient variant of 'hGet' I posted can be implemented on top of it, so that 'hGet' is by all means a safe function in my code. There really is no risk at all, unless one uses 'index' or something that's based on it.
Right, or if you were to hand out a ByteString and then change its contents when nobody is looking, that would be very much unsafe. The point is that you can break the semantics locally and nobody will notice. It's not a technique we should encourage, however.
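To spell out that breakage, here's a deliberately bad, untested sketch: after an in-place write through the pointer, the very same pure expression yields two different results, which is exactly the violation of referential transparency at stake:

    import qualified Data.ByteString as B
    import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
    import Foreign.Ptr (castPtr)
    import Foreign.Storable (poke)
    import Data.Word (Word8)
    import Control.Exception (evaluate)

    -- DON'T do this: it mutates the buffer that 'bs' hands out.
    broken :: IO ()
    broken = do
      let bs = B.pack [1, 2, 3]
      before <- evaluate (B.index bs 0)     -- sees 1
      unsafeUseAsCStringLen bs $ \(ptr, _len) ->
        poke (castPtr ptr) (9 :: Word8)     -- overwrite the first byte
      after <- evaluate (B.index bs 0)      -- now sees 9
      print (before, after)                 -- one pure expression, two answers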
> The way I see it, there will be other people who'll find the performance limitations of the standard 'hGet' a decisive factor in their design decisions. Chances are, those people will wonder about using the base pointer for hGetBuf, and then they'll end up re-inventing the wheel we just came up with.
I'd rather not provide a quick, easy way to break the semantics. unsafeUseAsCStringLen and friends already provide plenty of rope...
> Maybe I'll find the time to submit a patch to the documentation, so that fine points like an optimal buffer size etc. are explained in more detail than they are right now. It would be nice if some kind of result would come out of this discussion.
I really don't think we can provide anything that copies into an existing pre-allocated ByteString. As far as I can see, the best we can do is allocate a fresh buffer and do a single copy into it. Mutating an existing raw buffer is fine, and System.IO already provides hGetBuf for that. But you have to be really, really careful if you create a ByteString from the contents of that mutable buffer without making a copy first.
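For reference, the allocate-fresh-and-copy-once version comes out roughly like this (untested sketch; 'createAndTrim' lives in bytestring's internal module, with the usual stability caveats):

    import qualified Data.ByteString as B
    import Data.ByteString.Internal (createAndTrim)
    import System.IO (Handle, hGetBuf)

    -- Allocate a fresh buffer of up to n bytes, fill it with a single
    -- hGetBuf call, and trim to the number of bytes actually read:
    -- one copy, and no mutation ever visible from pure code.
    hGetFresh :: Handle -> Int -> IO B.ByteString
    hGetFresh h n = createAndTrim n $ \ptr -> hGetBuf h ptr n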
> Anyway, thank you. I appreciate everyone's efforts in helping me figure out why I/O with ByteString is more than two times slower than it could be.
Thanks very much for pointing out where we were copying more than necessary.

As for the last bit of the performance difference, the cache benefits of reusing a mutable buffer rather than allocating and GCing a series of buffers, I can't see any way to achieve that within the existing design. Bear in mind that these cache benefits are fairly small in realistic benchmarks, as opposed to running 'cat' on fully cached files. Usually you do some actual I/O and perform some operation on the data, rather than just copying it from one file descriptor to another. For example, my lazy ByteString binding to iconv performs exactly the same as the command-line iconv. In that case we are doing a bit of work on the data, which swamps the cache benefit that the command-line iconv program gets from using mutable buffers.

If we are trying to optimise the 'cat' case, e.g. for network servers, there are even lower-level things we can do so that no copies of the data have to be made at all: e.g. mmap, or Linux's copyfile or splice. ByteString certainly isn't the right abstraction for that, though.

Duncan