
Duncan Coutts writes:
> What you want is just fine, but it's a mutable interface not a pure one. We cannot provide any operations that mutate an existing ByteString without breaking the semantics of all the pure operations.
Is that so? How exactly does mutating a ByteString break the semantics of the pure function 'take'?
> It's very much like the difference between the MArray and IArray classes, for mutable and immutable arrays. One provides index in a monad, the other is pure.
Right. Now I wonder: why does ByteString provide an immutable interface but not a mutable one? Apparently mutable interfaces are useful for some purposes, right? Why else would the Array package provide one?
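The distinction Duncan is drawing can be sketched in a few lines (a minimal example using the array package's Data.Array and Data.Array.ST modules; values are illustrative):

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array (listArray, (!))
import Data.Array.ST (STUArray, newListArray, readArray, writeArray)

-- Immutable interface: indexing is a pure function.
pureIndex :: Int
pureIndex = listArray (0, 2 :: Int) [10, 20, 30] ! 1

-- Mutable interface: reads and writes are sequenced by a monad (here ST),
-- so in-place mutation cannot be observed from pure code.
mutatedIndex :: Int
mutatedIndex = runST $ do
  arr <- newListArray (0, 2) [10, 20, 30] :: ST s (STUArray s Int Int)
  writeArray arr 1 99
  readArray arr 1
```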
>> Personally, I won't use 'index' in my code. I'll happily dereference the pointer in the IO monad [...]. For my purposes, 'unsafeUseAsCStringLen' is a perfectly safe function.
> [So you can] break the semantics locally and nobody will notice. It's not a technique we should encourage however.
Why do you keep saying that I would break semantics? The code that breaks semantics is the one that uses unsafePerformIO, and ByteString does that, not my code. When I refrain from using those parts of Data.ByteString, then my use of the underlying pointer doesn't "break semantics locally", it doesn't break them at all.
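For concreteness, the sharing both sides are arguing about looks like this (a sketch using Data.ByteString and Data.ByteString.Unsafe as in later bytestring releases; at the time the unsafe functions lived in a different module). 'take' copies nothing, so a pure slice aliases the original buffer, and a write through the raw pointer in IO becomes observable through the slice; whether that counts as "breaking semantics" is exactly the point in dispute:

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Ptr (castPtr)
import Foreign.Storable (pokeByteOff)

demo :: IO ()
demo = do
  let bs    = B.pack [1, 2, 3, 4]
      front = B.take 2 bs          -- shares bs's buffer; no copy is made
  unsafeUseAsCStringLen bs $ \(p, _len) ->
    pokeByteOff (castPtr p) 0 (0xFF :: Word8)   -- mutate the shared buffer
  print (B.unpack front)           -- the "pure" slice observes the write
```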
> Thanks very much for pointing out where we are copying more than necessary.
You are welcome. And please don't forget the race condition in readFile. I feel that problem is actually more significant than the one we're discussing right now, but my impression is that the report hasn't been taken particularly seriously.
> As for the last bit of performance difference due to the cache benefits of reusing a mutable buffer rather than allocating and GCing a range of buffers, I can't see any way within the existing design how we can achieve that.
Exactly, with the current API that problem cannot be solved. It's a shame.
> Bear in mind that these cache benefits are fairly small in real benchmarks, as opposed to 'cat' on fully cached files.
Do I understand that right? It sounds as if you were saying that -- in the general case -- allocating a new buffer for every single read() is not significantly slower than re-using the same buffer every time. Is that what you meant to say?
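For reference, the reuse pattern under discussion is the classic fixed-buffer copy loop, sketched here with hGetBuf/hPutBuf from System.IO, which bypass ByteString entirely (buffer size is an arbitrary choice):

```haskell
import Foreign.Marshal.Alloc (allocaBytes)
import System.IO (Handle, hGetBuf, hPutBuf, stdin, stdout)

-- One buffer, allocated once and refilled by every read(), versus
-- B.hGet, which allocates (and later GCs) a fresh buffer per chunk.
copyLoop :: Handle -> Handle -> IO ()
copyLoop inh outh =
  allocaBytes 32768 $ \buf ->
    let loop = do
          n <- hGetBuf inh buf 32768
          if n <= 0
            then return ()
            else hPutBuf outh buf n >> loop
    in loop

-- 'cat' with zero per-chunk allocation.
main :: IO ()
main = copyLoop stdin stdout
```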
> If we are trying to optimise the 'cat' case however, e.g. for network servers, there are even lower level things we can do so that no copies of the data have to be made at all, e.g. mmap or Linux's copyfile or splice.
Yes, that's true, but none of these functions is remotely portable -- or even available in any of the Haskell implementations. So these things concern me considerably less than the 'hGet' function that _is_ available. Or rather, the one that apparently won't be available, because we Haskell programmers are way too clever for efficient software.
> ByteString certainly isn't the right abstraction for that though.
I am sorry, but that is nonsense. A ByteString is a tuple consisting of a pointer into raw memory, an integer signifying the size of the front gap, and an integer signifying the length of the payload. That data structure is near perfect for performing efficient I/O. To say that this abstraction isn't right for the task is absurd. What you mean to say is that you don't _intend_ it to be used that way, which is an altogether different thing. Best regards, Peter
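The representation Peter describes corresponds to something like the following (field names are illustrative; the actual constructor in the bytestring package of that era is PS with unnamed strict fields):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr)

data ByteString = PS
  { psPtr    :: !(ForeignPtr Word8)  -- pointer to the raw buffer
  , psOffset :: !Int                 -- offset of the payload (the "front gap")
  , psLength :: !Int                 -- length of the payload
  }
```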