
Duncan Coutts writes:
> What you want is just fine, but it's a mutable interface not a pure one. We cannot provide any operations that mutate an existing ByteString without breaking the semantics of all the pure operations.
Is that so? How exactly does mutating a ByteString break the semantics of the pure function 'take'?
> It's very much like the difference between the MArray and IArray classes, for mutable and immutable arrays. One provides index in a monad, the other is pure.
Right. Now I wonder: why does ByteString provide an immutable interface but not a mutable one? Apparently mutable interfaces are useful for some purposes, right? Why else would the Array package provide one?
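The distinction Duncan is drawing can be sketched in a few lines (a minimal example using the array package's Data.Array and Data.Array.ST modules; values are illustrative):

```haskell
import Control.Monad.ST (ST, runST)
import Data.Array (listArray, (!))
import Data.Array.ST (STUArray, newListArray, readArray, writeArray)

-- Immutable interface: indexing is a pure function.
pureIndex :: Int
pureIndex = listArray (0, 2 :: Int) [10, 20, 30] ! 1

-- Mutable interface: reads and writes are sequenced by a monad (here ST),
-- so in-place mutation cannot be observed from pure code.
mutatedIndex :: Int
mutatedIndex = runST $ do
  arr <- newListArray (0, 2) [10, 20, 30] :: ST s (STUArray s Int Int)
  writeArray arr 1 99
  readArray arr 1
```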
>> Personally, I won't use 'index' in my code. I'll happily dereference the pointer in the IO monad [...]. For my purposes, 'unsafeUseAsCStringLen' is a perfectly safe function.
> [So you can] break the semantics locally and nobody will notice. It's not a technique we should encourage however.
Why do you keep saying that I would break semantics? The code that breaks semantics is the one that uses unsafePerformIO, and ByteString does that, not my code. When I refrain from using those parts of Data.ByteString, then my use of the underlying pointer doesn't "break semantics locally", it doesn't break them at all.
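For concreteness, the sharing both sides are arguing about looks like this (a sketch using Data.ByteString and Data.ByteString.Unsafe as in later bytestring releases; at the time the unsafe functions lived in a different module). 'take' copies nothing, so a pure slice aliases the original buffer, and a write through the raw pointer in IO becomes observable through the slice; whether that counts as "breaking semantics" is exactly the point in dispute:

```haskell
import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafeUseAsCStringLen)
import Data.Word (Word8)
import Foreign.Ptr (castPtr)
import Foreign.Storable (pokeByteOff)

demo :: IO ()
demo = do
  let bs    = B.pack [1, 2, 3, 4]
      front = B.take 2 bs          -- shares bs's buffer; no copy is made
  unsafeUseAsCStringLen bs $ \(p, _len) ->
    pokeByteOff (castPtr p) 0 (0xFF :: Word8)   -- mutate the shared buffer
  print (B.unpack front)           -- the "pure" slice observes the write
```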
> Thanks very much for pointing out where we are copying more than necessary.
You are welcome. And please don't forget the race condition in readFile. I feel that problem is actually more significant than the one we're discussing right now, but my impression is that the report hasn't been taken particularly seriously.
> As for the last bit of performance difference due to the cache benefits of reusing a mutable buffer rather than allocating and GCing a range of buffers, I can't see any way within the existing design how we can achieve that.
Exactly, with the current API that problem cannot be solved. It's a shame.
> Bear in mind that these cache benefits are fairly small in real benchmarks, as opposed to 'cat' on fully cached files.
Do I understand that right? It sounds as if you were saying that -- in the general case -- allocating a new buffer for every single read() is not significantly slower than re-using the same buffer every time. Is that what you meant to say?
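For reference, the reuse pattern under discussion is the classic fixed-buffer copy loop, sketched here with hGetBuf/hPutBuf from System.IO, which bypass ByteString entirely (buffer size is an arbitrary choice):

```haskell
import Foreign.Marshal.Alloc (allocaBytes)
import System.IO (Handle, hGetBuf, hPutBuf, stdin, stdout)

-- One buffer, allocated once and refilled by every read(), versus
-- B.hGet, which allocates (and later GCs) a fresh buffer per chunk.
copyLoop :: Handle -> Handle -> IO ()
copyLoop inh outh =
  allocaBytes 32768 $ \buf ->
    let loop = do
          n <- hGetBuf inh buf 32768
          if n <= 0
            then return ()
            else hPutBuf outh buf n >> loop
    in loop

-- 'cat' with zero per-chunk allocation.
main :: IO ()
main = copyLoop stdin stdout
```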
> If we are trying to optimise the 'cat' case however, e.g. for network servers, there are even lower level things we can do so that no copies of the data have to be made at all, e.g. mmap or Linux's copyfile or splice.
Yes, that's true, but none of these functions is remotely portable -- or even available in any of the Haskell implementations. So these things concern me considerably less than the 'hGet' function that _is_ available. Or rather, the one that apparently won't be available, because we Haskell programmers are way too clever for efficient software.
> ByteString certainly isn't the right abstraction for that though.
I am sorry, but that is nonsense. A ByteString is a tuple consisting of a pointer into raw memory, an integer signifying the size of the front gap, and an integer signifying the length of the payload. That data structure is near perfect for performing efficient I/O. To say that this abstraction isn't right for the task is absurd. What you mean to say is that you don't _intend_ it to be used that way, which is an altogether different thing. Best regards, Peter
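The representation Peter describes corresponds to something like the following (field names are illustrative; the actual constructor in the bytestring package of that era is PS with unnamed strict fields):

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr)

data ByteString = PS
  { psPtr    :: !(ForeignPtr Word8)  -- pointer to the raw buffer
  , psOffset :: !Int                 -- offset of the payload (the "front gap")
  , psLength :: !Int                 -- length of the payload
  }
```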