
On Mon, 2007-09-03 at 01:40 +0200, Peter Simons wrote:
> Donald Bruce Stewart writes:
>
> > It's been a while since I benchmarked the IO performance, so it looks like time to revisit this issue.
>
> Hi Bruce,
>
> my impression is that performance cannot be improved significantly without providing a modified input API. Reading input in large chunks is fine for some purposes, but not for others. A network proxy server, for example, cannot afford to buffer large amounts of data for every connection: an I/O buffer size of, say, 256 KB would not be a good choice for such a program, whereas something like 4 KB typically is. The small buffer size, however, means that large amounts of data will be read through a fairly large number of hGet calls, and as it stands, hGet performs at least one malloc() per read() call. That will be slow, no matter how optimised that code is.
>
> One way to get malloc() out of the picture would be to provide a variant of hGet that takes an existing, pre-allocated buffer as an argument, so that the user can allocate a ByteString once and re-use it for every single hGet and hPut.
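For concreteness, the kind of API Peter is proposing might look roughly like the sketch below. The name hGetInto is hypothetical (it is not part of Data.ByteString), and note the caveat: the returned ByteString aliases the caller's buffer, so it is only valid until the buffer is reused.

```haskell
-- Hypothetical sketch of an hGet variant that fills a caller-supplied,
-- pre-allocated buffer, so no fresh allocation happens per read.
module Main where

import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Internal as BI
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import System.IO

-- Read at most 'cap' bytes into the existing buffer and return a
-- ByteString aliasing it. The result is only valid until the buffer
-- is reused, so the caller must consume it before the next call.
hGetInto :: Handle -> ForeignPtr Word8 -> Int -> IO B.ByteString
hGetInto h fp cap = do
  n <- withForeignPtr fp (\p -> hGetBuf h p cap)
  return (BI.fromForeignPtr fp 0 n)

main :: IO ()
main = do
  writeFile "hGetInto-demo.txt" "hello, world"
  h <- openBinaryFile "hGetInto-demo.txt" ReadMode
  buf <- mallocForeignPtrBytes 4096  -- allocated once, reused for every read
  s <- hGetInto h buf 4096
  hClose h
  B.putStrLn s
```

(mallocForeignPtrBytes gives pinned memory in GHC, so passing the pointer to hGetBuf is safe.)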
Stefan is right: the only place C's malloc() is used is in strict ByteString's hGetContents. Everywhere else we use GHC's pinned heap allocation. Strict ByteString's readFile does not use hGetContents, because in that case we know the file length; it uses hGetBuf instead. Also, createAndTrim does no copying when no trimming is necessary, so in the best case we do only a single copy via hGetBuf.

As I recall from profiling this for the ByteString paper, with a lazy ByteString implementation of Unix 'cat' on disk files (rather than a network socket), we should only copy each chunk once (as far as I can see). The slowdown compared to a simple hGetBuf 'cat' was all down to cache locality: we cycle through a range of buffers rather than a single cache-hot buffer. The time overhead of memory allocation and GC is negligible.

In the tests I did at the time, I found that on fully cached files the slowdown compared to using a single mutable buffer was about 2-3x. I figured that overhead is not bad, considering it represents the worst case: no work is being done to transform the data in any way, and we're not doing any real IO, just copying data from kernel memory to user space memory.

If we're doing lots of short reads (as when reading from sockets), then there is an opportunity for improvement. We could read into an internal buffer and cut off, as an immutable ByteString, only the chunk that got filled. The remainder of the buffer could be used for the following reads, until the whole buffer is exhausted and a new buffer has to be allocated. This is the buffering strategy we use in the binary library when serialising.

Duncan
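The slicing strategy Duncan describes could be sketched roughly as follows. The names hGetChunk, Buf and newBuf are mine, not the binary library's internals: many short reads share one allocated buffer, each read carving off an immutable ByteString from the filled region, and a fresh buffer is allocated only when the old one is exhausted.

```haskell
-- Rough sketch of the buffer-slicing strategy: read into the unused tail
-- of a shared buffer and hand out immutable slices of the filled parts.
module Main where

import qualified Data.ByteString as B
import qualified Data.ByteString.Internal as BI
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, mallocForeignPtrBytes, withForeignPtr)
import Foreign.Ptr (plusPtr)
import System.IO

-- A shared read buffer: the storage, the next free offset, the capacity.
data Buf = Buf !(ForeignPtr Word8) !Int !Int

bufSize :: Int
bufSize = 32 * 1024

newBuf :: IO (IORef Buf)
newBuf = do
  fp <- mallocForeignPtrBytes bufSize
  newIORef (Buf fp 0 bufSize)

-- Read up to 'want' bytes into the unused tail of the shared buffer and
-- return an immutable ByteString covering just the bytes that arrived.
-- Only when the buffer is exhausted do we allocate a fresh one.
hGetChunk :: IORef Buf -> Handle -> Int -> IO B.ByteString
hGetChunk ref h want0 = do
  let want = min want0 bufSize
  Buf fp off cap <- readIORef ref
  (fp', off') <-
    if cap - off >= want
      then return (fp, off)                         -- room left: reuse
      else do fp' <- mallocForeignPtrBytes bufSize  -- exhausted: fresh buffer
              return (fp', 0)
  n <- withForeignPtr fp' (\p -> hGetBuf h (p `plusPtr` off') want)
  writeIORef ref (Buf fp' (off' + n) bufSize)
  -- The chunk aliases the shared buffer, but only a region that will
  -- never be written again, so it is safe to treat as immutable.
  return (BI.fromForeignPtr fp' off' n)

main :: IO ()
main = do
  writeFile "chunk-demo.txt" "abcdefghij"
  h <- openBinaryFile "chunk-demo.txt" ReadMode
  ref <- newBuf
  c1 <- hGetChunk ref h 4
  c2 <- hGetChunk ref h 4
  c3 <- hGetChunk ref h 4  -- only 2 bytes remain before EOF
  hClose h
  print (c1, c2, c3)
```

A real implementation would want hGetBufSome rather than hGetBuf for socket-style short reads, but the slicing logic is the same.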