
firefly:
General
-------
ByteStrings are faster than native Haskell Strings. This is good and expected. Their memory use, though, is another matter: there seems to be a grave memory bug in GHC 6.9.
Bug report -- since the bytestring library hasn't changed between 6.8.2 and head, that sounds like material for a bug report.
Lazy bytestrings are slower than strict bytestrings -- this was completely unexpected to me.
Which programs should I look at? Is there a specific program or two where lazy bytestring performance is entirely out of whack (relative to 6.8.2)? If so, please forward them to me and Duncan.
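For concreteness, here is the kind of minimal comparison program that makes a strict/lazy gap easy to reproduce (the newline-counting task and the command-line interface are illustrative only, not one of the benchmark programs from the series); running each variant under +RTS -sstderr lets you compare both time and allocation:

-- Illustrative strict-vs-lazy comparison, not one of the benchmark
-- programs: count newlines in a file with either interface.
import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Lazy.Char8 as L

main :: IO ()
main = do
  args <- getArgs
  case args of
    ["strict", file] -> S.readFile file >>= print . S.count '\n'
    ["lazy",   file] -> L.readFile file >>= print . L.count '\n'
    _                -> putStrLn "usage: countlines (strict|lazy) FILE"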
I/O
---
The I/O patterns seen in the various programs are:

1) strict bytestrings: read increasingly large chunks (they double in size every time). I don't know whether the already-read data has to be copied around, or whether the memory allocator can extend existing blocks in place (if they sit at the "front" of the heap, say) the way realloc() can.
If you're reading from stdin and we don't know the size, bytestring's getContents uses realloc.
Even if the blocks do get copied around, that's not where the performance limiter is (speed-wise). It does seem to be a bit unwise in terms of memory pressure.
Does the GC kick in every time the heap has to expand? Does it do a full collection then? If no allocations other than the one that caused the heap to expand have happened since the last collection, perhaps the collection could be skipped (or some similar criterion).
2) lazy bytestrings: read blocks of 32K-8 bytes, as expected. The C benchmarks show that there is no penalty for reading 32K-8 bytes rather than a full 32K. (A sketch of this chunked-read pattern follows after this list.)
3) native Strings: read blocks of 8K.
The C benchmarks show that it barely matters whether the block size is 4K, 32K, or 32K-8 bytes; the small differences due to block size are completely in the noise. Reading very large blocks, though, as the strict bytestrings do, does have a cost (38-70% slower, depending on CPU, kernel, RAM, bus speeds, etc. -- see email 10 in the series). Even so, that is still peanuts compared to the total run time of almost all the benchmarks.
Useful information, thanks.
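For reference, here is a rough sketch of the chunked-read pattern behind point 2. It is a simplified approximation, not the library's actual hGetContents, and it assumes that Data.ByteString.Lazy.Internal exports defaultChunkSize (32K minus a small per-chunk allocator overhead, which is where the 32K-8 reads come from):

-- Simplified sketch of lazy chunked reading: pull one chunk of
-- defaultChunkSize bytes per demand, using lazy IO.
import System.IO (Handle)
import System.IO.Unsafe (unsafeInterleaveIO)
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Internal (defaultChunkSize)

lazyChunks :: Handle -> IO L.ByteString
lazyChunks h = fmap L.fromChunks loop
  where
    loop = unsafeInterleaveIO $ do
      c <- S.hGet h defaultChunkSize   -- 32K minus a small overhead
      if S.null c
        then return []
        else do cs <- loop
                return (c : cs)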
Backend
-------
The backend turns out to be rather bad. It is generally very easy to improve the generated code by hand without doing any sophisticated analysis. One could rewrite inner loops with heroic code (loading four characters at a time into a 32-bit register, or even using MMX to handle eight characters in parallel), but that isn't necessary to get a good improvement and start being competitive with C.
The backend is probably what makes some of the lazy bytestring and native String code costly: there are too many tests of whether the buffer has been exhausted (I think). Some simple hoisting and common subexpression elimination would go a long way here.
There may be things we can do at the library level too. We changed the lazy bytestring representation in the last release cycle to remove a number of tests. A specific example would help, though, if you have one.
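To make the kind of loop concrete (a hypothetical example, not taken from the benchmark programs): an index-based scan over a strict ByteString re-tests the end-of-buffer condition on every iteration, and in the generated code that test and the surrounding index arithmetic are exactly where hoisting and better register allocation would pay off:

{-# LANGUAGE BangPatterns #-}
-- Hypothetical inner loop: count occurrences of a byte with an explicit
-- index. The exhaustion test (i >= len) runs on every iteration.
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.ByteString.Unsafe as BU

countByte :: Word8 -> B.ByteString -> Int
countByte w bs = go 0 0
  where
    len = B.length bs
    go !i !acc
      | i >= len                 = acc
      | BU.unsafeIndex bs i == w = go (i + 1) (acc + 1)
      | otherwise                = go (i + 1) acc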
Thoroughness
------------
It really is necessary to benchmark more than one GHC/library combination; otherwise the bad memory bugs wouldn't stand out so clearly, and I would have believed, as Don Stewart did, that the memory allocation bug had been fixed.
I fixed a particular issue in the library. But please file a bug report if there are other issues, particularly with respect to GHC head.
Action Items
------------
Suggested short-term actions:

o Can the allocator expand blocks in place? If not, see whether that can be implemented.
o Find out why the "by hand" lazy bytestring versions are that slow - is it due to lack of hoisting in the backend?
Example please!
o Improve the backend, mainly by not hiding information from the register allocator. Alternatively, use an assembly-to-assembly transformer that improves the register allocation.
Some nice examples that could help motivate the native code generator hackers would be useful.
o Fix -sstderr.
o Perhaps even let the RTS print out /proc/self/maps or /proc/self/status, or the "peak RSS" returned by getrusage() in the ru_maxrss field? (A stopgap for this is sketched below.)
o Find the memory leaks in the GHC 6.9 code.
Bug report time.
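In the meantime, a Linux-only stopgap for the peak-RSS point is to read it from the proc filesystem yourself; this is a workaround sketch (the VmHWM line in /proc/self/status is the peak resident set size), not an RTS feature:

-- Linux-only workaround sketch: report peak resident set size by
-- printing the VmHWM line from /proc/self/status, e.g. at program exit.
import Data.List (isPrefixOf)

printPeakRSS :: IO ()
printPeakRSS = do
  status <- readFile "/proc/self/status"
  mapM_ putStrLn [ l | l <- lines status, "VmHWM" `isPrefixOf` l ]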
o GHC should allow round-tripping of the C-- code it generates.
o It would be awfully nice if the backend didn't sometimes throw the recognizable labels away.
o Can the Core/STG code be made easier to read?
Can you ensure these points are summarised and directed to the GHC bug tracker, http://hackage.haskell.org/trac/ghc/newticket?type=bug, so this work isn't lost on -cafe@.

-- Don