
General
-------

Bytestrings are faster than haskell strings. This is good and expected.
But their memory use, ouch! There seems to be a grave memory bug in 6.9.

Lazy bytestrings are slower than strict bytestrings -- this was completely
unexpected to me.

The strictness analyzer works *very* well. It rarely made a difference to
mark arguments as strict (and when it did, it was very easy to do).

I/O
---

The I/O patterns seen in the various programs are:

1) strict bytestrings = read increasingly large chunks (they double every
   time). I don't know whether the already-read data have to be copied
   around or whether the memory allocator can handle extending existing
   blocks (if they are at the "front" of the heap, say) like realloc() can.
   Even if the blocks do get copied around, that's not where the performance
   limiter is (speed-wise). It does seem to be a bit unwise in terms of
   memory pressure. Does the gc kick in every time the heap has to expand?
   Does it do a full collection then? If no allocations other than the one
   that caused the heap to expand have happened since the last collection,
   then perhaps skip the collection? (or some criterion like that)

2) lazy bytestrings = read blocks of 32K-8 bytes, as expected. The C
   benchmarks show that there's no penalty for reading 32K-8 vs. 32K.

3) native strings = read blocks of 8K. The C benchmarks show that it barely
   matters whether the block size is 4K, 32K, or 32K-8 bytes.

In any case, the small differences due to block size are completely in the
noise. Reading very large blocks, though, as the strict bytestrings do,
actually has a cost (38-70% slower, depending on the CPU, kernel, RAM,
bus speeds, etc -- see email 10 in the series). It is still peanuts compared
to the total runtime for almost all the benchmarks. (A small sketch of the
two bytestring flavours appears after the Backend section below.)

Backend
-------

The backend turns out to be rather bad. It is generally very easy to improve
the generated code by hand without doing anything in terms of sophisticated
analysis. One can rewrite inner loops using heroic code (load four
characters at a time into a 32-bit register or even use MMX to handle eight
characters in parallel), but that isn't really necessary to gain a good
improvement and start being competitive with C.

The backend is probably what makes some of the lazy bytestring and native
string code costly, because there are too many tests to see if the buffer
has been exhausted (I think). Some simple hoisting and common subexpression
elimination would go far here.

For the simpler strict bytestring benchmarks, heroic backend optimizations
are not necessary because they give less of an improvement over merely
competent register use than sensible I/O and buffer management do (cache
effects start to be important once the compiler generates sensible code).

I would have liked to examine backend performance further by playing with
the generated C-- code but, unfortunately, the C-- that ghc emits won't
compile. Dissecting the generated machine code for the more complicated
benchmarks was so hard that I skipped it -- I can't even recognize the
labels! It's one thing if the backend generates lots of extra labels, but
quite another if the expected labels simply are not there! (This is why
tools/cut.pl complains that it can't find the main loop for some of the
programs -- if the */*.discut file is unexpectedly short, that's the
reason.)
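To make the comparison concrete, here is a minimal sketch -- not one of the
actual benchmark programs, just the general shape -- of a space-counting
program written against the strict and the lazy bytestring interfaces:

    import qualified Data.ByteString.Char8      as S
    import qualified Data.ByteString.Lazy.Char8 as L

    -- Strict: readFile pulls the whole file into one contiguous buffer
    -- (the doubling reads described above).
    countStrict :: FilePath -> IO Int
    countStrict path = S.count ' ' `fmap` S.readFile path

    -- Lazy: the file arrives as a sequence of 32K-8 byte chunks.
    countLazy :: FilePath -> IO Int
    countLazy path = (fromIntegral . L.count ' ') `fmap` L.readFile path

The interesting differences are in what the library and the backend make of
these one-liners, not in the source.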
Quality of measurements
-----------------------

The speed measurements are of extremely high quality -- much higher, it
turns out, than is needed at this moment. I probably spent a bit too much
effort there.

The memory measurements are unusual. I don't recall having seen anything
like them elsewhere, because one seldom has to work with uncooperative
programs when benchmarking (the preloading trick works with any program, as
long as it is dynamically linked on Linux and uses the standard loader).
They are probably the biggest single contribution from this work, together
with the visualization provided by the bar charts.

Thoroughness
------------

It really is necessary to benchmark more than one ghc/library combination,
otherwise the bad memory bugs wouldn't stand out so clearly, or I would have
believed, as Don Stewart did, that he had fixed the memory allocation bug.
It also matters which microarchitectures one uses, as pointed out by email
10 in this series. And you have to have many benchmarks, otherwise you might
not really be measuring what you think you measure. I could probably have
gotten away with fewer benchmarks in this case, but that would have left a
nagging feeling at the back of my head that there was a benchmark I /could/
have included that would have shown up a spectacular performance bug.

Using such big input files (150MB) made the programs run long enough that
the timer granularity (1ms) was small compared to their runtime. It also
made the memory performance problems really stand out (and be impossible to
ignore). I think that was a good idea.

Having automatic feedback on the quality of the timing measurements helped a
lot during the development of the testing harness because it made it easier
and faster for me to eliminate the causes of imprecision. That's why I wrote
the eatmem tool and that's why the test harness does a dd of the input file
and the compiled program before running the tests. It should probably also
dd the C library (and the few other libraries linked in: math, GNU
multiprecision, and TLS stuff), but the quality estimators show that any
improvement would be very small.

Using top together with huge input files convinced me that -sstderr was
untrustworthy, so I developed the pause-at-end preloading hack. Paranoia
paid off big time there.

Visual presentation
-------------------

The bar charts (with UTF-8) turned out to be so much more readable than raw
numbers. This is a trick I learned from the Cairo team, which created a
performance suite very early on. The bar is drawn with '%' if the
measurement is bad; that turned out to be a much clearer visual indicator
than my previous '*' and '!' flag characters. Really, there's no good reason
not to use UTF-8 these days. (A tiny sketch of the bar idea follows at the
end of this section.)

I also discovered that the program names used for the benchmarks really
matter. I had started out with names like 'cntspaces12' and 'cntbytes34'
and kept getting confused about which program measured what. If the names
are good, you don't need to put a description of each benchmark in the
report.

The order the benchmarks are presented in also matters. It is important that
they are grouped, and it is also important that the order is as stable as
possible, even when not all runs contain exactly the same benchmarks.
Separating the time and memory bar charts, and separating the byte counting
and space counting benchmarks, also turned out to be useful. I had neither
at first.

And finally, merging of reports. I had no idea it would be so useful before
I wrote the scripts for it. This is what makes the memory bugs in 6.9 really
stand out, and this is what shows the improvements of bytestring 0.9.0.2 (in
particular with ghc 6.8.2). It also shows how wildly the performance of four
different microarchitectures can differ.
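(The real report scripts are not reproduced here; the following is only a
rough haskell sketch of the bar idea, assuming each bar is scaled against
the largest measurement in its group and drawn with a block character, or
with '%' when the measurement is flagged as bad. The names and numbers are
made up.)

    import Text.Printf (printf)

    -- One report row: name, value, and a bar scaled against the group maximum.
    bar :: Int -> Double -> (String, Double, Bool) -> String
    bar width maxVal (name, val, bad) =
        printf "%-20s %8.2f  %s" name val (replicate n ch)
      where
        n  = round (fromIntegral width * val / maxVal)
        ch = if bad then '%' else '█'   -- '%' marks an untrustworthy measurement

    main :: IO ()
    main = mapM_ (putStrLn . bar 40 9.87)
        [ ("spacecount-strict", 3.21, False)   -- made-up data
        , ("spacecount-lazy",   9.87, False)
        , ("spacecount-string", 7.54, True) ]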
Scripting
---------

Scripting everything is key to the success of an effort like this. I am VERY
happy that I scripted not just the running of the benchmarks but also the
report handling and even the installation of the ghc/bytestring
combinations. (I also scripted the sending of these 17 emails -- let's see
if the script works :) )

The predictions elicited by my teaser email
-------------------------------------------

As expected, predicting the performance of haskell programs is really hard,
even modulo performance bugs. And, no, strings are not "silly".

Action Items
------------

Suggested short-term actions:

 o Can the allocator expand blocks? If not, see if that can be implemented.
 o Find out why lazy bytestrings "by hand" are that slow -- is it due to
   lack of hoisting in the backend?
 o Improve the backend, mainly by not hiding information from the register
   allocator. Alternatively, use an assembly-to-assembly transformer that
   improves the register allocation.
 o Fix -sstderr.
 o Perhaps even let the RTS print out /proc/self/maps or /proc/self/status,
   or the "peak RSS" returned by getrusage() in the ru_maxrss field?
 o Find the memory leaks in the ghc 6.9 code.
 o ghc should allow round-tripping of the C-- code it generates.
 o It would be awfully nice if the backend didn't sometimes throw the
   recognizable labels away.
 o Can the core/stg code be made easier to read?

-Peter