[14/16] SBM: Behind the measurements (rationale)

// I am getting sick and tired of working on this project and it's probably
// better to get it fired off than polishing it any further.
//
// This email could benefit from being rewritten from a rough draft into a
// well-crafted letter but that would take a couple of hours.
//
// So here it is, a lot rougher than I'd like -- but it *IS* :)

Why such big input files? Big files are the easiest way to spot non-linearity and bad memory behaviour. In any case, the files should be big enough to overflow the caches and kick in the GC. (Short files are interesting, too, but the big ones provoke more complex behaviour from the run-time system and the CPU. If the complex behaviour is well-behaved, then the simple behaviour probably is too -- though its constant factors can still be improved. And if the complex behaviour is bad, then shouldn't that be fixed in any case?)

wait4() fills in a struct with info about the child program's resource usage. Unfortunately, the ru_maxrss field is not filled in. This seems to be a general Unix problem; I've seen complaints on the net that Solaris doesn't fill it in, either. So another solution was needed: pause-at-end plus /proc/self/maps and /proc/self/status. VmHWM is the peak of VmRSS, which is the resident (working) set size. It doesn't say what is shared with other processes or the operating system, though. In our case, we don't expect to share anything but some libraries -- which nobody else wants to share with us anyway (except for the C library); we are the only user of them. (Sketches of both approaches follow below.)

I discovered about a week ago that I could probably have used waitid() with the WNOWAIT flag, but I didn't know about it at the time. Pause-at-end was quick to write, anyway: it took about 15 minutes from the desire to know the peak memory use to having written and tested the first cut of it. Pause-at-end is not completely bullet-proof in the case of dynamic libraries that get unloaded before the end of the program has been reached. On the other hand, it is plenty good enough for these tests and it conceivably allows more intricate poking around than the waitid() solution would. (A sketch of the waitid() idea follows below as well.)

Getting good measurements: eatmem, dd -- and I should probably also dd the libraries. It is good to have a "sacrificial run", and good to measure how good the measurements are (relative standard deviation plus a user/sys/real check; sketched below).

Why averages? The disturbances are mostly interrupts, daemons that everybody has anyway, and slightly luckier/unluckier physical pages. These are real effects that nobody can control anyway. I'm not interested in the best possible times on an ideal, undisturbed machine with a helpful kernel; I'm interested in clean times under realistic circumstances. Therefore average instead of minimum.

Why I use real and not user/sys: because of how blocking reads vs. mmap vs. madvise/fadvise vs. reading in a separate thread would be handled in the future. User+sys would probably give me better numbers at the moment and I could change to real later. Still, I choose to stay with real (and the difference is marginal, anyway). It is funny that the exact distribution of time between sys and user fluctuates a lot: in space-bslc8-lenfil-2, sys varies between 0.160s and 0.244s, while real is completely stable with 5x 1.396s and 1x 1.397s.

Look at /proc/interrupts; perhaps copy it before/after to .intr? Warn if more than 100 (or 1000) Hz + 10%? Write date/time + runlevel to platforminfo and/or sysinfo.

Bar charts: why bar charts? Should the time/mem bar charts be of equal length? I don't think so. It is hard to colour them in a text file (that would work with less -r and the console, but not in an email or a text editor), and a visual difference is good -- but they should perhaps not be /that/ different.

Visible markers if the measurements are bad: the 5% real/user/sys check, which is typically within 0.1% on my old laptop when doing a quick or thorough benchmark, occasionally up to 1% -- and 3% on c/byte-4k, because that one only takes 56ms in total. The output shows how tight the user/sys/real agreement is.
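Here is a minimal sketch of the wait4() dead end -- illustrative only, not the harness's actual code. (Newer Linux kernels reportedly do fill in ru_maxrss, but the ones used here did not.)

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/resource.h>
    #include <sys/wait.h>

    /* run a child program and ask the kernel for its resource usage */
    int main(int argc, char *argv[])
    {
        pid_t pid = fork();
        if (pid == 0) {
            execvp(argv[1], &argv[1]);   /* the program being measured */
            _exit(127);
        }

        int status;
        struct rusage ru;
        if (wait4(pid, &status, 0, &ru) < 0) {
            perror("wait4");
            return 1;
        }
        /* ru_maxrss should be the child's peak resident set size,
           but the kernels in question leave it as zero */
        printf("ru_maxrss = %ld\n", ru.ru_maxrss);
        return 0;
    }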
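The /proc side of pause-at-end boils down to something like this -- a sketch, assuming it runs inside the measured process just before exit:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* dump the interesting lines from /proc/self/status; VmHWM is
       the high-water mark of VmRSS, i.e. the peak memory use */
    static void pause_at_end(void)
    {
        FILE *f = fopen("/proc/self/status", "r");
        char line[256];

        if (f) {
            while (fgets(line, sizeof line, f))
                if (strncmp(line, "VmHWM", 5) == 0 ||
                    strncmp(line, "VmRSS", 5) == 0)
                    fputs(line, stderr);
            fclose(f);
        }
        pause();   /* stay alive so /proc/<pid>/maps and friends can
                      be inspected from the outside at leisure */
    }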
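And the waitid()/WNOWAIT idea I found out about too late (again only a sketch -- whether the Vm* lines are still present in /proc/<pid>/status once the child is a zombie is exactly the kind of thing that would need checking; pause-at-end sidesteps that question):

    #include <stdio.h>
    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    /* peek at the exited child before reaping it: WNOWAIT leaves it
       in a waitable state, so /proc/<pid>/ still exists */
    void peek_then_reap(pid_t pid)
    {
        siginfo_t si;
        char path[64];

        if (waitid(P_PID, pid, &si, WEXITED | WNOWAIT) == 0) {
            snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
            /* ...read 'path' here... */
        }
        waitid(P_PID, pid, &si, WEXITED);   /* now actually reap it */
    }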
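The measurement-quality checks amount to something like the following (the names are mine, not the harness's):

    #include <math.h>

    /* relative standard deviation of n timings, in percent */
    double rel_stddev(const double *t, int n)
    {
        double sum = 0.0, sq = 0.0;
        for (int i = 0; i < n; i++) {
            sum += t[i];
            sq  += t[i] * t[i];
        }
        double mean = sum / n;
        double var  = sq / n - mean * mean;
        return 100.0 * sqrt(var > 0.0 ? var : 0.0) / mean;
    }

    /* the 5% user/sys/real check: user+sys should account for nearly
       all of real; a bigger gap means the run was disturbed */
    int measurement_bad(double real, double user, double sys)
    {
        return fabs(real - (user + sys)) > 0.05 * real;
    }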
Microarchitecture: performance counters would be interesting to look at once the obvious performance problems have been handled. Let's fix the memory usage of bytestrings, the performance of lazy bytestrings, and start using registers in the machine code first.

The regularity of the input file probably means that the branch predictor on all three CPUs can remember the pattern of spaces vs. non-spaces (or at least part of the pattern). Branch predictors don't just use a two-bit saturating counter (strongly non-taken/weakly non-taken/weakly taken/strongly taken); they also try to remember the pattern of jumps/non-jumps. A more realistic test would have a less regular input file. This effect is very small given the current performance limiters, though.

Cache: turned out to be pretty regular (by eyeballing the cachegrind reports). Go up a factor of 10 in file size and the number of accesses also goes up a factor of 10, while the miss ratios stay the same. The miss ratios differ a bit between the benchmarks, but I don't think it's time to look into that yet. The data are available, though, for those who can't wait.

Minor page faults: we gather those through /usr/bin/time -- and could also get the same info by dumping the right file inside /proc/self/. Probably not important yet; probably will be once all the low-lying fruit has been gathered up from the ground. More of a factor on slower OSes than Linux.

C files, buffer size: reading it all in one go is slower than (re)using a small buffer. Cache effects, both in the operating system when copying (because the destination will be cached with a small buffer but uncached with a big buffer) and in the application (everything will be cached with the small buffer, nothing with the big buffer). Note that at least the Core and the Athlon64 have automatic prefetchers that try to fill the cache in advance so that we don't have to wait for the cache misses; that doesn't quite seem to work here. Older caches had a different write behaviour: they were write-through instead of the modern (lazy) write-back. For those caches, writing to the user-space buffer should be slow even when a small buffer is reused all the time, because we would have to wait for all the writes to be flushed out to main memory. (A sketch of the small-buffer loop follows below.)

C files, getchar()/getchar_unlocked(): a comparison with getchar() is NOT a comparison with what the simple Haskell programs do. getchar() is thread-safe by default (because of threaded libraries); getchar_unlocked() is what the Haskell programs correspond to (see the sketch below). getchar() and getchar_unlocked() use a single buffer for stdin. getwchar() and getwchar_unlocked() were included at the insistence of wli. They are much slower, because the encoding depends on the locale at run time, so getwchar_unlocked() can't be a macro the way getchar_unlocked() is. With an indirect jump it should be the same speed as getchar() on the Core and the Athlon64 -- but curiously it isn't.
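The small-buffer variant looks roughly like this (a sketch; the 4KB size is illustrative):

    #include <stdio.h>
    #include <unistd.h>

    /* count the spaces on stdin, reusing a buffer small enough to
       stay in the cache across read() calls */
    int main(void)
    {
        char buf[4096];
        long spaces = 0;
        ssize_t n;

        while ((n = read(0, buf, sizeof buf)) > 0)
            for (ssize_t i = 0; i < n; i++)
                if (buf[i] == ' ')
                    spaces++;
        printf("%ld\n", spaces);
        return 0;
    }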
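And the getchar_unlocked() counter, i.e. the C loop that is actually comparable to the simple Haskell programs (a sketch; swap in getchar() to get the locked variant):

    #include <stdio.h>

    /* count the spaces on stdin; getchar_unlocked() skips the
       per-call locking that plain getchar() pays for */
    int main(void)
    {
        int c;
        int spaces = 0;   /* a 32-bit int, hence the wrap-around
                             caveat discussed below */

        while ((c = getchar_unlocked()) != EOF)
            if (c == ' ')
                spaces++;
        printf("%d\n", spaces);
        return 0;
    }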
C and Haskell integer sizes and other limitations: the Haskell programs use unboxed 32-bit signed integers, except in the lazy lenfil tests. Most of the C programs are simple and just use an int for the space count. One of them (space-megabuf) is more complicated: off_t is 64-bit, ssize_t is 32-bit, so there is a potential overflow in c/space-megabuf -- and a potential 32-bit wrap-around in all my C tests. The same problem exists in all the Haskell tests, except for the two lenfil tests that use lazy bytestrings, because they use a 64-bit int for the length of the intermediate string of just the spaces filtered out from stdin. Those two can also potentially run out of memory. In practice, they have almost the same limit as the others, because they use about 107MB for the 143MB input file. In other words, they will run out of virtual address space or RAM or swap at about the same time that the others will run out of bits in a 32-bit signed integer.

-Peter