December 2007 - Haskell-Cafe

[6/16] SBM: 6.9.20071124 Athlon Duron
by Peter Firefly Brodersen Lund 22 Dec '07


This set of measurements was captured by Daniel Fischer on one of his
older machines, running SuSE 8.2, which has a Linux 2.4 kernel!

The benchmarks were run on 2007-12-16 using ghc 6.9.20071124.
Unfortunately, the ghc version is not quite the same as the one I've used
for most measurements (6.9.20071119) so things may be a little different
just for that reason.

The results seemed a bit off at first, but now that I have graphs of all
the runs on all the machines they don't seem strange at all.

First of all, the memory use is about the same as on the other machines.
Secondly, the timing differences for the C getchar/getwchar might be
partly due to different versions of the C library. The remaining
differences (a "steeper" profile than on the core duo and the Athlon64)
may be due to different microarchitectures.

-Peter

Fischer's machine
ghc 6.9.20071124
AMD Duron(tm) processor
1200.089 MHz
TESTKIND=THOROUGH
SUFFIX=

Time (byte counting)              std
--------------------         avg  dev slack
hs/byte-bs----acc:         1.892  21‰  0.4 ██▊ |
hs/byte-bs----foldlx:      2.258   3‰  0.1 ███▎ |
hs/byte-bs----foldrx:      2.933   0‰  0.1 ████▎ |
hs/byte-bsl---acc:        14.319  45‰  0.1 ████████████████████▌ |
hs/byte-xxxxx-acc-1:      20.915  17‰  4.0 █████████████████████████████▉ |
hs/byte-xxxxx-acc-2:      20.691   8‰  0.1 █████████████████████████████▌ |
hs/byte-xxxxx-foldl:      20.610   5‰  1.4 █████████████████████████████▍ |
c/byte-getchar:            9.042   0‰  0.1 ████████████▉ |
c/byte-getchar-u:          1.314   3‰  0.2 █▉ |
c/byte-4k:                 0.419   5‰  0.5 ▋ |

Memory:                     Peak
-------                       KB
hs/byte-bs----acc:        147492 ████████████████████████████████████████ |
hs/byte-bs----foldlx:     147492 ████████████████████████████████████████ |
hs/byte-bs----foldrx:     147488 ████████████████████████████████████████ |
hs/byte-bsl---acc:          2896 ▊ |
hs/byte-xxxxx-acc-1:        1612 ▌ |
hs/byte-xxxxx-acc-2:        1612 ▌ |
hs/byte-xxxxx-foldl:        1612 ▌ |
c/byte-getchar:              384 ▏ |
c/byte-getchar-u:            384 ▏ |
c/byte-4k:                   380 ▏ |

Time (space counting)             std
---------------------        avg  dev slack
hs/space-bs-c8-acc-1:      2.467   1‰  0.3 ███▌ |
hs/space-bs-c8-foldlx-1:   2.585   2‰  0.1 ███▊ |
hs/space-bs-c8-foldlx-2:   2.576   2‰  0.3 ███▋ |
hs/space-bs-c8-foldrx:     2.982   8‰  2.3 ████▎ |
hs/space-bs-c8-lenfil:     2.599   1‰  0.2 ███▊ |
hs/space-bslc8-acc-1:     15.228   8‰  0.1 █████████████████████▊ |
hs/space-bslc8-acc-2:     15.855  38‰  0.0 ██████████████████████▋ |
hs/space-bslc8-acc-3:     14.980  14‰  0.0 █████████████████████▍ |
hs/space-bslc8-chunk-1:    2.443   2‰  0.2 ███▌ |
hs/space-bslc8-chunk-2:    2.449   1‰  0.3 ███▌ |
hs/space-bslc8-chunk-3:    2.534   3‰  0.3 ███▋ |
hs/space-bslc8-foldl:      2.938   1‰  0.2 ████▎ |
hs/space-bslc8-foldlx-1:   2.928   1‰  0.0 ████▏ |
hs/space-bslc8-foldlx-2:   2.937   2‰  0.2 ████▎ |
hs/space-bslc8-foldr-1:    4.043   6‰  0.1 █████▊ |
hs/space-bslc8-foldr-2:    4.007   4‰  0.1 █████▊ |
hs/space-bslc8-lenfil-1:   3.240   1‰  0.2 ████▋ |
hs/space-bslc8-lenfil-2:   3.236   1‰  0.2 ████▋ |
hs/space-bsl---foldlx:     2.821   1‰  0.1 ████ |
hs/space-xxxxx-acc-1:     21.002   4‰  0.1 ██████████████████████████████ |
hs/space-xxxxx-acc-2:     21.270  22‰  4.7 ██████████████████████████████▍ |
hs/space-xxxxx-foldl:     20.934   1‰  0.1 █████████████████████████████▉ |
hs/space-xxxxx-lenfil:    25.915   3‰  0.0 █████████████████████████████████████|
c/space-getchar:           9.354   0‰  0.0 █████████████▍ |
c/space-getchar-u:         1.676   2‰  0.2 ██▍ |
c/space-4k:                1.293   2‰  0.5 █▉ |
c/space-megabuf:           1.830   3‰  0.5 ██▋ |
c/space-getwchar:         14.721   1‰  0.1 █████████████████████ |
c/space-getwchar-u:        4.814   0‰  0.0 ██████▉ |
c/space-32k:               1.276   2‰  0.6 █▉ |
c/space-32k-8:             1.275   2‰  0.2 █▉ |

Memory:                     Peak
-------                       KB
hs/space-bs-c8-acc-1:     147488 ████████████████████████████████████████ |
hs/space-bs-c8-foldlx-1:  147492 ████████████████████████████████████████ |
hs/space-bs-c8-foldlx-2:  147492 ████████████████████████████████████████ |
hs/space-bs-c8-foldrx:    147488 ████████████████████████████████████████ |
hs/space-bs-c8-lenfil:    147492 ████████████████████████████████████████ |
hs/space-bslc8-acc-1:       2896 ▊ |
hs/space-bslc8-acc-2:       2896 ▊ |
hs/space-bslc8-acc-3:       2896 ▊ |
hs/space-bslc8-chunk-1:    65892 █████████████████▉ |
hs/space-bslc8-chunk-2:    65892 █████████████████▉ |
hs/space-bslc8-chunk-3:    76472 ████████████████████▊ |
hs/space-bslc8-foldl:      86772 ███████████████████████▋ |
hs/space-bslc8-foldlx-1:   86772 ███████████████████████▋ |
hs/space-bslc8-foldlx-2:   86772 ███████████████████████▋ |
hs/space-bslc8-foldr-1:   169360 ██████████████████████████████████████████████|
hs/space-bslc8-foldr-2:   169360 ██████████████████████████████████████████████|
hs/space-bslc8-lenfil-1:  110704 ██████████████████████████████▏ |
hs/space-bslc8-lenfil-2:  110704 ██████████████████████████████▏ |
hs/space-bsl---foldlx:     86776 ███████████████████████▋ |
hs/space-xxxxx-acc-1:       1612 ▌ |
hs/space-xxxxx-acc-2:       1612 ▌ |
hs/space-xxxxx-foldl:       1612 ▌ |
hs/space-xxxxx-lenfil:      1588 ▍ |
c/space-getchar:             384 ▏ |
c/space-getchar-u:           384 ▏ |
c/space-4k:                  412 ▏ |
c/space-megabuf:          146904 ███████████████████████████████████████▉ |
c/space-getwchar:            440 ▏ |
c/space-getwchar-u:          440 ▏ |
c/space-32k:                 436 ▏ |
c/space-32k-8:               436 ▏ |


[5/16] SBM: Support scripts and scriptlets
by Peter Firefly Brodersen Lund 22 Dec '07


Some of the scripts warrant a closer look.

'make zipdata' creates a nice tarball with all the data necessary to
recreate a report AND to merge that report together with other reports,
possibly with rescaled bar charts. Very handy.

All the files in the tarball are inside the 'ghc-measurements/' directory
so the risk of things going wrong when unpacking the tarball is smaller.

The names of the benchmarks are put in ghc-measurements/progs, mainly to
ensure they end up in the right order when regenerating and merging
reports.

tools/genreport.pl [list of benchmarks to put in the report]

  It doesn't parse the command line in any way because life is too short
  for command-line parsing. Instead, it is controlled via (too many)
  environment variables.

  ASCII   - set to avoid using UTF-8 for bar charts and the "per mille"
            character.
  NOSRC   - the tool normally creates */*.srctimespace files containing
            the source code for each benchmark with bar charts for
            time/mem appended to the end. Setting this variable switches
            that off (necessary when regenerating and merging reports).
  EXCLUDE - disregard some of the benchmarks on the command line. Why is
            this necessary? Because it makes regenerating and merging
            reports easier. And because I was too lazy to filter the
            command line in tools/regenreport.sh and tools/merge.pl.
  FINDMAX - used by tools/merge.pl when rescaling. Outputs max time and
            max mem to stdout instead of the normal report.
  MAX_FILEWIDTH
          - used by tools/merge.pl to make merged reports look nice
  MAX_TIME, MAX_PEAKMEM
          - used by tools/merge.pl when rescaling

  Note that strictly speaking, there is a bug in the script(s) because it
  conflates the width of the time/mem measurements represented as numbers
  (which you always want to take into account when merging) and
  MAX_TIME/MAX_PEAKMEM (which you only care about when rescaling).
  [FIXED now - 2007-12-21]

tools/regenreport.sh

  Unpacks a measurement tarball into a tmp directory and runs
  tools/genreport.pl to generate the report. Takes care not to disturb the
  normal files.

tools/merge.pl [tarballs]

  Uses tools/regenreport.sh on each tarball in turn to generate a report
  which it reads in and stores on a benchmark-by-benchmark basis. At the
  end, it synthetically combines all the pieces it cut out of the original
  report(s) into a brand-spanking new, merged report. Even the headers and
  the platforminfo at the top of each report are cut out and stored in
  data structures until they get spit out again at the end.

  The reading magic is in the state machine in gather(). It is not as bad
  as it looks. Some of the complications arise from marking repeated
  benchmark names as ' -- ', which improves the readability of the merged
  reports immensely. Another part of the complications arises from the
  fact that not all tarballs contain the exact same benchmarks! Those that
  don't get a nice 'n/a' instead of numbers and a bar. And finally, the
  benchmarks should be in the right order. That is trickier than it
  sounds...

  When rescaling, tools/regenreport.sh is first run once for each tarball
  with the FINDMAX environment variable set. This results in
  tools/regenreport.sh outputting the maximum filename width, time, and
  peakmem for each tarball.

  ASCII   - use ASCII instead of UTF-8
  RESCALE - sometimes you want to rescale and sometimes you don't
  MAX_FILEWIDTH
          - if you want to force a specific width
  MAX_TIME, MAX_PEAKMEM
          - if you want to force a specific max

-Peter
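
The rescaling of bar charts described above comes down to drawing every
bar against one shared maximum (MAX_TIME or MAX_PEAKMEM). The real work is
done in Perl by tools/genreport.pl, whose source is not shown in this
thread, so the following C function is only a sketch under assumptions: the
name render_bar is hypothetical, and rounding to the nearest eighth of a
cell is a guess that happens to reproduce rows from the reports in this
thread (1.892 s against a 25.915 s maximum in 37 cells gives two full
blocks plus a three-quarter block).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from tools/genreport.pl.
 * Renders `value` as a UTF-8 bar in which `max` fills `cells` character
 * cells, using the Unicode block elements U+2588 (full block) and
 * U+2589..U+258F (7/8 down to 1/8 of a cell).
 * Writes a NUL-terminated string to `out`, returns its length in bytes. */
size_t render_bar(double value, double max, int cells,
                  char *out, size_t outsize)
{
    assert(max > 0.0 && cells > 0);
    assert(outsize >= (size_t)cells * 3 + 1);  /* 3 bytes per glyph */

    /* Clamp, because a forced maximum may be smaller than the value. */
    if (value < 0.0) value = 0.0;
    if (value > max) value = max;

    /* Assumption: round to the nearest eighth of a cell. */
    long eighths = (long)(value / max * cells * 8 + 0.5);
    long full = eighths / 8;
    long rest = eighths % 8;

    size_t n = 0;
    for (long i = 0; i < full; i++) {
        out[n++] = (char)0xE2;      /* U+2588 is E2 96 88 in UTF-8 */
        out[n++] = (char)0x96;
        out[n++] = (char)0x88;
    }
    if (rest > 0) {
        out[n++] = (char)0xE2;      /* k eighths is U+2590 minus k */
        out[n++] = (char)0x96;
        out[n++] = (char)(0x90 - rest);
    }
    out[n] = '\0';
    return n;
}
```

Merging with rescaling would then call this with the largest time (or peak
memory) found across all tarballs as max, which is why the FINDMAX pass
has to run before any report is drawn.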


[4/16] SBM: How to use the Makefile (how to run benchmarks etc.)
by Peter Firefly Brodersen Lund 22 Dec '07


Introduction
------------
Most of the smarts of the benchmark harness are in the Makefile. If you
want to rerun the benchmarks (or a single benchmark) or look at the
intermediate code for a benchmark or the I/O trace or the memory
consumption or the time spent or ... then you use the Makefile.

There are some support scripts in shell and Perl (and two C programs) that
the Makefile uses to do its job. And there are some that you, the user,
will want to interact directly with.

The benchmarks are only expected to work on Linux. They have been tested
on SuSE 8.2 (from 2003, with a 2.4 kernel), Ubuntu 7.04, and Ubuntu 7.10.

Quick howto
-----------
make phase1  -- compiles, generates test files, measures memory use.
                Safe to run on a busy machine if there's no active memory
                pressure.

make phase2  -- timing runs. NOT safe to run on a busy machine. Should be
                run in runlevel 1 (= no X, no daemons, single-user mode)
                for best measurements.
                Outputs report at end. This is where you check the quality
                of the measurements. If you don't like them, run
                'make redophase2' (or delete the .time and .stat files with
                low quality and run 'make phase2' again.)

make zipdata -- make a tarball with all the measurements, suitable for
                emailing or putting on a website.

The Makefile will beep after phase 1 and 2.

The above will run a "NORMAL" run, which is fine during development if you
want to see if you nailed a performance bug. It runs reasonably fast
(about 43 seconds on my Athlon64).

If you want better measurements, you should use:

  make TESTKIND=THOROUGH phase1 phase2 zipdata

This will use a 150MB data file instead of a 15MB one and it will run the
timing measurements 6 times (before throwing the first away) instead of
4 times (before throwing the first away).

If you don't want to use single-user mode, you can improve the
measurements by piping the output to a file (or run the test from the
console) instead of involving a terminal and an X server (the screen
update may kick in in the middle of a timing run and disturb things if for
no other reason than their polluting the CPU caches).

Filesystem layout
-----------------
The benchmarks are in:
  hs/*.hs
  c/*.c
  hand/*.s

hand/*.hs and hand/*.c are not compiled. The two *.hs files are the
originals from which the tweaked assembly code has been derived. The two
*.c files are sketches of how the MMX tweaks work (because MMX code by
itself can be a bit off-putting).

These are the support scripts:
  tools/genfiles.pl    -- generate the test input files.
  tools/cutmem.pl
  tools/cutpid.pl      -- both are used to disentangle the outputs of
                          strace and pause-at-end (see below). I combine
                          strace, memory info, and +RTS -sstderr into a
                          single run to save time. This means that things
                          end up in fewer files than I'd like.
  tools/cut.pl         -- cut out main loop from disassembly
                          ('make discut')
  tools/stat.pl        -- looks at all timings for a single benchmark and
                          calculates average and standard deviation and
                          "time slack", that is the discrepancy between
                          user+sys and real. It optionally throws away the
                          first run.
  tools/eatmem.c       -- allocates a chunk of memory and makes damn sure
                          it really is in RAM!
  tools/pause-at-end.c -- part of a hack that copies /proc/self/maps and
                          /proc/self/status to stderr just before a
                          benchmark exits.
  tools/iosummary.pl   -- takes an strace and sums up the I/O
  tools/genreport.pl   -- generate a nice report with bar charts. Takes
                          way too many options in the form of environment
                          variables.
  tools/regenreport.sh -- regenerates the report from ANY measurement
                          tarball.
  tools/merge.pl       -- merge data from many measurement tarballs, with
                          or without rescaling.
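
To make the three numbers from tools/stat.pl concrete, here is a minimal C
sketch of the same kind of summary. It is not a port of the Perl script,
whose source is not shown here. Everything specific is an assumption made
for illustration: the names (struct run, summarize), averaging the wall
clock time, using the population standard deviation expressed in per mille
of the average, and reporting slack as the mean of real - (user + sys) in
seconds. The small Newton iteration only avoids a dependency on libm.

```c
#include <assert.h>
#include <stddef.h>

/* One timing run, in seconds (hypothetical layout). */
struct run { double real, user, sys; };

/* Summary of all runs for one benchmark (hypothetical layout). */
struct stat_summary {
    double avg;           /* mean wall clock time                 */
    double dev_permille;  /* std. deviation, in per mille of avg  */
    double slack;         /* mean of real - (user + sys)          */
};

/* Square root by Newton's method, so the sketch needs no -lm. */
static double sqrt_newton(double x)
{
    if (x <= 0.0) return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 60; i++) r = 0.5 * (r + x / r);
    return r;
}

/* Summarize n runs; if skip_first is nonzero the first (sacrificial)
 * run is ignored, as the harness does outside of smoke tests. */
struct stat_summary summarize(const struct run *runs, size_t n,
                              int skip_first)
{
    size_t first = skip_first ? 1 : 0;
    assert(n > first);
    size_t m = n - first;

    double sum = 0.0, slack = 0.0;
    for (size_t i = first; i < n; i++) {
        sum   += runs[i].real;
        slack += runs[i].real - (runs[i].user + runs[i].sys);
    }
    double avg = sum / (double)m;

    double var = 0.0;
    for (size_t i = first; i < n; i++) {
        double d = runs[i].real - avg;
        var += d * d;
    }
    var /= (double)m;   /* assumption: population variance */

    struct stat_summary s;
    s.avg = avg;
    s.dev_permille = avg > 0.0 ? 1000.0 * sqrt_newton(var) / avg : 0.0;
    s.slack = slack / (double)m;
    return s;
}
```

With a NORMAL run (4 timings, first thrown away) a cold first run does not
affect the average, which is the reason for making that run sacrificial.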

Generated files:
  hs/*.core hs/*.stg hs/*.cmm hs/*.s
                   -- intermediate code
  hs/*.hi          -- "Haskell Interface"
  */*.o            -- object code
  */* (the files in $(HSPROGS) $(CPROGS) $(HANDPROGS))
                   -- programs
  */*.dis */*.discut
                   -- disassembled programs (and inner loops)
  */*.doc          -- source + intermediate code + inner loops + timings
  */*.mem          -- output from '+RTS -sstderr' + /proc/self/status +
                      /proc/self/maps + output from /usr/bin/time (where
                      the number of minor page faults is the most
                      interesting datum)
  */*.strace       -- complete strace, taken together with */*.mem
  */*.iotrace      -- only I/O operations from the strace
                      (read/write/select)
  */*.iosum        -- summary of I/O operations
  */*.time         -- time measurements
  */*.stat         -- average + std.dev. + "time slack"
  */*.srctimespace -- source code + time/mem barchart (in ASCII)
  sysinfo          -- description of the platform (uname, ghc, gcc, etc)
  platforminfo     -- short description of the platform
  report.txt [8-16K]
  docs [1MB]       -- sysinfo + all */*.doc concatenated

Makefile targets
----------------
This is taken from 'make help':

  phase1 -- preparation + measurements that can run in background
  phase2 -- measurements that should run on unloaded machine
  redophase2 -- rerun phase2

  doc, [ASCII=1] report, lastreport - reports
  zipdata -- zip up measurements (to ghc-measurements.tar.gz)

  prog,core,stg,cmm,asm,dis,discut
    -- compile, compile to core/stg/cmm/asm, disassemble, cut out main loop
  time,stat,mem,strace,iotrace,iosum,cache
    -- measure run-time, GHC heap + OS mem, syscalls, I/O patterns, cache

  cleartime, clean, distclean -- delete measurements etc

  TESTKIND=(SMOKETEST,NORMAL,THOROUGH), defaults to NORMAL
  STRACE=OLD, defaults to NEW

Necessary tools
---------------
Perl, sed, /usr/bin/time, bash (doesn't have to be the default shell as
long as it is in PATH), strace, GNU Make, objdump (from the binutils
package), gcc.

Other things that could come in handy:
  A console and/or terminal that understands UTF-8.
  A less that understands UTF-8.
  An editor that understands UTF-8.
  A %!&# printing program that understands both UTF-8 and fonts.
    A2ps doesn't do UTF-8.
    Uniprint used 1) a proportional font which 2) didn't even have all the
    fractional-width blocks.
    U2ps used a font with none of the block characters.
    I ended up resorting to gedit's print function :(

That's enough for this email.

-Peter


[3/16] SBM: The Makefile
by Peter Firefly Brodersen Lund 22 Dec '07


This is the entire Makefile. It perhaps ought to be sent as an attachment
but my hacky mailer script wouldn't like it.

A few of the lines are wider than 80 columns, unfortunately.

-Peter

# GHC benchmarks of parsing (bytestring, basic code generation, I/O).
#
# Copyright 2007 Peter Lund <firefly(a)vax64.dk>, licensed under GPLv2.
#
#
# You will need the following tools:
#  perl, strace, /usr/bin/time, bash, a gcc that uses shared libraries and libc
#  (not dietlibc, klibc, uclibc), objdump (usually found in the binutils
#  package).
#
# A benchmark run on a new platform can be split into two phases:
#
#  phase1: Compiles, dices, and slices the code in various ways. If this
#          completes you can be pretty sure that everything works all right.
#          Performs non-timing sensitive measurements.
#
#  phase2: Performs timing sensitive measurements. It is a good idea to run
#          this phase on an idle machine, preferably without using X.
#          For example, you can log out of X and run "telinit 1" to get to
#          single-user mode. If you do so, please remember to set the correct
#          path to your ghc compiler.

# Make a few bits of the makefile less noisy.
Q:=@

#########################################################

# Ask ghc to optimize and warn
GHCFLAGS= -O2 -W

# Some newer versions of gcc prefer -Wextra -Wall
GCCWARNFLAGS=-W -Wall

# Default compilers
CC=gcc
GHC=ghc
GHCPKG=ghc-pkg

#########################################################

HSPROGS=hs/byte-bs----acc \
	hs/byte-bs----foldlx \
	hs/byte-bs----foldrx \
	hs/byte-bsl---acc \
	hs/byte-xxxxx-acc-1 \
	hs/byte-xxxxx-acc-2 \
	hs/byte-xxxxx-foldl \
	\
	hs/space-bs-c8-acc-1 \
	hs/space-bs-c8-count \
	hs/space-bs-c8-foldlx-1 \
	hs/space-bs-c8-foldlx-2 \
	hs/space-bs-c8-foldrx \
	hs/space-bs-c8-lenfil \
	hs/space-bslc8-acc-1 \
	hs/space-bslc8-acc-2 \
	hs/space-bslc8-acc-3 \
	hs/space-bslc8-chunk-1 \
	hs/space-bslc8-chunk-2 \
	hs/space-bslc8-chunk-3 \
	hs/space-bslc8-chunk-4 \
	hs/space-bslc8-count \
	hs/space-bslc8-foldl \
	hs/space-bslc8-foldlx-1 \
	hs/space-bslc8-foldlx-2 \
	hs/space-bslc8-foldr-1 \
	hs/space-bslc8-foldr-2 \
	hs/space-bslc8-lenfil-1 \
	hs/space-bslc8-lenfil-2 \
	hs/space-bsl---foldlx \
	hs/space-xxxxx-acc-1 \
	hs/space-xxxxx-acc-2 \
	hs/space-xxxxx-foldl \
	hs/space-xxxxx-lenfil

# RMPROGS keeps track of programs that are not always included in the tests.
# We do want 'make clean' to delete them even when they are not currently
# part of the build (they may be left over from a previous build).

# stack overflow with long4.
#HSPROGS:=$(HSPROGS) hs/byte-xxxxx-foldr-1
RMPROGS:=$(RMPROGS) hs/byte-xxxxx-foldr-1

# stack overflow with long4.
#HSPROGS:=$(HSPROGS) hs/byte-xxxxx-foldr-2
RMPROGS:=$(RMPROGS) hs/byte-xxxxx-foldr-2

# stack overflow with long4.
#HSPROGS:=$(HSPROGS) hs/space-xxxxx-foldr-1
RMPROGS:=$(RMPROGS) hs/space-xxxxx-foldr-1

# stack overflow with long4.
#HSPROGS:=$(HSPROGS) hs/space-xxxxx-foldr-2
RMPROGS:=$(RMPROGS) hs/space-xxxxx-foldr-2

HANDPROGS= hand/byte-bs----acc-a \
	hand/byte-bs----acc-b \
	hand/byte-bs----acc-c \
	hand/byte-bs----acc-d \
	\
	hand/space-bs-c8-acc-1-a \
	hand/space-bs-c8-acc-1-b \
	hand/space-bs-c8-acc-1-c \
	hand/space-bs-c8-acc-1-d \
	hand/space-bs-c8-acc-1-e \
	hand/space-bs-c8-acc-1-f \
	hand/space-bs-c8-acc-1-g \
	hand/space-bs-c8-acc-1-h \
	hand/space-bs-c8-acc-1-i \
	hand/space-bs-c8-acc-1-j \
	hand/space-bs-c8-acc-1-k \
	hand/space-bs-c8-acc-1-l \
	hand/space-bs-c8-acc-1-m \
	hand/space-bs-c8-acc-1-n \
	hand/space-bs-c8-acc-1-o \
	hand/space-bs-c8-acc-1-p \
	hand/space-bs-c8-acc-1-q \
	hand/space-bs-c8-acc-1-r \
	hand/space-bs-c8-acc-1-s

RMPROGS:=$(RMPROGS) $(HANDPROGS)

ifeq ($(shell $(GHCPKG) list | grep bytestring),)
# ghc 6.6.1 with an old version of bytestring in 'base' but without its own
# module name
HSPROGS:=$(shell printf "%s\n" $(HSPROGS) | grep -v '.*-chunk-.*')
endif

HANDTEXT:=including hand-tweaked assembly
ifeq ($(shell $(GHC) --version | grep 6.9.20071119),)
HANDPROGS:=
HANDTEXT:=no hand-tweaked assembly
endif

ifneq ($(SUFFIX),)
HANDPROGS:=
HANDTEXT:=no hand-tweaked assembly
endif

CPROGS= c/byte-getchar c/byte-getchar-u c/byte-4k \
	\
	c/space-getchar c/space-getchar-u c/space-4k \
	c/space-megabuf c/space-getwchar c/space-getwchar-u \
	c/space-32k c/space-32k-8

#########################################################

# The benchmarks can be run in three modes. The default can be overridden from
# command line:
#
#   make TESTKIND=SMOKETEST phase1
#

# just tests the test suite, as fast as possible
#TESTKINDDEFAULT=SMOKETEST

# small test
TESTKINDDEFAULT=NORMAL

# very thorough test
#TESTKINDDEFAULT=THOROUGH

TESTKIND=$(TESTKINDDEFAULT)

ifeq ($(TESTKIND),THOROUGH)
  TESTFILE= testfiles/long4
  TESTFILECACHE= testfiles/long3
else
  ifeq ($(TESTKIND),NORMAL)
    TESTFILE= testfiles/long3
    TESTFILECACHE= testfiles/long2
  else
    ifeq ($(TESTKIND),SMOKETEST)
      TESTFILE= testfiles/long2
      TESTFILECACHE= testfiles/long1
    endif
  endif
endif

# Older versions of strace don't support the -E parameter which we use to
# set LD_PRELOAD before running the straced command (so we can get by with
# a single run of each benchmark in phase 1 instead of 2).
#
# Override with STRACE=OLD on the command line if you need to work with an
# old strace.
STRACE=NEW

#########################################################

.PHONY: XXXXFIRST testfiles core stg cmm asm dis discut prog \
	time stat strace iotrace iosum mem cache doc phase1 phase2 redophase2

XXXXFIRST: help

testfiles: testfiles/long1 testfiles/long2 testfiles/long3 testfiles/long4

core:	$(addsuffix .core ,$(HSPROGS) )
stg:	$(addsuffix .stg ,$(HSPROGS) )
cmm:	$(addsuffix .cmm ,$(HSPROGS) )
asm:	$(addsuffix .s ,$(HSPROGS) $(CPROGS))
dis:	$(addsuffix .dis ,$(HSPROGS) $(HANDPROGS) $(CPROGS))
discut:	$(addsuffix .discut ,$(HSPROGS) $(HANDPROGS) $(CPROGS))

prog:	$(HSPROGS) $(HANDPROGS) $(CPROGS)

time:	prog \
	testfiles \
	$(addsuffix .time ,$(HSPROGS) $(HANDPROGS) $(CPROGS))
stat:	$(addsuffix .stat ,$(HSPROGS) $(HANDPROGS) $(CPROGS))
strace:	$(addsuffix .strace ,$(HSPROGS) $(HANDPROGS) $(CPROGS))
iotrace:$(addsuffix .iotrace,$(HSPROGS) $(HANDPROGS) $(CPROGS))
iosum:	$(addsuffix .iosum ,$(HSPROGS) $(HANDPROGS) $(CPROGS))
mem:	$(addsuffix .mem ,$(HSPROGS) $(HANDPROGS) $(CPROGS))
cache:	$(addsuffix .cache ,$(HSPROGS) $(HANDPROGS) $(CPROGS))

doc:	$(addsuffix .doc ,$(HSPROGS) $(HANDPROGS) $(CPROGS))

###

phase1: testfiles tools/eatmem prog iosum mem
	printf "Done!\07" # beep

phase2: time report
	printf "Done!\07" # beep

redophase2: cleartime
	rm -f */*.srctimespace report.txt
	$(MAKE) phase2

#########################################################

testfiles/long1:
	mkdir -p testfiles
	tools/genfiles.pl 10000 > "$@"

testfiles/long2:
	mkdir -p testfiles
	tools/genfiles.pl 100000 > "$@"

testfiles/long3:
	mkdir -p testfiles
	tools/genfiles.pl 1000000 > "$@"

testfiles/long4:
	mkdir -p testfiles
	tools/genfiles.pl 10000000 > "$@"

#########################################################

tools/eatmem: tools/eatmem.c
	$(CC) $(GCCWARNFLAGS) -O2 "$<" -o "$@"

tools/pause-at-end.so: tools/pause-at-end.c
	$(CC) $(GCCWARNFLAGS) -shared -ldl "$<" -o "$@"

#########################################################

docs: core stg cmm asm discut time iotrace doc sysinfo
	rm -f docs
	cat */*.doc sysinfo > docs

hs/%.doc: hs/%.core hs/%.stg hs/%.cmm hs/%.s hs/%.discut hs/%.time
	(export F="$(basename $@)" ; \
	 printf "\n" ; \
	 printf "*********************************************\n" ; \
	 printf "****\n" ; \
	 printf "**** %s:\n" "$$F" ; \
	 printf "****\n" ; \
	 printf "*********************************************\n\n" ; \
	 printf "Haskell code:\n\n" ; \
	 cat "$$F.hs" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.core" ; \
	 cat "$$F.core" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.stg" ; \
	 cat "$$F.stg" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.cmm" ; \
	 cat "$$F.cmm" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.s" ; \
	 cat "$$F.s" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.discut" ; \
	 cat "$$F.discut" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.time" ; \
	 cat "$$F.time" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "\014" ; \
	) >> "$@"

hand/%.doc: hand/%.discut hand/%.time
	(export F="$(basename $@)" ; \
	 printf "\n" ; \
	 printf "*********************************************\n" ; \
	 printf "****\n" ; \
	 printf "**** %s:\n" "$$F" ; \
	 printf "****\n" ; \
	 printf "*********************************************\n\n" ; \
	 printf "Haskell code:\n\n" ; \
	 export X=`echo $$F | sed -e 's/[a-z]$$//'` ; \
	 cat "$$X.hs" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s (hand tweaked):\n" "$$F.s" ; \
	 cat "$$F.s" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.discut" ; \
	 cat "$$F.discut" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.time" ; \
	 cat "$$F.time" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "\014" ; \
	) >> "$@"

c/%.doc: c/%.s c/%.discut c/%.time
	(export F="$(basename $@)" ; \
	 printf "\n" ; \
	 printf "*********************************************\n" ; \
	 printf "****\n" ; \
	 printf "**** %s:\n" "$$F" ; \
	 printf "****\n" ; \
	 printf "*********************************************\n\n" ; \
	 printf "C code:\n\n" ; \
	 cat "$$F.c" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.s" ; \
	 cat "$$F.s" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.discut" ; \
	 cat "$$F.discut" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "%s:\n" "$$F.time" ; \
	 cat "$$F.time" ; \
	 printf -- "-------------------------------\n" ; \
	 printf "\014" ; \
	) >> "$@"

#########################################################

.PHONY: report.txt lastreport

report: report.txt
	$(Q)cat report.txt

report.txt \
$(addsuffix .srctimespace,$(HSPROGS) $(HANDPROGS) $(CPROGS)): \
		iosum mem time stat platforminfo
	$(Q)tools/genreport.pl $(HSPROGS) $(HANDPROGS) $(CPROGS) > report.txt

lastreport:
	$(Q)cat report.txt

#########################################################

# Probably all or most of the targets (left-hand sides) in these rules should
# be mentioned in a .SECONDARY rule so make won't delete them behind our backs,
# in its infinite wisdom. This is sometimes necessary when using pattern
# rules (i.e. rules with '%' wildcards in them).
#
# For some reason, it doesn't seem to be all that necessary, although I had to
# insert a couple of those .SECONDARY things earlier to make make behave. For
# example, I had to insert this rule at some point but now things keep working
# even when it's commented out:
#
# .SECONDARY: $(HSPROGS) $(HANDPROGS) $(CPROGS)
#

hs/%: hs/%.hs
	$(GHC) $(GHCFLAGS) --make -fforce-recomp "$<" -o "$@"

hand/%: hand/%.s
	$(GHC) -no-hs-main "$<" -o "$@" -package bytestring

c/%: c/%.c
	$(CC) $(GCCWARNFLAGS) -O2 "$<" -o "$@"

###

%.dis: %
	@# Limit the disassembly for speed reasons (10x+ difference) and
	@# file size reasons (20x-30x difference).
	@# The stuff we are interested in comes early in the .text segment so
	@# there's no reason to disassemble the entire runtime system, which
	@# comes afterwards in case of hand/ and hs/ binaries.
	objdump -M intel -D --stop-address=0x08060000 "$<" > "$@"

###

%.discut: %.dis
	tools/cut.pl < "$<" > "$@"

###

%.core: %.hs
	$(GHC) $(GHCFLAGS) -c -ddump-simpl "$<" > "$@"

###

%.stg: %.hs
	$(GHC) $(GHCFLAGS) -c -ddump-stg "$<" > "$@"

###

%.cmm: %.hs
	$(GHC) $(GHCFLAGS) -c -ddump-cmm "$<" > "$@"

###

%.s: %.hs
	$(GHC) $(GHCFLAGS) -c -fforce-recomp -keep-s-files "$<"

.SECONDARY: $(addsuffix .s,$(CPROGS))
%.s: %.c
	$(CC) $(GCCWARNFLAGS) -O2 -S $< -o $@

#########################################################

# The first run is sacrificial, except when smoketesting where there only is
# one run.
ifeq ($(TESTKIND),THOROUGH)
  TIME= bash -c "time $<" < $(TESTFILE); \
        bash -c "time $<" < $(TESTFILE); \
        bash -c "time $<" < $(TESTFILE); \
        bash -c "time $<" < $(TESTFILE); \
        bash -c "time $<" < $(TESTFILE); \
        bash -c "time $<" < $(TESTFILE)
  NOSKIP:=
else
  ifeq ($(TESTKIND),NORMAL)
    TIME= bash -c "time $<" < $(TESTFILE); \
          bash -c "time $<" < $(TESTFILE); \
          bash -c "time $<" < $(TESTFILE); \
          bash -c "time $<" < $(TESTFILE)
    NOSKIP:=
  else
    ifeq ($(TESTKIND),SMOKETEST)
      TIME= bash -c "time $<" < $(TESTFILE)
      NOSKIP:= NOSKIP=1
    endif
  endif
endif

# In order to reduce the risk of swapping during the time test, we try to make
# sure there's twice the test file size free (and a bit).
# NEEDFREE is measured in kilobytes.
NEEDFREE=$(shell expr 22 '*' `ls -s $(TESTFILE) | cut -f1 -d' '` / 10)

%.time: % tools/eatmem $(TESTFILE)
	printf "%s\n" "$<" > "$@"
	tools/eatmem $(NEEDFREE)
	dd if=$(TESTFILE) of=/dev/null
	dd if="$<" of=/dev/null
	($(TIME)) >>"$@" 2>&1
	printf "%s\n\n" "-----" >> "$@"

%.stat: %.time
	$(NOSKIP) tools/stat.pl < "$<" > "$@"

%.mem %.strace: % tools/pause-at-end.so $(TESTFILE)
ifeq ($(STRACE),OLD)
	strace -o tmp.strace -f \
	  /usr/bin/time "$<" +RTS -sstderr < $(TESTFILE) > $(basename $@).mem 2>&1
	LD_PRELOAD=tools/pause-at-end.so \
	  "$<" +RTS -sstderr < $(TESTFILE) >> $(basename $@).mem 2>&1
else
	strace -o tmp.strace -ELD_PRELOAD=tools/pause-at-end.so -f \
	  /usr/bin/time "$<" +RTS -sstderr < $(TESTFILE) > $(basename $@).mem 2>&1
endif
	tools/cutmem.pl < $(basename $@).mem > tmp
	mv tmp $(basename $@).mem
	tools/cutpid.pl < tmp.strace > $(basename $@).strace
	rm -f tmp.strace

#%.strace: %.mem
#	@echo > /dev/null

%.iotrace: %.strace
	grep '^$read\|write\|select$' "$<" > "$@"

%.iosum: %.iotrace
	tools/iosummary.pl < "$<" > "$@"

%.cache: % $(TESTFILECACHE)
	valgrind --tool=cachegrind "$<" < $(TESTFILECACHE) 2> "$@"

#########################################################

.PHONY: zipdata help cleartime clean distclean

sysinfo:
	hostname > sysinfo
	cat /etc/*release >> sysinfo
	@echo >> sysinfo
	uname -a >> sysinfo
	@echo >> sysinfo
	cat /proc/cpuinfo >> sysinfo
	$(GHC) --version >> sysinfo
	echo >> sysinfo
	$(CC) --version >> sysinfo

# This variable makes testing with weird /proc/cpuinfo files easier
CPUINFO=/proc/cpuinfo

platforminfo:
	hostname > platforminfo
	(printf 'ghc '; ($(GHC) --version | sed -ne 's/^The.*version //p')) >> platforminfo
	cat $(CPUINFO) | sed -ne '/model name.*:/ { s/model name.*: //p; q}' >> platforminfo
	printf "%s MHz\n" `cat $(CPUINFO) | sed -ne '/cpu MHz.*:/ { s/cpu MHz.*: //p; q}'` >> platforminfo
	printf "TESTKIND=$(TESTKIND)\n" >> platforminfo
	printf "SUFFIX=$(SUFFIX)\n" >> platforminfo

zipdata: time stat mem strace iotrace iosum sysinfo report.txt
	rm -f ghc-measurements.tar.gz
	rm -rf ghc-measurements
	mkdir -p ghc-measurements
	cp --parents \
	   $(addprefix */*, .time .stat .mem .iosum) sysinfo report.txt platforminfo \
	   ghc-measurements
	printf "%s " "$(HSPROGS)" "$(HANDPROGS)" "$(CPROGS)" > ghc-measurements/progs
	tar -zcf ghc-measurements.tar.gz ghc-measurements
	rm -rf ghc-measurements

help:
	@echo 'Measurements of very simple string I/O and parsing.'
	@printf ' (%d benchmarks, %s)\n' `echo $(HSPROGS) $(HANDPROGS) $(CPROGS) | wc -w` "$(HANDTEXT)"
	@echo ''
	@echo ' phase1 -- preparation + measurements that can run in background'
	@echo ' phase2 -- measurements that should run on unloaded machine'
	@echo ' redophase2 -- rerun phase2'
	@echo ''
	@echo ' doc, [ASCII=1] report, lastreport - reports'
	@echo ' zipdata -- zip up measurements (to ghc-measurements.tar.gz)'
	@echo ''
	@echo ' prog,core,stg,cmm,asm,dis,discut'
	@echo ' -- compile, compile to core/stg/cmm/asm, disassemble, cut out main loop'
	@echo ' time,stat,mem,strace,iotrace,iosum,cache'
	@echo ' -- measure run-time, GHC heap + OS mem, syscalls, I/O patterns, cache'
	@echo ''
	@echo ' cleartime, clean, distclean -- delete measurements etc'
	@echo ''
	@echo ' TESTKIND=(SMOKETEST,NORMAL,THOROUGH), defaults to $(TESTKINDDEFAULT)'
	@echo ' STRACE=OLD, defaults to NEW'

cleartime:
	rm -f */*.time

clean:
	# keep and hand/*.s !
	rm -rf */*.hi */*.o *.o \
	       */*.core */*.stg */*.cmm hs/*.s c/*.s */*.dis */*.discut \
	       */*.hcr \
	       */*.time */*.stat \
	       */*.real \
	       */*.strace */*.iotrace */*.iosum \
	       */*.mem */*.cache cachegrind.out.* \
	       */*.doc */*.srctimespace \
	       tmp.strace tmp \
	       tools/eatmem tools/pause-at-end.so \
	       $(HSPROGS) $(CPROGS) $(HANDPROGS) $(RMPROGS) a.out \
	       testfiles/ \
	       ghc-measurements/ \
	       sysinfo platforminfo docs xx.ps

distclean: clean
	rm -f *~ */*~ report.txt ghc-measurements.tar.gz
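
The %.time rule in the Makefile runs 'tools/eatmem $(NEEDFREE)' before each
set of timing runs, where NEEDFREE is 2.2 times the size of the test file
in kilobytes. The source of tools/eatmem.c is not part of this email, so
the following is only a sketch of one way such a tool could work, assuming
that "really in RAM" is achieved by writing to every page of the
allocation. The function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Mirrors the Makefile's integer arithmetic: expr 22 '*' SIZE / 10. */
size_t need_free_kb(size_t testfile_kb)
{
    return 22 * testfile_kb / 10;
}

/* Hypothetical stand-in for tools/eatmem: allocate `kb` kilobytes and
 * write one byte to every page so the kernel has to back the whole
 * allocation with real memory. Returns the number of pages touched,
 * or 0 if the allocation failed. */
size_t eat_kilobytes(size_t kb)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    size_t bytes = kb * 1024;
    /* volatile keeps the compiler from dropping the stores */
    volatile unsigned char *p = malloc(bytes);
    if (p == NULL)
        return 0;

    size_t touched = 0;
    for (size_t off = 0; off < bytes; off += (size_t)page) {
        p[off] = 1;
        touched++;
    }

    free((void *)p);
    return touched;
}
```

The idea, as far as the Makefile comment explains it, is that forcing this
much memory to be resident and then releasing it leaves enough free RAM
for the test file and the benchmark, so the timing runs are less likely to
swap.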


[2/16] SBM: Inner loops of the hand-tweaked assembly benchmarks
by Peter Firefly Brodersen Lund 22 Dec '07


I've taken the two benchmarks byte-bs----acc and space-bs-c8-acc-1 and
gradually tweaked their inner loops from something that used memory all
the time to something that used registers more and more efficiently.

I've done this gradually, pretty much one register at a time. Along the
way, I've also done a simple common subexpression/loop hoisting thing in
which I combined the pointer to the start of the string and the index into
the string into a single pointer. Doing this in real life may cause bad
problems with the garbage collector.

At the end, I go a bit mad and start doing heroic optimizations (reading
four bytes at a time, using MMX registers to read 8 bytes at a time,
twisted MMX math to keep 8 space counters in an MMX register + a bit of
loop unrolling).

Here follow first the two original inner loops and then the 23
hand-tweaked versions.

I used the following shell code to isolate the inner loops:

(for F in hs/byte-bs----acc.s hs/space-bs-c8-acc-1.s hand/*.s ; \
 do echo "------------------------------"; \
    echo "$F:"; \
    echo ; \
    cat "$F" | perl -e 'while(<>){ if (/Main_zdwcnt_info:/ .. /.section .data/) { print; }}' | head -n-1; \
 done; \
 echo "=============================="; \
) > xx.txt

-Peter

------------------------------
hs/byte-bs----acc.s:

Main_zdwcnt_info:
.LcYL:
	cmpl $0,16(%ebp)
	jle .LcYO
	movl 12(%ebp),%eax
	incl %eax
	movl (%ebp),%ecx
	incl %ecx
	subl $1,16(%ebp)
	movl %eax,12(%ebp)
	movl %ecx,(%ebp)
	jmp Main_zdwcnt_info
.LcYO:
	movl (%ebp),%esi
	addl $20,%ebp
	jmp *(%ebp)
------------------------------
hs/space-bs-c8-acc-1.s:

Main_zdwcnt_info:
.Lc16u:
	cmpl $0,16(%ebp)
	jle .Lc16x
	movl 4(%ebp),%eax
	movl 12(%ebp),%ecx
	movzbl (%eax,%ecx,1),%eax
	cmpl $32,%eax
	jne .Lc16F
	movl 12(%ebp),%eax
	incl %eax
	movl (%ebp),%ecx
	incl %ecx
	subl $1,16(%ebp)
	movl %eax,12(%ebp)
	movl %ecx,(%ebp)
	jmp Main_zdwcnt_info
.Lc16x:
	movl (%ebp),%esi
	addl $20,%ebp
	jmp *(%ebp)
.Lc16F:
	movl 12(%ebp),%eax
	incl %eax
	subl $1,16(%ebp)
	movl %eax,12(%ebp)
	jmp Main_zdwcnt_info
------------------------------
hand/byte-bs----acc-a.s:

Main_zdwcnt_info:
.LcYN:
	cmpl $0,16(%ebp)
	jle .LcYQ
	movl 00(%ebp),%ecx
	movl 12(%ebp),%eax
	movl 16(%ebp),%edx
	incl %ecx
	incl %eax
	decl %edx
	movl %ecx,00(%ebp)
	movl %eax,12(%ebp)
	movl %edx,16(%ebp)
	jmp Main_zdwcnt_info
.LcYQ:
	movl (%ebp),%esi
	addl $20,%ebp
	jmp *(%ebp)
------------------------------
hand/byte-bs----acc-b.s:

Main_zdwcnt_info:
.LcYN:
	cmpl $0,16(%ebp)
	jle .LcYQ
	movl 00(%ebp),%ecx
	movl 12(%ebp),%eax
	movl 16(%ebp),%edx
.L_again:
	cmpl $0,%edx
	jle .L_out
	incl %ecx
	incl %eax
	decl %edx
	jmp .L_again
.L_out:
	movl %ecx,00(%ebp)
	movl %eax,12(%ebp)
	movl %edx,16(%ebp)
	jmp Main_zdwcnt_info
.LcYQ:
	movl (%ebp),%esi
	addl $20,%ebp
	jmp *(%ebp)
------------------------------
hand/byte-bs----acc-c.s:

Main_zdwcnt_info:
.LcYN:
	cmpl $0,16(%ebp)
	jle .LcYQ
	movl 00(%ebp),%ecx
	movl 12(%ebp),%eax
	movl 16(%ebp),%edx
	cmpl $0,%edx
	jle .L_out
.L_again:
	incl %ecx
	incl %eax
	decl %edx
	cmpl $0,%edx
	jg .L_again
.L_out:
	movl %ecx,00(%ebp)
	movl %eax,12(%ebp)
	movl %edx,16(%ebp)
	jmp Main_zdwcnt_info
.LcYQ:
	movl (%ebp),%esi
	addl $20,%ebp
	jmp *(%ebp)
------------------------------
hand/byte-bs----acc-d.s: Main_zdwcnt_info: .LcYN: cmpl $0,16(%ebp) jle .LcYQ movl 00(%ebp),%ecx movl 12(%ebp),%eax movl 16(%ebp),%edx cmpl $0,%edx jle .L_out .align 16 .L_again: incl %ecx incl %eax decl %edx cmpl $0,%edx jg .L_again .L_out: movl %ecx,00(%ebp) movl %eax,12(%ebp) movl %edx,16(%ebp) jmp Main_zdwcnt_info .LcYQ: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-a.s: Main_zdwcnt_info: .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H movl 12(%ebp),%eax incl %eax movl (%ebp),%ecx incl %ecx subl $1,16(%ebp) movl %eax,12(%ebp) movl %ecx,(%ebp) jmp Main_zdwcnt_info .Lc16H: movl 12(%ebp),%eax incl %eax subl $1,16(%ebp) movl %eax,12(%ebp) jmp Main_zdwcnt_info .Lc16z: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-b.s: Main_zdwcnt_info: .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax je .Lc16H movl 12(%ebp),%eax incl %eax subl $1,16(%ebp) movl %eax,12(%ebp) jmp Main_zdwcnt_info .Lc16H: movl 12(%ebp),%eax incl %eax movl (%ebp),%ecx incl %ecx subl $1,16(%ebp) movl %eax,12(%ebp) movl %ecx,(%ebp) jmp Main_zdwcnt_info .Lc16z: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-c.s: Main_zdwcnt_info: .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H movl (%ebp),%ecx incl %ecx movl 12(%ebp),%eax incl %eax movl %ecx,(%ebp) movl %eax,12(%ebp) subl $1,16(%ebp) jmp Main_zdwcnt_info .Lc16z: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) .Lc16H: movl 12(%ebp),%eax incl %eax movl %eax,12(%ebp) subl $1,16(%ebp) jmp Main_zdwcnt_info ------------------------------ hand/space-bs-c8-acc-1-d.s: Main_zdwcnt_info: .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H 
addl $1,(%ebp) addl $1,12(%ebp) subl $1,16(%ebp) jmp Main_zdwcnt_info .Lc16z: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) .Lc16H: addl $1,12(%ebp) subl $1,16(%ebp) jmp Main_zdwcnt_info ------------------------------ hand/space-bs-c8-acc-1-e.s: Main_zdwcnt_info: .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H movl 12(%ebp),%eax incl %eax incl %ecx movl (%ebp),%eax incl %eax subl $1,16(%ebp) movl %ecx,12(%ebp) movl %eax,(%ebp) jmp Main_zdwcnt_info .Lc16z: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) .Lc16H: incl %ecx subl $1,16(%ebp) movl %ecx,12(%ebp) jmp Main_zdwcnt_info ------------------------------ hand/space-bs-c8-acc-1-f.s: Main_zdwcnt_info: .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H incl %ecx subl $1,16(%ebp) addl $1,(%ebp) movl %ecx,12(%ebp) jmp Main_zdwcnt_info .Lc16z: movl (%ebp),%esi addl $20,%ebp jmp *(%ebp) .Lc16H: incl %ecx subl $1,16(%ebp) movl %ecx,12(%ebp) jmp Main_zdwcnt_info ------------------------------ hand/space-bs-c8-acc-1-g.s: Main_zdwcnt_info: movl (%ebp),%esi .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movl 12(%ebp),%ecx movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H incl %ecx subl $1,16(%ebp) inc %esi movl %ecx,12(%ebp) jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) .Lc16H: incl %ecx subl $1,16(%ebp) movl %ecx,12(%ebp) jmp .Lc16w ------------------------------ hand/space-bs-c8-acc-1-h.s: Main_zdwcnt_info: movl (%ebp),%esi movl 12(%ebp),%ecx .Lc16w: cmpl $0,16(%ebp) jle .Lc16z movl 4(%ebp),%eax movzbl (%eax,%ecx,1),%eax cmpl $32,%eax jne .Lc16H incl %ecx subl $1,16(%ebp) inc %esi jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) .Lc16H: incl %ecx subl $1,16(%ebp) jmp .Lc16w ------------------------------ hand/space-bs-c8-acc-1-i.s: Main_zdwcnt_info: movl (%ebp),%esi movl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z movl 4(%ebp),%eax movzbl (%eax,%ecx,1),%eax 
cmpl $32,%eax jne .Lc16H incl %ecx decl %edx inc %esi jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) .Lc16H: incl %ecx decl %edx jmp .Lc16w ------------------------------ hand/space-bs-c8-acc-1-j.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax cmpl $32,%eax jne .Lc16H incl %ecx decl %edx inc %esi jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) .Lc16H: incl %ecx decl %edx jmp .Lc16w ------------------------------ hand/space-bs-c8-acc-1-k.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax cmpl $32,%eax jne .Lc16H incl %ecx decl %edx inc %esi jmp .Lc16w .Lc16H: incl %ecx decl %edx jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-l.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16H inc %esi jmp .Lc16w .Lc16H: jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-m.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w inc %esi jmp .Lc16w .Lc16z: addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-n.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z .Lc16xx: movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w inc %esi cmpl $0,%edx jg .Lc16xx .Lc16z: addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-o.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w: cmpl $0,%edx jle .Lc16z .Lc16xx: movzbl (%ecx),%eax incl %ecx decl %edx cmpl 
$32,%eax jne .Lc16w inc %esi cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w inc %esi cmpl $0,%edx jg .Lc16xx .Lc16z: addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-p.s: Main_zdwcnt_info: movl (%ebp),%esi movl 4(%ebp),%ecx addl 12(%ebp),%ecx movl 16(%ebp),%edx .Lc16w4: cmpl $4,%edx jl .Lc16wxx movl (%ecx),%eax addl $4,%ecx subl $4,%edx cmpb $32,%al jne .Lc16wa incl %esi .Lc16wa: cmpb $32,%ah jne .Lc16wb incl %esi .Lc16wb: shrl $16,%eax cmpb $32,%al jne .Lc16wc incl %esi .Lc16wc: cmpb $32,%ah jne .Lc16w4 incl %esi jmp .Lc16w4 .Lc16w1: cmpl $0,%edx jle .Lc16z .Lc16wxx: movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w1 inc %esi jmp .Lc16w1 .Lc16z: addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-q.s: Main_zdwcnt_info: movl (%ebp),%esi /* #spaces found */ movl 4(%ebp),%ecx /* ptr */ addl 12(%ebp),%ecx /* ... + idx */ movl 16(%ebp),%edx /* cnt of remaining bytes */ emms /* clear fp tags so we can use mmx instrs */ mov $0x20202020,%eax movd %eax,%mm1 /* mm1: 0000000020202020 */ movq %mm1,%mm0 /* mm0: 0000000020202020 */ psllq $32,%mm1 /* mm1: 2020202000000000 */ por %mm0,%mm1 /* mm1: 2020202020202020 */ mov $0x01010101,%eax movd %eax,%mm2 /* mm2: 0000000001010101 */ movq %mm2,%mm0 /* mm0: 0000000001010101 */ psllq $32,%mm2 /* mm2: 0101010100000000 */ por %mm0,%mm2 /* mm2: 0101010101010101 */ /* MMX loads can use any alignment (potentially at a speed-hit) */ /* this loop looks at 8 bytes at a time */ .Lc16w8: cmpl $8,%edx jl .Lc16w1 movq (%ecx),%mm0 /* mm0 holds 8 characters */ addl $8,%ecx subl $8,%edx pcmpeqb %mm1,%mm0 /* cmp byte for byte with ' ' */ /* the result flag is 00 or FF */ pand %mm2,%mm0 /* turn FF into 01, which is actually useful */ /* if we could just add the bytes up horizontally in %mm0, sigh.. 
.*/ movd %mm0,%eax push %eax add %ah, %al and $0x03,%eax add %eax,%esi pop %eax shr $16,%eax add %ah,%al and $0x03,%eax add %eax,%esi psrlq $32,%mm0 movd %mm0,%eax push %eax add %ah, %al and $0x03,%eax add %eax,%esi pop %eax shr $16,%eax add %ah,%al and $0x03,%eax add %eax,%esi jmp .Lc16w8 /* this loop looks at one byte at a time to handle the remainder */ .Lc16w1: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w1 inc %esi jmp .Lc16w1 /* done, remember to clear fp/mmx tags with emms */ .Lc16z: emms addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-r.s: Main_zdwcnt_info: movl (%ebp),%esi /* #spaces found */ movl 4(%ebp),%ecx /* ptr */ addl 12(%ebp),%ecx /* ... + idx */ movl 16(%ebp),%edx /* cnt of remaining bytes */ emms /* clear fp tags so we can use mmx instrs */ mov $0x20202020,%eax movd %eax,%mm1 /* mm1: 0000000020202020 */ movq %mm1,%mm0 /* mm0: 0000000020202020 */ psllq $32,%mm1 /* mm1: 2020202000000000 */ por %mm0,%mm1 /* mm1: 2020202020202020 */ mov $0x01010101,%eax movd %eax,%mm2 /* mm2: 0000000001010101 */ movq %mm2,%mm0 /* mm0: 0000000001010101 */ psllq $32,%mm2 /* mm2: 0101010100000000 */ por %mm0,%mm2 /* mm2: 0101010101010101 */ /* MMX loads can use any alignment (potentially at a speed-hit) */ /* therefore we don't have to try to read 1-7 bytes one at a time */ /* first in order to end up with an aligned %ecx. */ .Lc16_mainloop: cmpl $8,%edx jl .Lc16w1 movl %edx,%eax shr $3,%eax cmpl $127,%eax jle .Lc16_127 movl $127,%eax .Lc16_127: shl $3,%eax sub %eax,%edx shr $3,%eax pxor %mm3,%mm3 /* clear block of space counters */ /* loop up to 127 times in a loop that looks at 8 bytes at a time. */ /* Going above 255 could overflow the 8 counters in mm3. */ /* Going above 127 could overflow the horizontal summation code. 
*/ .Lc16w8: cmpl $0,%eax jle .Lc16w8end movq (%ecx),%mm0 /* mm0 holds 8 characters */ addl $8,%ecx decl %eax pcmpeqb %mm1,%mm0 /* cmp byte for byte with ' ' */ /* the result flag is 00 or FF */ pand %mm2,%mm0 /* turn FF into 01, which is actually useful */ paddb %mm0,%mm3 /* add to the 8 space counters */ jmp .Lc16w8 .Lc16w8end: /* sum the 8 space counters in mm3 and add to %esi */ /* if only MMX had horizontal byte adds... */ movd %mm3,%eax push %eax add %ah, %al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi pop %eax shr $16,%eax add %ah,%al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi psrlq $32,%mm3 movd %mm3,%eax push %eax add %ah, %al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi pop %eax shr $16,%eax add %ah,%al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi jmp .Lc16_mainloop /* this loop looks at one byte at a time to handle the remainder */ .Lc16w1: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w1 inc %esi jmp .Lc16w1 /* done, remember to clear fp/mmx tags with emms */ .Lc16z: emms addl $20,%ebp jmp *(%ebp) ------------------------------ hand/space-bs-c8-acc-1-s.s: Main_zdwcnt_info: movl (%ebp),%esi /* #spaces found */ movl 4(%ebp),%ecx /* ptr */ addl 12(%ebp),%ecx /* ... + idx */ movl 16(%ebp),%edx /* cnt of remaining bytes */ emms /* clear fp tags so we can use mmx instrs */ mov $0x20202020,%eax movd %eax,%mm1 /* mm1: 0000000020202020 */ movq %mm1,%mm0 /* mm0: 0000000020202020 */ psllq $32,%mm1 /* mm1: 2020202000000000 */ por %mm0,%mm1 /* mm1: 2020202020202020 */ mov $0x01010101,%eax movd %eax,%mm2 /* mm2: 0000000001010101 */ movq %mm2,%mm0 /* mm0: 0000000001010101 */ psllq $32,%mm2 /* mm2: 0101010100000000 */ por %mm0,%mm2 /* mm2: 0101010101010101 */ /* MMX loads can use any alignment (potentially at a speed-hit) */ /* therefore we don't have to try to read 1-7 bytes one at a time */ /* first in order to end up with an aligned %ecx. 
*/ .Lc16_mainloop: cmpl $8,%edx jl .Lc16w1 movl %edx,%eax shr $3,%eax cmpl $127,%eax jle .Lc16_127 movl $127,%eax .Lc16_127: shl $3,%eax sub %eax,%edx shr $3,%eax pxor %mm3,%mm3 /* clear block of space counters */ /* loop up to 127 times in a loop that looks at 8 bytes at a time. */ /* Going above 255 could overflow the 8 counters in mm3. */ /* Going above 127 could overflow the horizontal summation code. */ cmpl $0,%eax jle .Lc16w8end /* this is an unspeakably ugly and sloppy loop unrolling. Doesn't */ /* seem to help much on an Athlon64 3000+. */ test $1,%eax jz .Lc16w8 incl %eax jmp .Lc16w8x .Lc16w8: movq (%ecx),%mm0 /* mm0 holds 8 characters */ addl $8,%ecx pcmpeqb %mm1,%mm0 /* cmp byte for byte with ' ' */ /* the result flag is 00 or FF */ pand %mm2,%mm0 /* turn FF into 01, which is actually useful */ paddb %mm0,%mm3 /* add to the 8 space counters */ .Lc16w8x: movq (%ecx),%mm0 /* mm0 holds 8 characters */ addl $8,%ecx pcmpeqb %mm1,%mm0 /* cmp byte for byte with ' ' */ /* the result flag is 00 or FF */ pand %mm2,%mm0 /* turn FF into 01, which is actually useful */ paddb %mm0,%mm3 /* add to the 8 space counters */ subl $2,%eax jnz .Lc16w8 .Lc16w8end: /* sum the 8 space counters in mm3 and add to %esi */ /* if only MMX had horizontal byte adds... */ movd %mm3,%eax push %eax add %ah, %al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi pop %eax shr $16,%eax add %ah,%al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi psrlq $32,%mm3 movd %mm3,%eax push %eax add %ah, %al /* NOTE! potential overflow! */ and $0xFF,%eax add %eax,%esi pop %eax shr $16,%eax add %ah,%al /* NOTE! potential overflow! 
*/ and $0xFF,%eax add %eax,%esi jmp .Lc16_mainloop /* this loop looks at one byte at a time to handle the remainder */ .Lc16w1: cmpl $0,%edx jle .Lc16z movzbl (%ecx),%eax incl %ecx decl %edx cmpl $32,%eax jne .Lc16w1 inc %esi jmp .Lc16w1 /* done, remember to clear fp/mmx tags with emms */ .Lc16z: emms addl $20,%ebp jmp *(%ebp) ==============================
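The trick used in the last MMX versions above (compare 8 bytes against ' ' at once, turn each match into 01, and accumulate into 8 byte-wide counters that are summed only occasionally) can also be sketched in portable C. This is only an illustration and not one of the benchmarks: the names space_lanes and count_spaces are made up here, and a well-known bit trick on a 64-bit integer stands in for pcmpeqb/pand.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 0x01 in every byte lane of v that equals
   ' ' (0x20) and 0x00 in every other lane (the job pcmpeqb + pand do). */
static uint64_t space_lanes(uint64_t v)
{
    const uint64_t low7 = 0x7f7f7f7f7f7f7f7fULL;
    uint64_t x = v ^ 0x2020202020202020ULL; /* lanes equal to ' ' become 00 */
    uint64_t t = (x & low7) + low7;         /* bit 7 set if low 7 bits != 0 */
    t = ~(t | x | low7);                    /* 0x80 exactly where lane was 00 */
    return t >> 7;                          /* 0x80 -> 0x01 in each lane */
}

/* Hypothetical helper: counts the spaces in the n bytes starting at p. */
size_t count_spaces(const char *p, size_t n)
{
    size_t total = 0;

    while (n >= 8) {
        size_t blocks = n / 8;
        if (blocks > 255)
            blocks = 255;       /* a byte-wide counter holds at most 255 */
        n -= blocks * 8;

        uint64_t counters = 0;  /* 8 space counters, one per byte lane */
        for (; blocks > 0; blocks--) {
            uint64_t v;
            memcpy(&v, p, 8);   /* load that is safe at any alignment */
            p += 8;
            counters += space_lanes(v);     /* the paddb step */
        }

        /* horizontal sum of the 8 counters */
        for (int i = 0; i < 8; i++)
            total += (counters >> (8 * i)) & 0xff;
    }

    for (; n > 0; n--)          /* remainder, one byte at a time */
        if (*p++ == ' ')
            total++;

    return total;
}
```

The sketch can run 255 blocks between summations because each counter is extracted into a wider integer before it is added. The assembly above stops at 127 because it adds two byte counters in an 8-bit register first, and 127 + 127 = 254 is the largest pair sum that is guaranteed to fit.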


[1/16] SBM: The Haskell and C benchmarks
by Peter Firefly Brodersen Lund 22 Dec '07


Here are the 48 Haskell and C benchmarks.

Don Stewart contributed three (although I had to fight a bit to make one of them compile).
Jules Bean (quicksilver) contributed one.
Bertram Felgenhauer (int-e) contributed three (in the form of a single file, which I untangled).
Spencer Jannsen (sjannsen) contributed one.
wli (William Lee Irwin III) inspired me to add the getwchar benchmarks.

I used the following shell code to gather all the benchmarks:

(for F in hs/*.hs c/*.c; \
  do echo "------------------------------"; \
     echo "$F:"; \
     echo ; \
     cat "$F"; \
  done; \
  echo "==============================" \
) > xx.txt

They are not in the same order as in the Makefile or in the reports, unfortunately.

-Peter

------------------------------
hs/byte-bs----acc.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as B

cnt :: Int -> B.ByteString -> Int
cnt !acc !bs = if B.null bs
                 then acc
                 else cnt (acc+1) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)

------------------------------
hs/byte-bs----foldlx.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as B

cnt :: B.ByteString -> Int
cnt !bs = B.foldl' (\sum _ -> sum+1) 0 bs

main = do s <- B.getContents
          print (cnt s)

------------------------------
hs/byte-bs----foldrx.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString as B

cnt :: B.ByteString -> Int
cnt !bs = B.foldr' (\_ sum -> sum+1) 0 bs

main = do s <- B.getContents
          print (cnt s)

------------------------------
hs/byte-bsl---acc.hs:

{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy as B

cnt :: Int -> B.ByteString -> Int
cnt !acc !bs = if B.null bs
                 then acc
                 else cnt (acc+1) (B.tail bs)

main = do s <- B.getContents
          print (cnt 0 s)

------------------------------
hs/byte-xxxxx-acc-1.hs:

{-# LANGUAGE BangPatterns #-}

cnt :: Int -> String -> Int
cnt !acc bs = if null bs
                then acc
                else cnt (acc+1) (tail bs)

main = do s <- getContents
          print (cnt 0 s)

------------------------------
hs/byte-xxxxx-acc-2.hs:

{-# LANGUAGE
BangPatterns #-} cnt :: Int -> String -> Int cnt !acc !bs = if null bs then acc else cnt (acc+1) (tail bs) main = do s <- getContents print (cnt 0 s) ------------------------------ hs/byte-xxxxx-foldl.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt !bs = foldl (\sum _ -> sum+1) 0 bs main = do s <- getContents print (cnt s) ------------------------------ hs/byte-xxxxx-foldr-1.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt bs = foldr (\_ sum -> sum+1) 0 bs main = do s <- getContents print (cnt s) ------------------------------ hs/byte-xxxxx-foldr-2.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt !bs = foldr (\_ sum -> sum+1) 0 bs main = do s <- getContents print (cnt s) ------------------------------ hs/space-bs-c8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: Int -> B.ByteString -> Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s <- B.getContents print (cnt 0 s) ------------------------------ hs/space-bs-c8-count.hs: -- Don Stewart import qualified Data.ByteString.Char8 as B main = print . 
B.count ' ' =<< B.getContents ------------------------------ hs/space-bs-c8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bs-c8-foldlx-2.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B main = do s <- B.getContents print $ B.foldl' (\v c -> if c == ' ' then v+1 else v :: Int) 0 s ------------------------------ hs/space-bs-c8-foldrx.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.foldr' (\c sum -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bs-c8-lenfil.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.length (B.filter (== ' ') bs) main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-acc-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: Int -> B.ByteString -> Int cnt !acc bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s <- B.getContents print (cnt 0 s) ------------------------------ hs/space-bslc8-acc-2.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: Int -> B.ByteString -> Int cnt !acc !bs = if B.null bs then acc else cnt (if B.head bs == ' ' then acc+1 else acc) (B.tail bs) main = do s <- B.getContents print (cnt 0 s) ------------------------------ hs/space-bslc8-acc-3.hs: {-# LANGUAGE BangPatterns #-} -- this version by quicksilver import qualified Data.ByteString.Lazy.Char8 as B cnt :: Int -> B.ByteString -> Int cnt !acc bs | B.null bs = acc | B.head bs == ' ' = cnt (acc+1) (B.tail bs) | otherwise = cnt acc (B.tail bs) main = do s <- 
B.getContents print (cnt 0 s) ------------------------------ hs/space-bslc8-chunk-1.hs: {-# LANGUAGE BangPatterns #-} -- this version by int-e import qualified Data.ByteString.Lazy.Char8 as B import qualified Data.ByteString.Char8 as BS import Data.List (foldl') cntS :: Int -> BS.ByteString -> Int cntS !acc !bs = case BS.uncons bs of Nothing -> acc Just (hd, tl) | hd == ' ' -> cntS (acc+1) tl | otherwise -> cntS acc tl cnt :: Int -> B.ByteString -> Int cnt acc bs = foldl' cntS acc (B.toChunks bs) main = do s <- B.getContents print $ cnt 0 s ------------------------------ hs/space-bslc8-chunk-2.hs: {-# LANGUAGE BangPatterns #-} -- this version by int-e import qualified Data.ByteString.Lazy.Char8 as B import qualified Data.ByteString.Char8 as BS import Data.List (foldl') cntS' :: Int -> BS.ByteString -> Int cntS' !acc !bs | BS.null bs = acc | BS.head bs == ' ' = cntS' (acc+1) (BS.tail bs) | otherwise = cntS' acc (BS.tail bs) cnt :: Int -> B.ByteString -> Int cnt acc bs = foldl' cntS' acc (B.toChunks bs) main = do s <- B.getContents print $ cnt 0 s ------------------------------ hs/space-bslc8-chunk-3.hs: {-# LANGUAGE BangPatterns #-} -- this version by int-e import qualified Data.ByteString.Lazy.Char8 as B import qualified Data.ByteString.Char8 as BS import Data.List (foldl') cntS'' :: Int -> BS.ByteString -> Int cntS'' !acc !bs = BS.foldl' (\v c -> if c == ' ' then v+1 else v) acc bs cnt :: Int -> B.ByteString -> Int cnt acc bs = foldl' cntS'' acc (B.toChunks bs) main = do s <- B.getContents print $ cnt 0 s ------------------------------ hs/space-bslc8-chunk-4.hs: {-# LANGUAGE BangPatterns #-} -- Don Stewart import qualified Data.ByteString.Lazy.Char8 as BLC8 import qualified Data.ByteString.Lazy.Internal as BLI import qualified Data.ByteString as B import qualified Data.ByteString.Unsafe as BU import qualified Data.ByteString.Internal as BI cnt :: Int -> BLC8.ByteString -> Int cnt n BLI.Empty = n cnt n (BLI.Chunk x xs) = cnt (n + cnt_strict 0 x) xs -- process lazy 
spine where -- now we can process a chunk without checking for Empty cnt_strict !i !s -- then strict chunk | B.null s = i | c == ' ' = cnt_strict (i+1) t | otherwise = cnt_strict i t where (c,t) = (BI.w2c (BU.unsafeHead s), BU.unsafeTail s) -- no bounds check main = do s <- BLC8.getContents; print (cnt 0 s) ------------------------------ hs/space-bslc8-count.hs: -- Don Stewart import qualified Data.ByteString.Lazy.Char8 as B main = print . B.count ' ' =<< B.getContents ------------------------------ hs/space-bslc8-foldl.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString -> Int cnt !bs = B.foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-foldlx-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-foldlx-2.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString -> Int cnt !bs = B.foldl' (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-foldr-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString -> Int cnt bs = B.foldr (\c sum -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-foldr-2.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B cnt :: B.ByteString -> Int cnt !bs = B.foldr (\c sum -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-lenfil-1.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B import GHC.Int 
(Int64) -- note that D.BS.Lazy.Char8.length is ByteString -> Int64 -- D.BS.C8.length is ByteString -> Int cnt :: B.ByteString -> Int64 cnt bs = B.length (B.filter (== ' ') bs) main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bslc8-lenfil-2.hs: {-# LANGUAGE BangPatterns #-} import qualified Data.ByteString.Lazy.Char8 as B import GHC.Int (Int64) -- note that D.BS.Lazy.Char8.length is ByteString -> Int64 -- D.BS.C8.length is ByteString -> Int cnt :: B.ByteString -> Int64 cnt !bs = B.length (B.filter (== ' ') bs) main = do s <- B.getContents print (cnt s) ------------------------------ hs/space-bsl---foldlx.hs: {-# LANGUAGE BangPatterns #-} -- this version by sjannsen import Data.ByteString.Lazy as B cnt :: B.ByteString -> Int cnt = B.foldl' f 0 where f !n 32 = n+1 f !n _ = n main = do s <- B.getContents print $ cnt s ------------------------------ hs/space-xxxxx-acc-1.hs: {-# LANGUAGE BangPatterns #-} cnt :: Int -> String -> Int cnt !acc bs = if null bs then acc else cnt (if head bs == ' ' then acc+1 else acc) (tail bs) main = do s <- getContents print (cnt 0 s) ------------------------------ hs/space-xxxxx-acc-2.hs: {-# LANGUAGE BangPatterns #-} cnt :: Int -> String -> Int cnt !acc !bs = if null bs then acc else cnt (if head bs == ' ' then acc+1 else acc) (tail bs) main = do s <- getContents print (cnt 0 s) ------------------------------ hs/space-xxxxx-foldl.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt bs = foldl (\sum c -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- getContents print (cnt s) ------------------------------ hs/space-xxxxx-foldr-1.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt bs = foldr (\c sum -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- getContents print (cnt s) ------------------------------ hs/space-xxxxx-foldr-2.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt !bs = foldr (\c sum -> if c == ' ' then sum+1 else sum) 0 bs main = do s <- getContents print (cnt 
s) ------------------------------ hs/space-xxxxx-lenfil.hs: {-# LANGUAGE BangPatterns #-} cnt :: String -> Int cnt bs = length (filter (== ' ') bs) main = do s <- getContents print (cnt s) ------------------------------ c/byte-4k.c: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <errno.h> int Main_cnt() { int cnt = 0; ssize_t sze; char buf[4*1024]; do { again: sze = read(fileno(stdin), buf, sizeof(buf)); if (sze < 0) { switch (errno) { case EAGAIN: goto again; default: perror("read() failed\n"); exit(1); } } cnt += sze; } while (sze != 0); return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/byte-getchar.c: #include <stdio.h> #include <stdlib.h> int Main_cnt() { int cnt = 0; int c; while ((c = getchar()) != EOF) cnt++; return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/byte-getchar-u.c: #include <stdio.h> #include <stdlib.h> int Main_cnt() { int cnt = 0; int c; while ((c = getchar_unlocked()) != EOF) cnt++; return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-32k-8.c: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <errno.h> int Main_cnt() { int cnt = 0; ssize_t sze, left; char buf[32760]; char *p; printf("using a buffer of %g KB\n", sizeof(buf) / 1024.0); do { again: sze = read(fileno(stdin), buf, sizeof(buf)); if (sze < 0) { switch (errno) { case EAGAIN: goto again; default: perror("read() failed\n"); exit(1); } } for (p = buf, left=sze; left > 0; left--) if (*p++ == ' ') cnt++; } while (sze != 0); return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-32k.c: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <errno.h> int Main_cnt() { int cnt = 0; ssize_t sze, left; char buf[32*1024]; char *p; printf("using a buffer of %g KB\n", sizeof(buf) / 1024.0); do { again: 
sze = read(fileno(stdin), buf, sizeof(buf)); if (sze < 0) { switch (errno) { case EAGAIN: goto again; default: perror("read() failed\n"); exit(1); } } for (p = buf, left=sze; left > 0; left--) if (*p++ == ' ') cnt++; } while (sze != 0); return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-4k.c: #include <stdio.h> #include <unistd.h> #include <stdlib.h> #include <errno.h> int Main_cnt() { int cnt = 0; ssize_t sze, left; char buf[4*1024]; char *p; printf("using a buffer of %g KB\n", sizeof(buf) / 1024.0); do { again: sze = read(fileno(stdin), buf, sizeof(buf)); if (sze < 0) { switch (errno) { case EAGAIN: goto again; default: perror("read() failed\n"); exit(1); } } for (p = buf, left=sze; left > 0; left--) if (*p++ == ' ') cnt++; } while (sze != 0); return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-getchar.c: #include <stdio.h> #include <stdlib.h> int Main_cnt() { int cnt = 0; int c; while ((c = getchar()) != EOF) if (c == ' ') cnt++; return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-getchar-u.c: #include <stdio.h> #include <stdlib.h> int Main_cnt() { int cnt = 0; int c; while ((c = getchar_unlocked()) != EOF) if (c == ' ') cnt++; return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-getwchar.c: #include <stdio.h> #include <wchar.h> #include <stdlib.h> int Main_cnt() { int cnt = 0; wint_t c; while ((c = getwchar()) != WEOF) if (c == ' ') cnt++; return cnt; } int main() { printf("%d\n", Main_cnt()); return EXIT_SUCCESS; } ------------------------------ c/space-getwchar-u.c: #define _GNU_SOURCE #include <stdio.h> #include <wchar.h> #include <stdlib.h> int Main_cnt() { int cnt = 0; wint_t c; while ((c = getwchar_unlocked()) != WEOF) if (c == ' ') cnt++; return cnt; } int main() { printf("%d\n", Main_cnt()); return 
EXIT_SUCCESS;
}

------------------------------
c/space-megabuf.c:

#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

int isfile(int handle)
{
  struct stat buf;

  if (fstat(handle, &buf) == -1) {
    perror("fstat(stdin)\n");
    exit(1);
  }
  return S_ISREG(buf.st_mode);
}

ssize_t getbufsize()
{
  if (isfile(fileno(stdin))) {
    off_t x;

    x = lseek(fileno(stdin), 0, SEEK_END);
    if (x == -1) {
      perror("lseek(... SEEK_END)\n");
      exit(1);
    }
    if (lseek(fileno(stdin), 0, SEEK_SET) == -1) {
      perror("lseek(... SEEK_SET)\n");
      exit(1);
    }
    if (x > 1*1024*1024*1024LL) {
      x = 1024*1024*1024LL;
    }
    return x;              /* file size for files */
  } else {
    return 10*1024*1024;   /* 10M for non-files */
  }
}

int Main_cnt()
{
  int cnt = 0, reads=0, retries=0;
  ssize_t sze, left, bufsize;
  char *buf;
  char *p;

  bufsize = getbufsize();
  printf("using a buffer of %g MB\n", bufsize / (1024*1024.0));
  buf = malloc(bufsize);
  if (!buf) {
    fprintf(stderr, "couldn't allocate %lld bytes\n", (long long) bufsize);
    exit(1);   /* don't go on to read() into a NULL buffer */
  }

  do {
again:
    sze = read(fileno(stdin), buf, bufsize);
    if (sze < 0) {
      switch (errno) {
        case EAGAIN: retries++;
                     goto again;
        default:     perror("read() failed\n");
                     exit(1);
      }
    }
    reads++;
    for (p = buf, left=sze; left > 0; left--)
      if (*p++ == ' ')
        cnt++;
  } while (sze != 0);

  printf("%d reads, %d retries\n", reads, retries);
  return cnt;
}

int main()
{
  printf("%d\n", Main_cnt());
  return EXIT_SUCCESS;
}

==============================


[0/16] SBM: Simple Bytestring Microbenchmarks, Overview and Introduction
by Peter Firefly Brodersen Lund 22 Dec '07


Table of contents:

   0/16  This email
   1/16  The Haskell and C benchmarks
   2/16  Inner loops of the hand-tweaked assembly benchmarks
   3/16  The Makefile
   4/16  How to use the Makefile (how to run benchmarks etc.)
   5/16  Support scripts and scriptlets
   6/16  6.9.20071124 Athlon Duron
   7/16  6.9.20071208 Core Duo
   8/16  6.9.20071119 Pentium III
   9/16  6.9.20071119 Athlon64
  10/16  Graphs for 6.9.x across four cpus
  11/16  Graphs for hand-tweaked assembly benchmarks
  12/16  Graphs for 7 ghc/bytestring combinations on a 2GHz Athlon64 300+
  13/16  Graphs that show the infidelity of -sstderr
  14/16  Behind the measurements (rationale)
  15/16  Predictions compared to the measurements
  16/16  Discussion and Conclusion


Simple Bytestring Microbenchmarks
---------------------------------

Introduction
------------
I love parsers. I have been writing parsers for fun for over twenty years.

The nicest way to construct a parser used to be to write a recursive descent parser by hand. If you had to work with people who'd had the misfortune of a university education, you would resort to lex and yacc (flex and bison), despite their many shortcomings.

Combinator parsers are the only real improvement over hand-written recursive descent parsers that I know of. They do tend to require features that not all languages provide. I don't know how to write a good one in C, for example. They do work very well in Haskell, though.

So, I've started writing a parser in Haskell (ghc, really) for the programming language X++. X++ is not a nice language but that's beside the point. The challenge for me is to write an efficient compiler + provide good analysis tools for X++. I think I stand a better chance of doing that in Haskell (ghc) than in practically any other language.

There are a few drawbacks, though.

I love speed. And efficiency.

String handling in Haskell
--------------------------
Native strings are simple and generally work well, but they are slow, take up too much memory and there's the whole encoding mess that still needs to be sorted out.

People have worked on other string representations and libraries for quite a while. I think packedstrings (as used in Darcs) was one of the first ones.

Bytestrings is the current incarnation. It seems to be just the right thing, especially when combined with improved automatic fusion in the compiler so higher-order functions don't have to be expensive.

I use Parsec as my parser combinator library at the moment, which uses native strings. I would dearly love Parsec to be faster and use less memory. I think bytestrings will be part of any substantial improvement in Parsec's resource consumption.

Other performance concerns
--------------------------
File I/O is also interesting for a compiler writer. I would like to have a program that is as fast as possible both when the source files are already cached by the operating system and when they are not. The former situation is best handled with mmap() and the latter is best handled by read(), preferably in combination with multi-threading so the compiler doesn't have to waste too much time waiting for disk seeks.

Haskell seems to be very close to ideal for me because it has very good threading support and very accessible raw access to the operating system.

File I/O is not my current bottleneck, though. I'll probably take a closer look at file I/O when the other performance problems have been solved.

Then there's the general quality of the generated code. Having read just about every paper on ghc that was available back in the late nineties (when I first looked at Haskell), I'd thought that the quality was good and that the compiler also had extremely good high-level optimizations, in other words, that abstraction was free.
I also read the C-- papers and thought that it was a very interesting and promising approach. I'd expected the C-- path to have matured and be well-optimized by now.

Unfortunately, the backend is /the/ weak spot in ghc. The frontend is heroic, the typesystems are (too) abundant and rich, the language itself is nice -- but the backend is not. Looking at the generated code I'd say that it is slightly better than Turbo Pascal 3.x and about on par with Turbo Pascal 4.0, a compiler that didn't use any intermediate code at all, compiled each statement in isolation, was single-pass, and had a compilation speed of about 27000 lines per minute on an 8 MHz IBM PC AT.

Ecosystem and culture
---------------------
Haskell has a very good ecosystem. Probably the second best one amongst the modern functional languages. Ten years ago, I'd thought that Standard ML would win but the only MLish language with a good ecosystem and culture is OCaml, which unfortunately isn't really Standard ML.

By ecosystem I mean things like access to raw operating system calls, access to libraries written in other languages, readily available libraries for graphical user interfaces, databases, XML processing, network I/O. Parsing is nice, too, but practically all functional programming languages have that -- and not much else.

The culture is also good. People actually use this stuff. They care about it. And they don't hang around waiting for somebody to tell them what to do, they start on their own. And they actually seem to be interested in good performance :)

Hackage and cabal are very promising already and may be what finally makes ghc real-world useful for more people, because most people are not interested in working with raw source packages and fiddling with compiler flags and weird error messages. They don't like chasing dependencies, either.

So, what's the problem?
-----------------------
The major problem with Haskell (ghc) is that its performance (in terms of both speed and memory use) is unpredictable.

The second-worst problem is that the actual performance is not good enough.

These benchmarks
----------------
I have written a bunch of microbenchmarks that either count all the bytes in stdin or all the spaces in stdin. And some supporting benchmarks in C. And I've also handtweaked the assembly of one byte-counting and one space-counting microbenchmark to illustrate what difference it would make if the backend could use registers in a less stupid^W^W more efficient manner.

Homepage + source code
----------------------
I have put up a homepage for the benchmarks at:

  http://vax64.dk/ghc-bs-tests

The raw measurements are in tarballs on that page.

The source code for the benchmarks (+ support code) is in a mercurial repository at:

  http://vax64.dyndns.org/repo/hg/ghc-bs-tests

I used scripts to install the various versions of ghc and bytestring both to avoid operator error and so you could look over my shoulder. The scripts are in a mercurial repository at:

  http://vax64.dyndns.org/repo/hg/ghc-installations

You can either follow the link and download any version you like as a tarball or you can (preferably) clone the repositories with:

  hg clone http://vax64.dyndns.org/repo/hg/ghc-bs-tests
  hg clone http://vax64.dyndns.org/repo/hg/ghc-installations

All my code in those repositories is GPLv2.

The text file 'text.txt' in the ghc-bs-tests repository is unfortunately partly in Danish and partly in very terse English.

Acknowledgements
----------------
Daniel Fischer, for running the benchmarks on his SuSE 8.2 Athlon Duron 1200 MHz machine and for being helpful and patient while I made the scripts work on a 2.4 kernel and with unhelpful versions of GNU Make and strace.

Erik van der Meer, for letting me run the benchmarks (and install ghc) on his Core Duo laptop. And for discussions on measurements over the years.

Don Stewart, for playing along and for fixing a bytestring problem. And for contributing three benchmarks (one of which I had to change a bit before it would compile).

Duncan Coutts, for playing along.

Jules Bean (quicksilver), for contributing a benchmark.

Bertram Felgenhauer (int-e), for contributing three benchmarks (in the form of a single file, which I untangled to three files).

Spencer Janssen (sjanssen), for contributing a benchmark.

William Lee Irwin III (wli), for inspiring me to add the getwchar benchmarks.

-Peter
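
[Editor's sketch] To make "count all the bytes in stdin or all the spaces in stdin" concrete, here is a minimal sketch using lazy bytestrings. It is not one of the benchmark programs from the repository: the names countBytes and countSpaces are made up for illustration, and it leans on the library functions length and count rather than on hand-written loops or folds.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- | Number of bytes in the input (hypothetical name, for illustration only).
countBytes :: L.ByteString -> Int64
countBytes = L.length

-- | Number of bytes equal to the space character (hypothetical name).
countSpaces :: L.ByteString -> Int64
countSpaces = L.count ' '

main :: IO ()
main = do
  -- getContents reads stdin lazily, one chunk at a time, so a single
  -- traversal does not need the whole input in memory at once.
  input <- L.getContents
  -- Replace countSpaces with countBytes to get the byte-counting variant.
  print (countSpaces input)
```

Compiled with ghc -O2, it can be fed a file on stdin in the same way as the C programs, e.g. ./count < somefile.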


readline problems building GHC on Mac OS X (was: Re: [Haskell-cafe] Re: ANNOUNCE: GHC version 6.8.2)
by Thorkil Naur 22 Dec '07


Hello,

Although I have been building various GHC versions on various PPC Mac OS X systems for a while now, I'm afraid that I don't really have a good answer for your questions. However, your questions provide an excellent opportunity to discuss this, so that is what I am going to do.

There are several questions here:

(1) Which readline do we use?
(2) Where do we store it?
(3) What do we call it?
(4) How do we make the Haskell readline library build process select the right one?

And perhaps

(5) How do we persuade the GHC build process to make the Haskell readline build that happens as part of building GHC select the right one?

One at a time:

1. Which readline do we use?

GNU readline, of course. As opposed to the readline installed as /usr/include/readline/*.h and /usr/lib/libreadline.dylib on our PPC Mac OS X machines, which are said to be (and can even be observed to be) symbolic links to something called libedit and which, for me, has never managed to provide anything suitable for use by GHC.

But what is GNU readline, then? I don't exactly know, but my best guess is something like ftp://ftp.cwru.edu/pub/bash/readline-5.2.tar.gz. I never tried to install GNU readline directly from this file. On some occasions, I have installed readline from mac ports. Although I am fairly confident that what was installed was some version of the GNU readline, I am not sure. On other occasions, I have installed GNU readline from various sources related to GHC, sometimes known to me, at other times not.

2. Where do we store readline?

I don't know where a readline based on the GNU download ftp://ftp.cwru.edu/pub/bash/readline-5.2.tar.gz would become installed (by default). The mac ports version installs by default at /opt/local/include/readline/*.h and /opt/local/lib/libreadline.*.
Various readlines related to GHC have installed themselves (or were requested to become installed) as frameworks, this new and different Mac OS X mechanism for referring to a set of header files and corresponding library. So they have gone into /Library/Frameworks.

3. What do we call it?

Here is where the interesting things start to happen: A central problem has been the ambiguity caused by Apple's decision to install symbolic links to the "edit" headers and "edit" library called "readline". And various mechanisms have been used to work around this problem:

(a) If you have installed a mac ports readline at /opt/local/..., with GHC 6.6 at least, you were able to use the --with-readline-* options to direct GHC/the library build process to look in these directories first and thereby avoid the "edit" library;

(b) At some point, a (possibly modified) version of the GNU readline library appeared, intended to be installed as a framework by the name of "GNUreadline" (as opposed to the bare "readline" name used earlier). This avoids the name clash caused by the Apple linking of "readline" to "edit". The problem that the Haskell readline library now needs to refer to a framework "GNUreadline" rather than ... (whatever it is that it refers to in a more Unix'y setting) is solvable. In addition, however, the readline library (or rather: The GNUreadline library derived from the readline library) refers to itself using the bare "readline" name, so that has to be changed also, leading to a need to maintain a complete and slightly modified version (GNUreadline) of the readline library.

It seems to me that this situation is less than ideal. I mean, in theory, somebody may come along at some point with some library calling itself GNUreadline and then we would have to adapt, doing the whole thing all over again. This manner of avoiding the name clash problem does not seem tenable in the long run.
Instead, what we should be able to do, is to specify, directly and to the point, that "readline", wherever we stored it, is what we want. That possibility does not exist, unfortunately, so we will have to make the best use that we can of the existing mechanisms, as far as we can figure out what they are, to get the desired effect. And if it turns out that the existing mechanisms do not allow us to do what we want, we need to request extensions and modifications of the mechanisms, until they are able to support our requirements.

I am not quite sure that I am done with this subject, but let me go on with

4. How do we make the Haskell readline library build process select the right one?

This is where I believe we can do something useful, making the Haskell readline library more capable in selecting its foundation readline library. I haven't worked out the details, some discussion is at http://hackage.haskell.org/trac/ghc/ticket/1395 and related tickets, but I am quite sure that methods can be found to select the desired readline library, without resorting to reissuing that library in a changed form and under a new name. And if this turns out to be absolutely impossible, I would much prefer pressing for the introduction of mechanisms that make it possible to select the desired version of the library, removing this impossibility, rather than issuing the library under a different name.

Finally:

5. How do we persuade the GHC build process to make the Haskell readline build that happens as part of building GHC select the right one?

Answer: I don't know. At some point, I did know, that was when the --with-readline-* options were introduced for the GHC ./configure. Nowadays, I am not sure.

Generally, I believe that it is fine for the GHC build process (whatever phase) to pass parameters to the build process of some library.
But at the same time, the fact that such passing of parameters takes place must be very explicitly reported somewhere, in the output of the build process, probably.

Best regards
Thorkil

On Friday 21 December 2007 21:48, John Dorsey wrote:
> (Moving to the cafe)
>
> On a related topic, I've been trying to build 6.8.2 on Leopard lately.
> I've been running up against the infamous OS X readline issues. I know
> some builders here have hacked past it, but I'm looking for a good
> workaround... ideally one that works without changes outside the GHC
> build area (besides installing a real readline).
>
> Here's what I noticed before I started drowning in the build platform.
> (I'm no gnu-configure expert nor GHC insider.)
>
> I can get gnu-readline installed from Macports, no problem.
>
> The top-level configure in GHC doesn't respond to my various attempts:
>
>   o  using --with-readline-libraries and --with-readline-includes
>      (Although it looks like the libraries/readline/configure script
>      might recognize these, I can't get an option to pass through.)
>   o  setting LDFLAGS and CPPFLAGS environment variables (with
>      -L/opt/local/lib and -I/opt/local/include resp.) in my shell
>      before running configure
>   o  playing with the above settings and others in a mk/build.mk
>
> Until Apple fixes their broken-readline issue (maybe when the readline
> compatibility of libedit improves)... maybe the top-level configure can
> pass through flags or settings somehow?
>
> For those who've built with readline on OS X: have you had to resort to
> blasting the existing readline library link, or is there a configuration
> option within the GHC tree that you've gotten to work?
>
> Should I be filing a trac bug instead of asking here?
>
> Thanks for any help. There's no urgency for me; I'm just trying to get
> a working environment at home; I'd prefer to be able to bootstrap from
> the ground up; and I'd like to be able to contribute to testing/debugging
> on OSX.
>
> John
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe(a)haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>


Optimizing cellular automata & the beauty of unlifted types
by Justin Bailey 21 Dec '07


I'm back with another version of my cellular automata simulator. Since the last iteration I discovered GHC's unlifted types and the primitive operations that go with them. Using these types, rather than Ints, sped my code up by 2x or more.

  http://hpaste.org/4151#a2 -- first half of program
  http://hpaste.org/4151#a3 -- remaining (note 3 lines or so from first half are repeated)

The key observation came from looking at the Core files, which showed a lot of int2Word# and word2Int# conversions going on. Figuring out how to remove those led me to the unlifted types. Coding with these types is not easy (for example, I couldn't see a way to write a Word# value directly - I had to write stuff like "int2Word# 1#"), but because I had an existing algorithm to guide me, combined with type checking, it was a fairly straightforward implementation.

At first I felt kind of bad about using these operations, but then I noticed they are used pretty heavily by GHC itself. If it's good enough for GHC, it's good enough for me. The 2x performance gain didn't hurt either.

Finally, the safety that comes from using the ST monad is just awesome. I've got low-level bit munging combined with high-level abstraction where I need it. So cool!

I was disappointed to find early on that using higher-order functions in tight loops was a performance killer. It's unfortunate because I started with a very elegant, short implementation based on a simple Ring buffer and map. The current version is certainly harder to understand and has some weird limitations. However, having the simple implementation let me use QuickCheck to compare their results on random rules and inputs, which gave me high confidence that my complex implementation is correct.

One thing I absolutely love about this program is its memory performance. It manages to stay within 1 - 10 MB of memory, depending on how much output is produced. How cool is that?
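
[Editor's sketch] For readers who have not used unlifted types, here is a small illustration of the style described above. It is not taken from the pasted program; ruleBit is a hypothetical helper, and the only assumption is a GHC with the MagicHash extension and the GHC.Exts module. It unwraps a boxed Word and Int, works on the raw Word# and Int# values with primitive operations, and re-boxes the result.

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), Word (W#), and#, int2Word#, uncheckedShiftRL#)

-- | Bit @i@ of @rule@ (0 or 1), computed with primitive operations on
-- unboxed machine words. Hypothetical helper, for illustration only.
-- 'uncheckedShiftRL#' is only defined for 0 <= i < word size.
ruleBit :: Word -> Int -> Word
ruleBit (W# rule) (I# i) =
  W# (and# (uncheckedShiftRL# rule i) (int2Word# 1#))

main :: IO ()
main = print [ruleBit 110 i | i <- [0 .. 7]]
-- prints [0,1,1,1,0,1,1,0]: the bits of 110, least significant first
```

The "int2Word# 1#" in the sketch is the same workaround mentioned above. The current GHC User's Guide also lists a Word# literal syntax under MagicHash (written 1##), which would avoid the conversion.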

On Dec 3, 2007 2:44 AM, Mirko Rahn <rahn(a)ira.uka.de> wrote:

> It is interesting, that the naive implementation
> ... is only 3 times slower than your quite complex, hard to follow and hard
> to debug implementation.

Now the naive implementation is 100x slower, so I don't feel so bad about this comment any more.

> As always, I prefer to write most code in Haskell, quick, easy, nice,
> reasonable fast, ... If speed matters, I switch to some lower level
> language, as you did staying inside Haskell.

I have to take exception here - I *want* to write my code in Haskell. If Haskell isn't fast enough to beat the C implementation, I'd rather find a way to make my Haskell program faster than switch to some other language.

Justin


eager/strict eval katas
by Thomas Hartman 21 Dec '07


I'm trying to get a better handle on eager/strict eval in Haskell, and a great way to do this is by building up from simple exercises to harder exercises.

So far I have

exercise 1) Add the integers [1..10^6] (stack overflows if you do a naive fold, as described on the wiki).

exercise 2) Find the first integer n such that the average of [1..n] is > 10^6. (The solution involves building an accum list of (average, listLength) tuples. Again you can't do a naive fold due to stack overflow, but in this case even the strict foldl' from Data.List isn't "strict enough"; I had to define my own custom fold to be strict on the tuples.)

Anybody got other suggestions, or links to places where eager eval is required to solve simply stated problems? Or exercises that demystify doing eager IO/eager whatever monad, where that is required?

Also, am I correct that the terms eager and strict can be used more or less interchangeably in this problem space?

Tired of this folk wisdom that Haskell is only for the elite because getting around stack overflow from lazy eval is impossible to teach to newbies.

t.
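
[Editor's sketch] The two exercises above can be sketched as follows. This is one possible solution, not the poster's code: the names sumTo, sumAndLength and firstAverageAbove are made up, and exercise 2 is solved with a strict loop over a running total instead of a list of (average, listLength) tuples. The middle function shows the tuple issue in isolation: foldl' only forces the accumulator as far as the pair constructor, so bang patterns are used to force the two components as well.

```haskell
{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

-- Exercise 1: foldl' forces the running total at every step, so no
-- chain of unevaluated (+) thunks builds up.
sumTo :: Integer -> Integer
sumTo n = foldl' (+) 0 [1 .. n]

-- foldl' evaluates the accumulator only to weak head normal form, which
-- for a pair means the (,) constructor. The bang patterns force both
-- components of the previous pair before the next one is built.
sumAndLength :: [Integer] -> (Integer, Integer)
sumAndLength = foldl' step (0, 0)
  where
    step (!s, !len) x = (s + x, len + 1)

-- Exercise 2: the first n such that the average of [1..n] exceeds limit.
firstAverageAbove :: Integer -> Integer
firstAverageAbove limit = go 0 1
  where
    go !total n
      | total' > limit * n = n   -- same test as total' / n > limit, no division
      | otherwise          = go total' (n + 1)
      where
        total' = total + n       -- sum of [1..n]

main :: IO ()
main = do
  print (sumTo (10 ^ 6))              -- 500000500000
  print (sumAndLength [1 .. 10 ^ 6])  -- (500000500000,1000000)
  print (firstAverageAbove (10 ^ 6))  -- 2000000
```

The last result can be checked by hand: the average of [1..n] is (n+1)/2, which first exceeds 10^6 at n = 2000000.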
