
In response to the two suggestions, I've linked my directory to:

    http://www.isi.edu/~hdaume/haskell/NewBinary

In there you will find FastMutInt, which is (almost) unchanged; Binary.hs, which contains the new bit-based stuff; and a few test programs. You will also find profiles for bits and bytes (only 100000 elements this time, but there's still a huge discrepancy between bits and bytes) in:

    bits_io_p.prof
    bytes_io_p.prof

One thing that immediately surprises me looking at the profile information is that the bits version allocates 200M more heap than the bytes version. I have a function in FastMutInt for or-ing a value into one; the comment next to it says "we should optimize this: ask SimonM :)", so I am asking SimonM now for a version that looks like incFastMutIntBy and hopefully uses an unboxed or. Nevertheless, this only accounts for 1% of the time, so it's probably not a big deal.

All the time is taken up in putBits and getBits, which both check their arguments to make sure you're not putting values that are too big or too small. One thing to do would be to make unsafePutBits and unsafeGetBits, which skip these checks; the default instances could then use those and we wouldn't have to perform so many checks (the first sketch below shows the split I mean). In fact, I commented the checks out and reran the profiled bit version; the results are in:

    bits_io_nc_p.prof

Doing this cuts the difference in time between bit-based ops and byte-based ops in half. (It introduces a small bug with seekBin, because it uses getBits bh 0 to align the bits; a quick when statement fixes this.)

Now, getBits seems to be calling writeFastMutInt too much, probably because of the last case in the long if statement, where we essentially cheat and call getByte:

    writeFastMutInt (bit_cache_r bh) 0
    writeFastMutInt (bit_off_r bh) 0
    bit_cache <- getByte bh
    writeFastMutInt (bit_off_r bh) (num_bits - leftover_bits)
    writeFastMutInt (bit_cache_r bh) (fromIntegral bit_cache)

The only reason we have to do the first two writes is to make sure that getByte doesn't recurse and call getBits. Unfortunately, this is probably the most common case, so I optimized it by essentially duplicating the getByte code in there: I copied getWord8 to getByteNoBits, which completely ignores the bit stuff and is thus very unsafe. (Note that the version of Binary from before the optimizations listed in this email is in BinaryPreOpt.hs.)

putBits suffers from the same problem, namely:

    writeFastMutInt (bit_off_r bh) 0
    writeFastMutInt (bit_cache_r bh) 0
    putByte bh (bit_cache .|. (bits `shiftL` bit_off))  -- won't call putBits

so we duplicate putByte to putByteNoBits and call that instead (the second sketch below shows roughly what the duplicate looks like). The results of profiling this version are in:

    bits_io_opt_p.prof

This change sped things up *significantly*, especially on reading. To get a full sense of where we are now, I rebuilt non-profiling versions (the bytes version should be unaffected by this change, but just to be sure I built it again). The results, averaged over three runs with 1m data elements, all IO:

    bits    read 2031   write 2582
    bytes   read 1694   write 2006

So reading is now ~20% slower and writing is about 28% slower. Not a huge improvement.

(Offhand note: SimonM, I just noticed that when I rebuilt this, it didn't rebuild SpeedTestBytes.hs even though Binary.hs had changed massively, resulting in a link error... I'll send you a separate email about this and see if I can reproduce it. Remind me if I forget :P.)

Looking at bits_io_opt_p.prof, we see that putByteNoBits really dominates the amount of time spent.
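To make the first change above concrete: the split I have in mind just renames the existing putBits body to unsafePutBits (with the argument checks deleted) and re-introduces putBits as a thin checked wrapper; getBits gets the same treatment. This is only a sketch -- I'm assuming here that putBits takes the bit count and the value as Ints, so the exact types may differ from what's in Binary.hs:

    -- checked entry point: validate once, then hand off to the
    -- unchecked worker (the old putBits body, renamed)
    putBits :: BinHandle -> Int -> Int -> IO ()
    putBits bh num_bits bits
      | bits < 0                      = error "putBits: value too small"
      | bits >= (1 `shiftL` num_bits) = error "putBits: value too big"
      | otherwise                     = unsafePutBits bh num_bits bits

    -- the default Binary instances call unsafePutBits/unsafeGetBits
    -- directly, since they always pass arguments that are in range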
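And to make the second change concrete, here is roughly what the duplicated writer looks like for the BinIO case. I'm guessing the field names (off_r for the byte offset and hdl for the Handle, as in the original Binary.hs); the real code is on the site above:

    -- putByteNoBits: putWord8 minus the bit-cache preamble, so it can
    -- never call back into putBits.  Only safe when the caller has
    -- already dealt with the bit state, as in the snippets above.
    putByteNoBits :: BinHandle -> Word8 -> IO ()
    putByteNoBits bh w = do
        ix <- readFastMutInt (off_r bh)            -- byte offset
        hPutChar (hdl bh) (chr (fromIntegral w))   -- raw byte out
        writeFastMutInt (off_r bh) (ix + 1)

    -- getByteNoBits is the mirror image: hGetChar plus the offset
    -- bump, with none of the bit bookkeeping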
I noticed that putWord8 and friends were calling readFastMutInt followed by writeFastMutInt (+1), so I changed this to incFastMutInt, though I don't think it will make much of a difference. This is about as trimmed down as putByteNoBits will get, so all the time it is taking must be from putChar. I think that all that's left in terms of slowness is the FastMutInts themselves. These are getting a much heavier pounding in the bit version, because the bit offset and bit cache are now FastMutInts. I don't suppose there's a faster way to implement these?
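(For what it's worth, the unboxed or I asked SimonM for above would presumably sit next to incFastMutIntBy in FastMutInt.hs and look something like the sketch below. orFastMutInt is just my name for it, and I'm assuming the one-element MutableByteArray# representation the module already uses; or# only exists at Word#, hence the coercions.)

    -- or a value into the cell without boxing anything in between;
    -- the primops come from GHC.Exts, which FastMutInt already uses
    orFastMutInt :: FastMutInt -> Int -> IO Int
    orFastMutInt (FastMutInt arr) (I# i) = IO $ \s ->
      case readIntArray# arr 0# s of { (# s1, n #) ->
      case word2Int# (int2Word# n `or#` int2Word# i) of { m ->
      case writeIntArray# arr 0# m s1 of { s2 ->
      (# s2, I# m #) }}}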
 - Hal

--
Hal Daume III                                   | hdaume@isi.edu
 "Computer science is no more about computers   | www.isi.edu/~hdaume
  than astronomy is about telescopes." -Dijkstra

On Thu, 19 Dec 2002, Simon Marlow wrote:

> > I tend to agree with Alastair.  I just reran the experiments with
> > BinIOs instead of BinMems to see if there was a greater difference
> > (I expect a much smaller difference, in fact, due to the high disk
> > overhead and the fact that putChar is slow).  I didn't bother with
> > -O0, but with -O2, we get:
> >
> >     bits    read 2156   write 2639
> >     bytes   read 1617   write 2078
> >
> > Frankly, this shocks me: reading using bits is 33% slower, and
> > writing is about 27% slower.  I actually expected the bits version
> > to be *faster* (relatively, of course) here, since we have much
> > smaller files (the byte-based file is 4200000 bytes, while the
> > bit-based file is 3325000 bytes).  With a 33% difference, I think I
> > might want separate instances again :).
>
> Try profiling.  It's hard to tell where the speed is going without
> seeing the code -- would you like to put it up somewhere so we can
> take a look?
>
> Remember that putByte and friends have been carefully tuned; I
> suspect we'll need to take a close look at the compiler output for
> your bit versions and tweak a few things.
>
> Cheers,
>       Simon