
In response to the two suggestions, I've linked my directory to:

    http://www.isi.edu/~hdaume/haskell/NewBinary

In there you will find FastMutInt, which is (almost) unchanged; Binary.hs, which contains the new bit-based stuff; and a few test programs. You will also find profiles for bits and bytes (only 100000 elements this time, but there's still a huge discrepancy between bits and bytes) in:

    bits_io_p.prof
    bytes_io_p.prof

One thing that immediately surprises me looking at the profile information is that the bits version allocates 200M more heap than the bytes version. I have a function in FastMutInt for or-ing a value into one; the comment next to it says "we should optimize this: ask SimonM :)", so I am asking SimonM now for a version that looks like incFastMutIntBy and hopefully uses an unboxed or. Nevertheless, this only accounts for 1% of the time, so it's probably not a big deal.

All the time is taken up in putBits and getBits, which both check their arguments to make sure you're not putting values that are too big or too small. One thing to do would be to make unsafePutBits and unsafeGetBits, which skip these checks; the default instances could then use those and we wouldn't have to perform so many checks (the first sketch below shows the split I mean). In fact, I commented the checks out and reran the profiled bit version; the results are in:

    bits_io_nc_p.prof

Doing this cuts the difference in time between bit-based ops and byte-based ops in half. (It introduces a small bug with seekBin, because it uses getBits bh 0 to align the bits; a quick when statement fixes this.)

Now, getBits seems to be calling writeFastMutInt too much, probably because of the last case in the long if statement, where we essentially cheat and call getByte:

    writeFastMutInt (bit_cache_r bh) 0
    writeFastMutInt (bit_off_r bh) 0
    bit_cache <- getByte bh
    writeFastMutInt (bit_off_r bh) (num_bits - leftover_bits)
    writeFastMutInt (bit_cache_r bh) (fromIntegral bit_cache)

The only reason we have to do the first two writes is to make sure that getByte doesn't recurse and call getBits. Unfortunately, this is probably the most common case, so I optimized it by essentially duplicating the getByte code in there: I copied getWord8 to getByteNoBits, which completely ignores the bit stuff and is thus very unsafe. (Note that the version of Binary from before the optimizations listed in this email is in BinaryPreOpt.hs.)

putBits suffers from the same problem, namely:

    writeFastMutInt (bit_off_r bh) 0
    writeFastMutInt (bit_cache_r bh) 0
    putByte bh (bit_cache .|. (bits `shiftL` bit_off))  -- won't call putBits

so we duplicate putByte to putByteNoBits and call that instead (the second sketch below shows roughly what the duplicate looks like). The results of profiling this version are in:

    bits_io_opt_p.prof

This change sped things up *significantly*, especially on reading. To get a full sense of where we are now, I rebuilt non-profiling versions (the bytes version should be unaffected by this change, but just to be sure I built it again). The results, averaged over three runs with 1m data elements, all IO:

    bits    read 2031   write 2582
    bytes   read 1694   write 2006

So reading is now ~20% slower and writing is about 28% slower. Not a huge improvement.

(Offhand note: SimonM, I just noticed that when I rebuilt this, it didn't rebuild SpeedTestBytes.hs even though Binary.hs had changed massively, resulting in a link error... I'll send you a separate email about this and see if I can reproduce it. Remind me if I forget :P.)

Looking at bits_io_opt_p.prof, we see that putByteNoBits really dominates the amount of time spent.
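To make the first change above concrete: the split I have in mind just renames the existing putBits body to unsafePutBits (with the argument checks deleted) and re-introduces putBits as a thin checked wrapper; getBits gets the same treatment. This is only a sketch -- I'm assuming here that putBits takes the bit count and the value as Ints, so the exact types may differ from what's in Binary.hs:

    -- checked entry point: validate once, then hand off to the
    -- unchecked worker (the old putBits body, renamed)
    putBits :: BinHandle -> Int -> Int -> IO ()
    putBits bh num_bits bits
      | bits < 0                      = error "putBits: value too small"
      | bits >= (1 `shiftL` num_bits) = error "putBits: value too big"
      | otherwise                     = unsafePutBits bh num_bits bits

    -- the default Binary instances call unsafePutBits/unsafeGetBits
    -- directly, since they always pass arguments that are in range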
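And to make the second change concrete, here is roughly what the duplicated writer looks like for the BinIO case. I'm guessing the field names (off_r for the byte offset and hdl for the Handle, as in the original Binary.hs); the real code is on the site above:

    -- putByteNoBits: putWord8 minus the bit-cache preamble, so it can
    -- never call back into putBits.  Only safe when the caller has
    -- already dealt with the bit state, as in the snippets above.
    putByteNoBits :: BinHandle -> Word8 -> IO ()
    putByteNoBits bh w = do
        ix <- readFastMutInt (off_r bh)            -- byte offset
        hPutChar (hdl bh) (chr (fromIntegral w))   -- raw byte out
        writeFastMutInt (off_r bh) (ix + 1)

    -- getByteNoBits is the mirror image: hGetChar plus the offset
    -- bump, with none of the bit bookkeeping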
I noticed that putWord8 and friends were calling readFastMutInt followed by writeFastMutInt (+1), so I changed this to incFastMutInt, though I don't think it will make much of a difference. This is about as trimmed down as putByteNoBits will get, so all the time it is taking must be from putChar. I think that all that's left in terms of slowness is the FastMutInts themselves. These are getting a much heavier pounding in the bit version, because the bit offset and bit cache are now FastMutInts. I don't suppose there's a faster way to implement these?
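(For what it's worth, the unboxed or I asked SimonM for above would presumably sit next to incFastMutIntBy in FastMutInt.hs and look something like the sketch below. orFastMutInt is just my name for it, and I'm assuming the one-element MutableByteArray# representation the module already uses; or# only exists at Word#, hence the coercions.)

    -- or a value into the cell without boxing anything in between;
    -- the primops come from GHC.Exts, which FastMutInt already uses
    orFastMutInt :: FastMutInt -> Int -> IO Int
    orFastMutInt (FastMutInt arr) (I# i) = IO $ \s ->
      case readIntArray# arr 0# s of { (# s1, n #) ->
      case word2Int# (int2Word# n `or#` int2Word# i) of { m ->
      case writeIntArray# arr 0# m s1 of { s2 ->
      (# s2, I# m #) }}}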
 - Hal

--
Hal Daume III                                   | hdaume@isi.edu
 "Computer science is no more about computers   | www.isi.edu/~hdaume
  than astronomy is about telescopes." -Dijkstra

On Thu, 19 Dec 2002, Simon Marlow wrote:

> > I tend to agree with Alastair.  I just reran the experiments with
> > BinIOs instead of BinMems to see if there was a greater difference
> > (I expect a much smaller difference, in fact, due to the high disk
> > overhead and the fact that putChar is slow).  I didn't bother with
> > -O0, but with -O2, we get:
> >
> >     bits    read 2156   write 2639
> >     bytes   read 1617   write 2078
> >
> > Frankly, this shocks me: reading using bits is 33% slower, and
> > writing is about 27% slower.  I actually expected the bits version
> > to be *faster* (relatively, of course) here, since we have much
> > smaller files (the byte-based file is 4200000 bytes, while the
> > bit-based file is 3325000 bytes).  With a 33% difference, I think I
> > might want separate instances again :).
>
> Try profiling.  It's hard to tell where the speed is going without
> seeing the code -- would you like to put it up somewhere so we can
> take a look?
>
> Remember that putByte and friends have been carefully tuned; I
> suspect we'll need to take a close look at the compiler output for
> your bit versions and tweak a few things.
>
> Cheers,
>       Simon