
Hi Cafe,
I've now included Gladman's efficient, portable C implementation of AES for
generating random numbers, and the hardware-accelerated version is working
too. Even the software-based version is currently showing higher throughput
than the built-in stdGen:
First, timing with System.Random interface:
    13,051,964 random ints generated [System.Random stdGen] ~ 252 cycles/int
        15,635 random ints generated [PureHaskell/reference] ~ 210,763 cycles/int
        31,159 random ints generated [PureHaskell] ~ 105,757 cycles/int
     2,180,488 random ints generated [Gladman inefficient] ~ 1,511 cycles/int
    15,015,095 random ints generated [Gladman] ~ 219 cycles/int
That seems like a good argument for cryptographic RNGs to me!
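For anyone who hasn't looked at the code, the basic shape is a counter-mode
generator sitting behind the standard RandomGen class. Below is a minimal,
self-contained sketch of that shape -- not the package's actual code -- with
encryptBlock standing in as a cheap placeholder for one AES encryption of the
counter under a fixed key:

    import Data.Bits (shiftR, xor)
    import Data.Word (Word64)
    import System.Random (RandomGen (..), randoms)

    -- Placeholder for AES_k(counter).  A real generator would call the AES
    -- core (Gladman's C code or the AESNI path) here; this cheap mixer only
    -- exists so the sketch compiles and runs on its own.
    encryptBlock :: Word64 -> Word64 -> Word64
    encryptBlock key ctr =
      let x = (ctr `xor` key) * 0x2545F4914F6CDD1D
      in  x `xor` (x `shiftR` 32)

    -- Counter-mode generator: the state is just a key and a block counter.
    data CTRGen = CTRGen !Word64 !Word64

    instance RandomGen CTRGen where
      next  (CTRGen k c) = (fromIntegral (encryptBlock k c), CTRGen k (c + 1))
      -- Splitting derives a fresh key for the child stream; this is where a
      -- real cipher (rather than a toy mixer) earns its keep.
      split (CTRGen k c) = (CTRGen k (c + 1), CTRGen (encryptBlock k c) 0)

    main :: IO ()
    main = print (take 3 (randoms (CTRGen 0xfeedfacedeadbeef 0) :: [Int]))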
I'm having a lot of trouble getting cabal to build/install it. You can see
what I've got in the repository now; I'd be interested to know whether anyone
else can build it successfully. It should work -- but only because it builds
the assembly code into a .so and assumes the build directory is
/opt/intel-aes ;-).
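Concretely, the hack boils down to something like this in the .cabal file
(extra-libraries and extra-lib-dirs are standard Cabal fields; the module and
library names below are just placeholders, and the hard-coded path is exactly
the ugly part):

    library
      exposed-modules:  Codec.Encryption.IntelAES
      build-depends:    base, random
      -- link against the .so built from the assembly code, which currently
      -- lives only in the hard-coded build directory:
      extra-libraries:  intel_aes
      extra-lib-dirs:   /opt/intel-aes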
I don't have real numbers for the hardware version yet, because the Westmere
machine I'm logged into runs Red Hat 5.4 and gives me "GLIBC_2.7 not found"
errors. You can run it for correctness purposes using SDE (the Software
Development Emulator), an emulation tool based on dynamic binary translation:
http://software.intel.com/en-us/articles/intel-software-development-emulator
-Ryan
P.S. Checkout command:
git clone git://github.com/rrnewton/intel-aes.git
On Sat, Jan 29, 2011 at 8:52 AM, Ryan Newton wrote:
perhaps performance? Is this approach less robust with a faster,
non-cryptographic RNG?
Yes, I don't understand that either. Is there a reason that using a weaker PRNG in this case is WORSE than using it in the non-splitting case? Is that why there is more of an impetus to use the cryptographic approach in this case?
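Just so we're talking about the same thing, this is the splitting pattern I
mean, using nothing but the stock System.Random API: each branch of a
computation gets its own generator deterministically, and the worry is that a
weak generator's split streams may end up correlated with one another.

    import System.Random (StdGen, mkStdGen, randoms, split)

    -- Give two independent sub-computations their own generators; the result
    -- is the same regardless of how the two branches are scheduled.
    twoStreams :: StdGen -> ([Int], [Int])
    twoStreams g =
      let (gLeft, gRight) = split g
      in  (take 5 (randoms gLeft), take 5 (randoms gRight))

    main :: IO ()
    main = print (twoStreams (mkStdGen 42))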
Anyway, taking for granted that the Burton approach is a useful thing to have
implemented, I started developing a package for this stuff -- an AES-based RNG
including both a Haskell implementation and a wrapper around an AESNI-based C
one. I haven't posted it to Hackage yet, but you can find the git repository
here:
https://github.com/rrnewton/intel-aes
If you build with cabal and run the benchmark-intel-aes-rng executable, it will give you a breakdown like this:
How many random numbers can we generate in a second on one thread?
  Cost of rdtsc (ffi call): 83
  Approx getCPUTime calls per second: 205,640
  Approx clock frequency: 3,306,891,339
  First, timing with System.Random interface:
    193,178,901 random ints generated [constant zero gen]
     14,530,358 random ints generated [System.Random stdGen]
         16,346 random ints generated [BurtonGenSlow/reference]
         32,965 random ints generated [BurtonGenSlow]
  Comparison to C's rand():
    118,766,285 random ints generated [rand/store in C loop]
    114,668,028 random ints generated [rand / Haskell loop]
    114,675,116 random ints generated [rand/store Haskell]
At the moment this is Haskell-only; I haven't included the wrapped Intel
assembly code library yet. As you can see, the pure-Haskell AES-based RNG
(BurtonGenSlow) is pretty slow.
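For context on what those counts measure: each line is roughly "how many Ints
the generator produced in about one second of wall-clock time". A
stripped-down sketch of that measurement loop (not the actual
SimpleRNGBench.hs code, which also reads the cycle counter via rdtsc) looks
like this:

    {-# LANGUAGE BangPatterns #-}
    import Data.Time.Clock (diffUTCTime, getCurrentTime)
    import System.Random (RandomGen, mkStdGen, next)

    -- Pull n Ints out of the generator, forcing each one.
    burn :: RandomGen g => Int -> g -> g
    burn 0 g = g
    burn n g = let (x, g') = next g in x `seq` burn (n - 1) g'

    -- Count how many Ints we can draw in roughly one second, checking the
    -- clock only every 10,000 numbers so the timing overhead stays small.
    countPerSecond :: RandomGen g => g -> IO Int
    countPerSecond g0 = do
      start <- getCurrentTime
      let batch = 10000
          loop !total g = do
            now <- getCurrentTime
            if diffUTCTime now start >= 1
              then return total
              else loop (total + batch) (burn batch g)
      loop 0 g0

    main :: IO ()
    main = countPerSecond (mkStdGen 23) >>= print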
Would anyone else be interested in running those RNG testing tools (Diehard,
BigCrush) on this to make sure it is working correctly?
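One low-effort way to do that, if someone wants to try, is to dump the
generator's raw output to stdout and pipe it into dieharder; a sketch follows
(the raw-stdin flag is taken from dieharder's documentation, so double-check
it against your local version):

    import qualified Data.ByteString.Builder as B
    import Data.Word (Word32)
    import System.IO (stdout)
    import System.Random (mkStdGen, randoms)

    -- Write an endless stream of raw 32-bit words to stdout; the Builder
    -- machinery keeps memory use constant.  Intended usage:
    --
    --   ./dump-randoms | dieharder -a -g 200
    main :: IO ()
    main = B.hPutBuilder stdout
             (foldMap B.word32LE (randoms (mkStdGen 23) :: [Word32]))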
Also, I'd be happy if anyone with a performance-oriented eye would like to
take a look at what's going on, both for the sake of the serial performance
(above) and because the parallel performance is currently *mysterious* (see
below).
I figure one of the main reasons for a splittable RNG is deterministic
parallel computation, so it's important that all threads be able to run the
RNG efficiently. Right now, if you look at SimpleRNGBench.hs, I'm just running
the same RNG on numCapabilities threads (a stripped-down sketch of that test
follows the list below). Yet even with that simple test I'm running into
problems, summarized thus:
 * I get substantial (3X) variance in program performance on consecutive runs.
 * I see a minor hit in performance from adding -threaded, but going from -N1
   to -N4 (even with a single thread of work) yields a big hit in performance
   and an increase in variance.
 * -N4 with four actual threads of work is actually pretty good for the pure
   Haskell version. All four threads on my 3.33GHz Nehalem can maintain 93% of
   their serial-case throughput. BUT the variance problem persists.
 * I run a busy-wait loop that measures CPU frequency... and this seems to get
   messed up in threaded mode (even with -qm -qa). I don't know why.
 * I cannot killThread a Haskell thread (forkIO or forkOS) that is currently
   running a divergent FFI call (safe or unsafe). (See "time_c".)
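For reference, here is a stripped-down sketch of the per-thread test I mean --
not SimpleRNGBench.hs itself, just its basic shape: fork numCapabilities
threads, run the same generator loop on each, and wait for all of them.

    {-# LANGUAGE BangPatterns #-}
    import Control.Concurrent (forkIO)
    import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
    import Control.Monad (forM, forM_)
    import GHC.Conc (numCapabilities)
    import System.Random (mkStdGen, next)

    -- Burn through a fixed number of randoms, returning a checksum so the
    -- loop cannot be optimized away.
    work :: Int -> Int
    work seed = go (1000000 :: Int) (mkStdGen seed) 0
      where
        go 0 _ !acc = acc
        go n g !acc = let (x, g') = next g in go (n - 1) g' (acc + x)

    main :: IO ()
    main = do
      dones <- forM [1 .. numCapabilities] $ \i -> do
                 mv <- newEmptyMVar
                 _  <- forkIO (putMVar mv $! work i)
                 return mv
      forM_ dones (\mv -> takeMVar mv >>= print)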
You can find the details in the DEVLOG here:
https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG
Let me know if you have any ideas. I'm going to leave the Haskell version as
it is and focus on wrapping the Intel asm (which has a permissive license).
Cheers,
-Ryan
P.S. Regarding this benchmarking -- would it be appropriate to use Criterion for this? Or is it sufficient to measure aggregate throughput as I've been doing?
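If Criterion does turn out to be the better tool, I imagine the benchmark
would end up looking something like the sketch below, which times a fixed
amount of work per sample instead of counting draws per second (the names
here are placeholders):

    import Criterion.Main (bench, defaultMain, whnf)
    import System.Random (RandomGen, mkStdGen, next)

    -- Sum a fixed number of draws so there is a single value to force.
    sumRandoms :: RandomGen g => Int -> g -> Int
    sumRandoms n0 g0 = go n0 g0 0
      where
        go 0 _ acc = acc
        go n g acc = let (x, g') = next g in go (n - 1) g' $! acc + x

    main :: IO ()
    main = defaultMain
      [ bench "stdGen: 100k Ints" (whnf (sumRandoms 100000) (mkStdGen 23)) ]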