perhaps performance? Is this approach less robust with a faster,
non-cryptographic RNG?

Yes, I don't understand that either. Is there a reason that using a weaker PRNG in this case is WORSE than using it in the non-splitting case? Is that why there is more of an impetus to use the cryptographic approach in this case?

Anyway, taking for granted that the Burton approach is a useful thing to have implemented, I started developing a package for this stuff -- AES based RNG including both a haskell implementation and wrapping an AESNI-based C one . I haven't posted it to Hackage yet, but you can find the git repository here:

    https://github.com/rrnewton/intel-aes

If you build with cabal and run the benchmark-intel-aes-rng executable, it will give you a breakdown like this:

    How many random numbers can we generate in a second on one thread?
      Cost of rdtsc (ffi call):    83
      Approx getCPUTime calls per second: 205,640
      Approx clock frequency: 3,306,891,339
      First, timing with System.Random interface:
   193,178,901 random ints generated [constant zero gen]
      14,530,358 random ints generated [System.Random stdGen]
    16,346 random ints generated [BurtonGenSlow/reference]
            32,965 random ints generated [BurtonGenSlow]
      Comparison to C's rand():
   118,766,285 random ints generated [rand/store in C loop]
   114,668,028 random ints generated [rand / Haskell loop]
   114,675,116 random ints generated [rand/store Haskell]

At the moment this is Haskell-only, I haven't included the wrapped Intel assembly code library yet. As you can see, the pure-Haskell AES based RNG (BurtonGenSlow) is pretty slow.

Would anyone else be interested in running those RNG testing tools (diehard, big crush) on this to make sure it is working correctly?

Also I'd be happy if anyone with a performance-oriented-eye would like to take a look at what's going on. Both for the sake of the serial performance (above) and because the parallel performance is currently mysterious (see below).

I figure one of the main reasons for splittable RNG is deterministic parallel computations. Thus it's important that all threads be able to run the RNG efficiently. Right now, if you look at SimpleRNGBench.hs I'm just running the same RNG on numCapabilities threads. Yet with that simple test I'm running into problems, summarized thus:

* I get substantial (3X) variance in program performance on consecutive runs.
* I see a minor hit in performance adding -threaded, but going from -N1 to -N4 (even with a single-thread of work) yields a big hit in performance and increase in variance.
* -N4 with four actual threads of work is actually pretty good for the pure haskell version. All four threads on my nehalem 3.33ghz can maintain 93% of their throughput in the serial case. BUT the variance problem persists.
* I run a busy-wait loop that measures cpu frequency... and this seems to get messed up in threaded mode (even with -qm -qa). I don't know why.
* I cannot killThread a haskell thread (forkIO or forkOS) that is currently running a divergent FFI call (safe or unsafe). (See "time_c".)

You can find the details in the DEVLOG here:

   https://github.com/rrnewton/intel-aes/blob/master/CHANGELOG

Let me know if you have any ideas. I'm going to leave the Haskell version how it is and focus on wrapping the Intel asm (which has a permissive license).

Cheers,
-Ryan

P.S. Regarding this benchmarking -- would it be appropriate to use Criterion for this? Or is it sufficient to measure aggregate throughput as I've been doing?