genBits: small addition to random API + largely new implementation that needs code review

[CC'ing some people involved in the relevant ticket discussions or showing up in git blame] Hello all, So it seems random was due for an overhaul. I would like to initiate a discussion period to talk about what changes should happen before the next major release. I think this is timely because there is *already* a pending backwards-incompatible change in the API (factoring out the SplittableGen class) http://hackage.haskell.org/trac/ghc/ticket/4314. We might as well make any other fixes now and make all the changes at once. For reference, I've attached a list of the open tickets I know about pertaining to System.Random at the end of this email. There implementation has two major pieces: (1) Sources of random bits (class RandomGen) (2) Instances of class Random that map use random bits to create Haskell types For now I've left (1) alone and have made heavy changes to part (2) in a branch to fix correctness and performance problems: https://github.com/haskell/random/tree/new_api These new changes require a small API change to RandomGen. The issue is that the legacy RandomGen API isn't very good at generating random *bits*. Rather, it next creates numbers in an arbitrary range: genRange. First of all, I would argue that -- completely aside from performance considerations -- this extra flexibility for RandomGens increases the likelihood of errors in the clients of the API. As supporting evidence I would cite tickets #5278 http://hackage.haskell.org/trac/ghc/ticket/5278 and #5280http://hackage.haskell.org/trac/ghc/ticket/5280 -- it looks like even use of the API in System/Random.hs itself was incorrect! *My proposed API addition/restriction:* I propose that we add a "genBits" function that reports how many bits of randomness are created by a generator. class RandomGen g where ... genBits :: g -> Int I added a default implementation of genBits that computes the answer based on genRange. This need not be a backwards-incompatible change. What about generators that have a genRange which is not a power of two? We could consider making genBits return a Maybe value to distinguish these. But I would prefer that we* *instead* restrict RandomGen instances, requiring that they generate a clean number of random bits*. This requires that genRange be (0,2^N-1) or (-2^N,2^N-1). But we can tolerate slight deviations; for example, in the new_api branchhttps://github.com/haskell/random/tree/new_api, stdGen has a genRange of (0,2^31 - 86). Probably close enough. (In any case, in the future I'd like to see stdGen replaced with something better anyway.) *An alternative would be to go even further and require every RandomGen instances to generate the full range of Ints.* I think any future replacement for stdGen will probably meet this additional requirement, though I would be interested in important generators that do not, and in any references on the original API design of RandomGen. As long as we have genBits I don't feel strongly about it but am interested in other opinions. Anyway, I've used genBits to reimplement the random and randomR methods. Let me know what you think. This new version should fix tickets #5278http://hackage.haskell.org/trac/ghc/ticket/5278, #5280 http://hackage.haskell.org/trac/ghc/ticket/5280, #5133http://hackage.haskell.org/trac/ghc/ticket/5133. But give me a code review and help me head off future new bugs! Regarding other tickets: #3575 http://hackage.haskell.org/trac/ghc/ticket/3575 and #3620http://hackage.haskell.org/trac/ghc/ticket/3620will have to wait for stdGen to get replaced. Whereas #427 http://hackage.haskell.org/trac/ghc/ticket/427 and #2280http://hackage.haskell.org/trac/ghc/ticket/2280 are complaints about the performance; they will also require a new stdGen to be adequately addressed. However, the current batch of changes *does* change the performance of some Random instances significantly. (See appendix.) There's plenty of room for improvement, and in particular, the new random approach would work better if #4102http://hackage.haskell.org/trac/ghc/ticket/4102 were implemented in Data.Bits. The excessive laziness problems are still there. (#4218http://hackage.haskell.org/trac/ghc/ticket/4218is still unfixed.) Perhaps that's something we can fix in this audit. I'd also be happy to hear other proposals for API changes and additions. System.Random can't depend on bytestring, so I doubt we want block RNG in the RandomGen class. I could, however, imagine having nextWord16, nextWord32, etc. Something worthwhile if we're doing an API audit is to look at some of the other Hackage packages. Here are some that I am familiar with: http://hackage.haskell.org/package/DRBG-0.2.2 http://hackage.haskell.org/package/mersenne-random-pure64 http://hackage.haskell.org/package/mwc-random http://hackage.haskell.org/package/intel-aes And some that I'm not: http://hackage.haskell.org/package/random-fu http://hackage.haskell.org/package/gsl-random http://hackage.haskell.org/package/cprng-aes Pardon the long email! Cheers, -Ryan *Appendix A: List of tickets* #5278 System.Random.randomIvalInteger makes invalid assumptions about RandomGen #5280 System.Random commits (rand `mod` base) error. #5133 Random instance for Float can generate values out of requested range #4218 System.Random is way too lazy #427 Random.StdGen slowness #2280 randomR too slow #3575 mkStdGen and split conspire to make some programs predictable #3620 The seeds generated by split are not independent libraries/rando #3352 Data.Word should define instances of Random for each type *Appendix B: Performance data* Here's an "S curve" showing the speedup or slowdown of the new code when generating different types of data with random or randomR (the R labeled ones are randomR). They range from from some massive speedups on floats and doubles as well as a hefty performance regression when Integer is restricted to a small range by randomR. (I'm looking into it.) speedup of new over old: new_api revision f85c6a55b731d5e380c1d6e880105c9933aa9d48 old, revision 130e421e912d394653a43c987be99 161.87843858371693 "Float" 120.29838090036381 "Float R" 54.66538723769602 "CDouble" 41.01849643134359 "CDouble R" 5.546651773234119 "Int" 5.250176145938512 "Integer" 2.4963231822924556 "Word16" 2.340365231384218 "Bool R" 2.2810769806267412 "Bool" 2.1576446209293216 "Double R" 2.059792819214518 "Double" 1.4283216783216783 "Integer R Big" 1.235562774008646 "Word16 R" * 1.008346493029544 "stdGen/next"* 0.713868706483777 "Int R" 0.6738365001610773 "Char R" 0.5974339051794343 "Char" 0.16546783206240073 "Integer R" The baseline stdGen/next hasn't been changed. Thus it's "speedup" is 1.0.

Ouch, pardon typos. I'm terrible with incomplete sentence edits, maybe I need STM to make them atomic ;-).
There implementation has two major pieces:
Rather, it next creates numbers in an arbitrary range: genRange. First of all, I
*An alternative would be to go even further and require every RandomGen instances to generate the full range of Ints.*

On Tue, Jun 28, 2011 at 02:58:00PM -0400, Ryan Newton wrote:
So it seems random was due for an overhaul. I would like to initiate a discussion period to talk about what changes should happen before the next major release. I think this is timely because there is *already* a pending backwards-incompatible change in the API (factoring out the SplittableGen class) http://hackage.haskell.org/trac/ghc/ticket/4314. We might as well make any other fixes now and make all the changes at once.
I believe the only reason we're shipping random with ghc is that haskell98 and haskell2010 depend on it. Perhaps we should fork random as old-random, only ship old-random (i.e. stop shipping random), but mark old-random as not maintained and encourage people to use random instead? That would make it a lot easier to make substantive changes to the random package, as we wouldn't have to keep adding compatibility hacks for the H98 and H2010 random libraries. This would probably imply the next Haskell' release not specifying System.Random. Any thoughts? Thanks Ian

Ian Lynagh
Ryan Newton wrote:
So it seems random was due for an overhaul. I would like to initiate a discussion period to talk about what changes should happen before the next major release. I think this is timely because there is *already* a pending backwards-incompatible change in the API (factoring out the SplittableGen class) http://hackage.haskell.org/trac/ghc/ticket/4314. We might as well make any other fixes now and make all the changes at once.
I believe the only reason we're shipping random with ghc is that haskell98 and haskell2010 depend on it. ... snip...
Perhaps we should fork random as old-random
This would probably imply the next Haskell' release not specifying System.Random.
Any thoughts?
H2010 already doesn't specify Random, unless I missed something. H98 needs patching already now that the Splittable patch was accepted (or did someone already do this?). I've never liked the "old-*" package renaming as that already introduces (minor) breakage and doesn't seem to encourage people to move forward. Why avoid using versioning and build-deps for this (adding a version bound is similar work to adding "old-")? Cheers, Thomas

On Tue, Jun 28, 2011 at 02:26:14PM -0700, Thomas DuBuisson wrote:
Ian Lynagh
wrote: Perhaps we should fork random as old-random
This would probably imply the next Haskell' release not specifying System.Random.
H2010 already doesn't specify Random, unless I missed something.
Ah, better still. random98 might be a better name, in that case.
I've never liked the "old-*" package renaming as that already introduces (minor) breakage and doesn't seem to encourage people to move forward. Why avoid using versioning and build-deps for this (adding a version bound is similar work to adding "old-")?
Mainly because I like to only have a single version of each package. Just having an old version of random is another option. Thanks Ian

Ryan Newton
There implementation has two major pieces: (1) Sources of random bits (class RandomGen) (2) Instances of class Random that map use random bits to create Haskell types
I'd say there are three pieces, initial entropy (clock, external seed, system crypto random generator), deterministic generator interface (PRNG, the RandomGen class), and the instances of that class. I tried to answer the first item in System.Crypto.Random. The 'random' package never really had an answer for entropy and I'm not sure what the community thinks about that. Perhaps answering this problem in slightly obscure packages is OK.
The issue is that the legacy RandomGen API isn't very good at generating random bits.
Agreed. This is why CryptoRandomGen uses ByteString (so we can generate any number of BYTES of random values). As I tried to argue when I make RandomGen more polymorphic, this is an issue with hardcoding the type as Int. Unfortunately, people didn't accept that alteration and I feel continuing to generating Int while allowing devs to discover the amount of entropy isn't taking things far enough.
I'd also be happy to hear other proposals for API changes and additions.
As I say below, I don't understand why we can't use ByteString. There are (or should be) rather fast decodings of all the popular primitive types from bytestrings, so this takes care of our polymorphism issue. I accept that, unlike the CryptoRandomGen, we don't want an explicit failure for RandomGen. But how about a common interface for instantiating generators ('newGen')? Is there a reason not to have that? --- CODE --- class RandomGen g where next :: g -> (Int, g) nextBits :: g -> BitLength -> (ByteString, g) genBits :: g -> Int newGen :: ByteString -> g --- END CODE --- Also, perhaps we could still have an explicit reseed? --- CODE --- class Reseedable g where reseed :: g -> Bytestring -> g --- END CODE --- I'll stop here before I entirely recreate CryptoRandomGen, but without the explicit errors and in more classes (the next step would be a method for querying how much entropy is needed for instantiation).
System.Random can't depend on bytestring, so I doubt we want block RNG in the RandomGen class.
Why can't it? System.Ranomd isn't part of H2010 and H98 already needs its own random module. I'll try to make time to review the code, you'll hear if I do. Cheers, Thomas

Well, my main motivation in the work over the last few days was to fix correctness problems in System.Random (and performance as secondary). I think this matters whether or not it is shipped only for Haskell 98 or not. I've been focused mainly on the Random instances. Does anyone know good points of comparison for these? I've seen lots of PRNGs on Hackage but not lots of code that maps that randomness onto the usual types (Double, Integer, Char, etc). My hope is to improve random to the point that it provides a good, simple default library. From a pedagogic perspective I imagine many new users want randomness but don't want to scour hackage and learn the more complex interfaces. So as long as 'random' is a clear default I am happy, whether or not it's part of the GHC distro or Haskell Platform. In my mind a good default implementation has decent-to-good performance, generates uniform distributions (#5278, #5280), and has statistically sound splitting. If we had a those three things maybe there would not have been as much motivation to provide new RNGs! *Re "nextBits":* Thomas, could we flesh out the performance argument for bulk random bit generation? It seems to me that if we do a good enough job with Data.Bits that an interface that produces a word at a time is not that bad in terms of throughput. (On the bit producer side, producing in bulk and then doling out word-at-a-time seemed to work pretty well for me when I was testing it in intel-aes.) The biggest problem I see is that a kind of redundant buffering could occur if both the bit-producer and the bit-consumer want bulk randomness but have to communicate it through individual words. That said, if bytestring dependency is ok -- why not? It then becomes a backwards-compatible change that may open an extra opportunity to some RNGs. *Re "newGen":* I like Thomas's interface and I've been thinking about the "newGen" issue as well: newGen :: ByteString -> g Here ByteString does seem most appropriate, anything else -- Int or [Int] -- seems unsatisfactory. Just for arguments sake I'd like to reexamine the "SplittableGen" topic in light of "newGen": http://www.haskell.org/pipermail/libraries/2010-September/014252.html It seems that with newGen we could provide a default implementation of split and keep it in RandomGen. Yes, it would be poor. But it would mean that things like Mersenne don't have to throw an error. Further, if stdGen were to, say, provide cryptographically strong RNG with sound splitting -- would the quality of the default split be as much of a concern? Someone who chooses anything other than stdGen in that case would have to have a good reason for doing so, and I think can be saddled with the consequences. Adding newGen instead of subtracting split would not be Haskell-98 backwards compatible. However, it would create a *different* kind of breakage than subtracting split. Subtracting split breaks the client code, whereas adding newGen just breaks the instances of RandomGen (hopefully few). Don't we want to be pushing splittable tree RNG *more* in the future? GHC for parallelism, right!? Let me give an example of the feature's value -- the Cilk folks are making significant changes to their runtime just to get deterministic parallel RNG: http://groups.csail.mit.edu/sct/wiki/index.php?title=Other_Projects#Determin... My feeling is that if we could get the best of what's out there into one library with a simple interface and made it default that it would have a positive impact and avoid unnecessary balkanization and do-it-yourself in the random number department. In order to accomplish this maybe we would need two stdGens. One maximally fast and one maximally sound -- perhaps Mersenne Twister and crypto-based splittable RNG? By the way, perhaps that "genBits" method should probably be rename "genRangeBits"? -Ryan On Tue, Jun 28, 2011 at 6:01 PM, Thomas DuBuisson < thomas.dubuisson@gmail.com> wrote:
Ryan Newton
wrote: There implementation has two major pieces: (1) Sources of random bits (class RandomGen) (2) Instances of class Random that map use random bits to create Haskell types
I'd say there are three pieces, initial entropy (clock, external seed, system crypto random generator), deterministic generator interface (PRNG, the RandomGen class), and the instances of that class.
I tried to answer the first item in System.Crypto.Random. The 'random' package never really had an answer for entropy and I'm not sure what the community thinks about that. Perhaps answering this problem in slightly obscure packages is OK.
The issue is that the legacy RandomGen API isn't very good at generating random bits.
Agreed. This is why CryptoRandomGen uses ByteString (so we can generate any number of BYTES of random values).
As I tried to argue when I make RandomGen more polymorphic, this is an issue with hardcoding the type as Int. Unfortunately, people didn't accept that alteration and I feel continuing to generating Int while allowing devs to discover the amount of entropy isn't taking things far enough.
I'd also be happy to hear other proposals for API changes and additions.
As I say below, I don't understand why we can't use ByteString. There are (or should be) rather fast decodings of all the popular primitive types from bytestrings, so this takes care of our polymorphism issue.
I accept that, unlike the CryptoRandomGen, we don't want an explicit failure for RandomGen. But how about a common interface for instantiating generators ('newGen')? Is there a reason not to have that?
--- CODE --- class RandomGen g where next :: g -> (Int, g) nextBits :: g -> BitLength -> (ByteString, g) genBits :: g -> Int newGen :: ByteString -> g --- END CODE ---
Also, perhaps we could still have an explicit reseed?
--- CODE --- class Reseedable g where reseed :: g -> Bytestring -> g --- END CODE ---
I'll stop here before I entirely recreate CryptoRandomGen, but without the explicit errors and in more classes (the next step would be a method for querying how much entropy is needed for instantiation).
System.Random can't depend on bytestring, so I doubt we want block RNG in the RandomGen class.
Why can't it? System.Ranomd isn't part of H2010 and H98 already needs its own random module.
I'll try to make time to review the code, you'll hear if I do.
Cheers, Thomas

Good point about the entropy source. I like how the mersenne-random-pure package for instance provides an IO function to get a RNG from the system clock: http://hackage.haskell.org/packages/archive/mersenne-random-pure64/0.2.0.3/d... I think it makes sense to have a similar call both as a place to put better entropy sources for certain platforms, and to make it maximally easy for the user to do the right thing. (I imagine there are many instances of people using a constant -- and probably one with lots of zeros -- to initialize mkStdGen). -Ryan On Tue, Jun 28, 2011 at 6:01 PM, Thomas DuBuisson < thomas.dubuisson@gmail.com> wrote:
Ryan Newton
wrote: There implementation has two major pieces: (1) Sources of random bits (class RandomGen) (2) Instances of class Random that map use random bits to create Haskell types
I'd say there are three pieces, initial entropy (clock, external seed, system crypto random generator), deterministic generator interface (PRNG, the RandomGen class), and the instances of that class.
I tried to answer the first item in System.Crypto.Random. The 'random' package never really had an answer for entropy and I'm not sure what the community thinks about that. Perhaps answering this problem in slightly obscure packages is OK.
The issue is that the legacy RandomGen API isn't very good at generating random bits.
Agreed. This is why CryptoRandomGen uses ByteString (so we can generate any number of BYTES of random values).
As I tried to argue when I make RandomGen more polymorphic, this is an issue with hardcoding the type as Int. Unfortunately, people didn't accept that alteration and I feel continuing to generating Int while allowing devs to discover the amount of entropy isn't taking things far enough.
I'd also be happy to hear other proposals for API changes and additions.
As I say below, I don't understand why we can't use ByteString. There are (or should be) rather fast decodings of all the popular primitive types from bytestrings, so this takes care of our polymorphism issue.
I accept that, unlike the CryptoRandomGen, we don't want an explicit failure for RandomGen. But how about a common interface for instantiating generators ('newGen')? Is there a reason not to have that?
--- CODE --- class RandomGen g where next :: g -> (Int, g) nextBits :: g -> BitLength -> (ByteString, g) genBits :: g -> Int newGen :: ByteString -> g --- END CODE ---
Also, perhaps we could still have an explicit reseed?
--- CODE --- class Reseedable g where reseed :: g -> Bytestring -> g --- END CODE ---
I'll stop here before I entirely recreate CryptoRandomGen, but without the explicit errors and in more classes (the next step would be a method for querying how much entropy is needed for instantiation).
System.Random can't depend on bytestring, so I doubt we want block RNG in the RandomGen class.
Why can't it? System.Ranomd isn't part of H2010 and H98 already needs its own random module.
I'll try to make time to review the code, you'll hear if I do.
Cheers, Thomas

On 2011-06-28 14:58, Ryan Newton wrote:
These new changes require a small API change to RandomGen. The issue is that the legacy RandomGen API isn't very good at generating random */bits/*. Rather, it next creates numbers in an arbitrary range: genRange. ... I propose that we add a "genBits" function that reports how many bits of randomness are created by a generator.
class RandomGen g where ... genBits :: g -> Int ... What about generators that have a genRange which is not a power of two?
My own feeling is against this change. That's not based on a lot. I once coded up a mechanism that could convert a stream of random numbers from a given sequence of ranges, to another stream of random numbers for ranges as demanded. No randomness was wasted. Nice, clean fun. I don't claim it was particularly efficient. Having worked with that, using bits to organize randomness seems arbitrary.

On June 28, 2011 20:31:23 Scott Turner wrote:
On 2011-06-28 14:58, Ryan Newton wrote:
I propose that we add a "genBits" function that reports how many bits of randomness are created by a generator.
class RandomGen g where
... genBits :: g -> Int
...
What about generators that have a genRange which is not a power of two?
My own feeling is against this change. That's not based on a lot. I once coded up a mechanism that could convert a stream of random numbers from a given sequence of ranges, to another stream of random numbers for ranges as demanded. No randomness was wasted. Nice, clean fun. I don't claim it was particularly efficient.
Having worked with that, using bits to organize randomness seems arbitrary.
I created a system that is like much like what you did. It was based around uniform draws on variable intervals and splitting and combining them. The key component was the following function (source',source_bound', value',value_bound') = combine(source,source_bound, value,value_bound, range) which is a uniform random integer combiner, where source, source', value, and value' are uniform random integers such that 0 <= source < source_bound 0 <= source' < source_bound' <= source_bound 0 <= value < value_bound <= range 0 <= value' < value_bound' <= range What it does is shift randomness out of source and into value. This would increase the source_bound and decrease value_bound. To generate a uniform number on an arbitrary interval, you just call it repeatedly until source_bound reached your desired bound, refreshing value with a new random number from your generator between calls anytime value_bound dropped to 1. The interesting thing about this is the only constraint on the underlying generator is it had to generate uniform numbers. The interval they were generated on didn't matter at all. They could even change with each call. It is also interesting in that, properly coded, it can waste no randomness by putting any left over randomness back into value for the next round. You can also code it up pretty efficiently. It is basically all just integer multiplication (combining intervals) and division and remainder operations (splitting intervals). With a bit of care you can also isolate the source and value types (e.g., source could be 8 bit and value could be 32bit). Still, for a very cheap source generating on a larger interval than you desired one, it likely wouldn't compete with the classic modulus, test, and throw away and repeat if not acceptable approach. Way more flexible though. Cheers! -Tyson PS: combine is pretty easy to define in terms of (source',source_bound', value',value_bound') = split(source,source_bound, value_bound) which is a uniform random integer splitter, where source, source', and value' are uniform random integers are such that 0 <= source < source_bound 0 <= source' < source_bound' <= source_bound 0 <= value' < value_bound' <= value_bound What it does is split off a uniform draw on an interval up to the one requested form value. I can provide full pseudo code if anyone wants.

Still, for a very cheap source generating on a larger interval than you desired one, it likely wouldn't compete with the classic modulus, test, and throw away and repeat if not acceptable approach. Way more flexible though.
This was educational. I didn't know that approach was "classic". I used it because it seemed intuitive and I checked and saw that GMP does the same thing for *mpz_urandomm*. But I like the idea of using that leftover randomness!
participants (5)
-
Ian Lynagh
-
Ryan Newton
-
Scott Turner
-
Thomas DuBuisson
-
Tyson Whitehead