
Quoting the Haddock documentation for Data.Bits.bitSize:

"Return the number of bits in the type of the argument. The actual value of the argument is ignored. The function bitSize is undefined for types that do not have a fixed bitsize, like Integer."

Does anybody else think it would be *far* more useful if bitSize applied to an Integer would tell you how many bits that particular Integer is using? Especially given that it can vary?

Is there a way to actually determine how many bits are in an Integer?

On Thursday 25 August 2011, 19:57:37, Andrew Coppin wrote:
Quoting the Haddock documentation for Data.Bits.bitSize:
"Return the number of bits in the type of the argument. The actual value of the argument is ignored. The function bitSize is undefined for types that do not have a fixed bitsize, like Integer."
Does anybody else think it would be *far* more useful if bitSize applied to an Integer would tell you how many bits that particular Integer is using? Especially given that it can vary?
I'm not sure about that.
Is there a way to actually determine how many bits are in an Integer?
Sure. The exact method depends on what result you want and which integer-* package you use. You have to handle 0 and negative n first (how to treat negative n is not obvious). Then, for n > 0, if you want 'index of the highest set bit, plus 1', then 1 + integerLogBase 2 n does the trick (integerLogBase is available from GHC.Float; as of 7.2.1 it's decently fast). If you want 'WORD_SIZE_IN_BITS * number of words used', you could use the above and round up to the next multiple of WORD_SIZE_IN_BITS, or you could make use of the representation of Integers; with integer-gmp,

    data Integer = S# Int# | J# Int# ByteArray#

so

    usedWords (S# _)    = 1
    usedWords (J# s# _) = I# s#

(Nota bene: I don't think it's positively guaranteed that the highest-order word in a (J# _ _) is nonzero, so usedWords could give a higher answer than rounding up from integerLogBase 2 n.)
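A self-contained version of the sketch above might look like this (an editorial sketch, assuming GHC 7.2 with integer-gmp, whose GHC.Integer.GMP.Internals exposed the S# and J# constructors at the time; bitCount is a hypothetical name):

    {-# LANGUAGE MagicHash #-}
    import GHC.Float (integerLogBase)
    import GHC.Integer.GMP.Internals (Integer(S#, J#))
    import GHC.Exts (Int(I#))

    -- index of the highest set bit, plus one (only for n > 0)
    bitCount :: Integer -> Int
    bitCount n = 1 + integerLogBase 2 n

    -- number of machine words the representation currently uses (n > 0);
    -- may overcount, per the caveat about the highest-order word above
    usedWords :: Integer -> Int
    usedWords (S# _)    = 1
    usedWords (J# s# _) = I# s#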

On 11-08-25 01:57 PM, Andrew Coppin wrote:
Does anybody else think it would be *far* more useful if bitSize applied to an Integer would tell you how many bits that particular Integer is using? Especially given that it can vary?
It is useful to know the number of bits used in a value. It is useful to know the maximum number of bits storable in a type. It is useless to conflate the two into one single method. The name "bitSize" is already taken. Come up with another name, and I will agree with you.

And as Daniel mentioned earlier, it's not at all obvious what we mean by
"bits used" when it comes to negative numbers. GMP pretends that any
negative number has infinite bits in the two's-complement representation.
Should we count the highest _unset_ bit when the number is negative?
Or to put it another way, what invariants should the proposed bit-counting
function satisfy?
On Thu, Aug 25, 2011 at 6:32 PM, Albert Y. C. Lai
On 11-08-25 01:57 PM, Andrew Coppin wrote:
Does anybody else think it would be *far* more useful if bitSize applied to an Integer would tell you how many bits that particular Integer is using? Especially given that it can vary?
It is useful to know the number of bits used in a value.
It is useful to know the maximum number of bits storable in a type.
It is useless to conflate the two into one single method.
The name "bitSize" is already taken. Come up with another name, and I will agree with you.

On 26/08/2011 02:40 AM, Daniel Peebles wrote:
And as Daniel mentioned earlier, it's not at all obvious what we mean by "bits used" when it comes to negative numbers.
I guess part of the problem is that the documentation asserts that bitSize will never depend on its argument. (So one can write things like bitSize (undefined :: ThreadID) or similar.)

I can think of several possible results one might want from a bit size query:

1. The number of bits of precision which will be kept for values of this type. (For Word16, this is 16. For Integer, this is [almost] infinity.)

2. The amount of RAM that this value is using up. (But that would surely be measured in bytes, not bits. And processor registers make the picture more complicated.)

3. The bit count to the most significant bit, ignoring sign.

4. The bit count to the sign bit.

Currently, bitSize implements #1. I'm not especially interested in #2. I would usually want #3 or #4.

Consider the case of 123 (decimal). The 2s complement representation of +123 is

    ...0000000000000001111011

The 2s complement representation of -123 is

    ...1111111111111110000101

For query #3, I would expect both +123 and -123 to yield 7. For query #4, I would expect both to yield 8. (Since if you truncate both of those strings to 8 bits, then the positive value starts with 0, and the negative one starts with 1.)

Then of course, there's the difference between "count of the bits" and "bit index", which one might expect to be zero-based. (So that the Nth bit represents 2^N.)
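An editorial sketch of queries #3 and #4 for Integer, with hypothetical names msbCount and signBitCount; the naive loop is just for illustration, a real version would use an integer logarithm:

    -- #3: bits up to and including the most significant bit, ignoring sign
    msbCount :: Integer -> Int
    msbCount n = length (takeWhile (> 0) (iterate (`div` 2) (abs n)))

    -- #4: bits up to and including the sign bit
    -- (minimal two's-complement width)
    signBitCount :: Integer -> Int
    signBitCount n
      | n >= 0    = msbCount n + 1
      | otherwise = msbCount (negate n - 1) + 1

    -- msbCount 123 == 7,  msbCount (-123) == 7
    -- signBitCount 123 == 8,  signBitCount (-123) == 8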

On Friday 26 August 2011, 19:24:37, Andrew Coppin wrote:
On 26/08/2011 02:40 AM, Daniel Peebles wrote:
And as Daniel mentioned earlier, it's not at all obvious what we mean by "bits used" when it comes to negative numbers.
I guess part of the problem is that the documentation asserts that bitSize will never depend on its argument. (So one can write things like bitSize (undefined :: ThreadID) or similar.)
I don't think that's a problem, it's natural for what bitSize does. And it would be bad if bitSize did something different for Integer than for Int, Word, ...
I can think of several possible results one might want from a bit size query:
Yup, though for some of them there would be better names than bitSize.
1. The number of bits of precision which will be kept for values of this type. (For Word16, this is 16. For Integer, this is [almost] infinity.)
Not "almost infinity", what your RAM or Int allow, whichever cops out first, or "enough, unless you [try to] do really extreme stuff".
2. The amount of RAM that this value is using up. (But that would surely be measured in bytes, not bits. And processor registers make the picture more complicated.)
3. The bit count to the most significant bit, ignoring sign.
4. The bit count to the sign bit.
Currently, bitSize implements #1. I'm not especially interested in #2. I would usually want #3 or #4.
I'd usually be more interested in #2 than in #4.
Consider the case of 123 (decimal). The 2s complement representation of +123 is
...0000000000000001111011
The 2s complement representation of -123 is
...1111111111111110000101
For query #3, I would expect both +123 and -123 to yield 7.
One could make a case for the answer 3 for -123, I wouldn't know what to expect without it being stated in the docs.
For query #4, I would expect both to yield 8. (Since if you truncate both of those strings to 8 bits, then the positive value starts with 0, and the negative one starts with 1.)
#4 would then generally be #3 + 1 for signed types, I think, so not very interesting, but for unsigned types?
Then of course, there's the difference between "count of the bits" and "bit index", which one might expect to be zero-based. (So that the Nth bit represents 2^N.)
Yes, but that shouldn't be a problem with good names. So, which of them are useful and important enough to propose for inclusion?

On Fri, 26 Aug 2011 18:24:37 +0100, you wrote:
I would usually want #3 or #4.
Out of curiosity, what for? While I do occasionally need to get a "logarithmic size estimate" of a number (which is basically what #3 and #4 are), the specific requirements in each case tend to vary, enough so that it's unlikely that a single function (other than simply taking the logarithm) can handle the majority of applications. -Steve Schafer

On 26/08/2011 07:36 PM, Steve Schafer wrote:
On Fri, 26 Aug 2011 18:24:37 +0100, you wrote:
I would usually want #3 or #4.
Out of curiosity, what for? While I do occasionally need to get a "logarithmic size estimate" of a number (which is basically what #3 and #4 are), the specific requirements in each case tend to vary, enough so that it's unlikely that a single function (other than simply taking the logarithm) can handle the majority of applications.
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value? Or how many bits of randomness you need to compute a value less than or equal to this one? I suppose I could use a binary logarithm. I'm just concerned that it would be rather slow. After all, I'm not interested in the exact logarithm (which is fractional), just the number of bits (which is a small integer)...

On Friday 26 August 2011, 21:30:02, Andrew Coppin wrote:
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value? Or how many bits of randomness you need to compute a value less than or equal to this one?
I suppose I could use a binary logarithm. I'm just concerned that it would be rather slow. After all, I'm not interested in the exact logarithm (which is fractional), just the number of bits (which is a small integer)...
As of GHC-7.2, there's GHC.Integer.Logarithms in both integer-gmp and integer-simple, providing

    integerLog2# :: Integer -> Int#

(integer-* lives before base, so there's no Int yet), which exploits the representation of Integers and should be fast enough [at least for integer-gmp, where it's bounded time for normally represented values; the representation of Integers in integer-simple forces it to be O(log n)]. Caution: integerLog2# expects its argument to be positive; havoc might ensue if it isn't. GHC.Float exports

    integerLogBase :: Integer -> Integer -> Int

(also before 7.2); now it calls the above if the base is 2, so it should have decent performance. (It requires that the base is > 1 and the second argument positive, of course.)
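For instance, a small wrapper over integerLog2# (an editorial sketch; bitLength is a hypothetical name, and the treatment of zero and negative arguments is just one possible choice):

    {-# LANGUAGE MagicHash #-}
    import GHC.Integer.Logarithms (integerLog2#)
    import GHC.Exts (Int(I#))

    bitLength :: Integer -> Int
    bitLength n
      | n > 0     = I# (integerLog2# n) + 1  -- integerLog2# needs a positive argument
      | n == 0    = 0
      | otherwise = bitLength (negate n)     -- count the magnitude's bits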

On Fri, 26 Aug 2011 20:30:02 +0100, you wrote:
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value?
I can't say that I have cared about that sort of thing in a very long time. Bits are rather cheap these days. I store data on disk, and the space it occupies is whatever it is; I don't worry about it.
Or how many bits of randomness you need to compute a value less than or equal to this one?
I'm not sure I understand this one. I generally don't need _any_ randomness to compute a value. On the other hand, if the desire is to take an integer n and generate a set of pseudorandom numbers ranging from 0 to n-1, that's easily done using the standard random number methods. -Steve Schafer

On 26/08/2011 10:51 PM, Steve Schafer wrote:
On Fri, 26 Aug 2011 20:30:02 +0100, you wrote:
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value?
I can't say that I have cared about that sort of thing in a very long time. Bits are rather cheap these days. I store data on disk, and the space it occupies is whatever it is; I don't worry about it.
I meant if you're trying to *implement* serialisation. The Bits class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep? Likewise, you say the standard PRNG can be used to generate random Integers, but what if you're trying to implement a new PRNG?

On Sat, 27 Aug 2011 11:57:57 +0100, you wrote:
I meant if you're trying to *implement* serialisation. The Bits class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep?
For fixed-size types (e.g., Int), I might use a simple byte-for-byte serialization. But these days, I tend to avoid binary serializations, and use string conversions for all types, taking advantage of whatever built-in conversions the language offers. There's obviously more overhead, but the advantages usually outweigh the disadvantages. For one thing, I can come back a couple of years later and still figure out what the data are supposed to be.
Likewise, you say the standard PRNG can be used to generate random Integers, but what if you're trying to implement a new PRNG?
I'm not aware of any PRNGs that use variable-length types (and I would think that such a PRNG wouldn't be very efficient), so I'm still not sure that I understand the question. -Steve Schafer

On Sat, Aug 27, 2011 at 06:57, Andrew Coppin
On 26/08/2011 10:51 PM, Steve Schafer wrote:
On Fri, 26 Aug 2011 20:30:02 +0100, you wrote:
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value?
I can't say that I have cared about that sort of thing in a very long time. Bits are rather cheap these days. I store data on disk, and the space it occupies is whatever it is; I don't worry about it.
I meant if you're trying to *implement* serialisation. The Bits class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep?
I think that falls into the realm of protocol design; if you're doing it in your program at runtime, you're probably doing it wrong. (The fixed size version makes sense for marshaling; it's *dynamic* sizes that need to be thought out beforehand.)

I meant if you're trying to *implement* serialisation. The Bits class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep?
I think that falls into the realm of protocol design; if you're doing it in your program at runtime, you're probably doing it wrong. (The fixed size version makes sense for marshaling; it's *dynamic* sizes that need to be thought out beforehand.)
If you're doing, say, cryptography, then thousand-bit random integers that need to be serialised are fairly common...

On Mon, Aug 29, 2011 at 03:40, Andrew Coppin
I meant if you're trying to *implement* serialisation. The Bits
class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep?
I think that falls into the realm of protocol design; if you're doing it in your program at runtime, you're probably doing it wrong. (The fixed size version makes sense for marshaling; it's *dynamic* sizes that need to be thought out beforehand.)
If you're doing, say, cryptography, then thousand-bit random integers that need to be serialised are fairly common...
Sure, and there are encodings that let you do this without needing bitSize (BER). You need a word count, but that's usually part of the structure holding the integer.
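Not BER itself, but an editorial sketch of the same length-prefixed idea for a non-negative Integer (toBytes and fromBytes are hypothetical names; the single-byte length prefix limits it to 255 digits, unlike real BER):

    import Data.Word (Word8)

    -- length prefix, then base-256 digits, least significant first
    toBytes :: Integer -> [Word8]
    toBytes n = fromIntegral (length ds) : ds
      where
        ds = digits n
        digits 0 = []
        digits m = fromIntegral (m `mod` 256) : digits (m `div` 256)

    fromBytes :: [Word8] -> Integer
    fromBytes []          = 0
    fromBytes (_len : ds) = foldr (\d acc -> acc * 256 + fromIntegral d) 0 ds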

On 29/08/2011 09:00 AM, Brandon Allbery wrote:
On Mon, Aug 29, 2011 at 03:40, Andrew Coppin
I meant if you're trying to *implement* serialisation. The Bits class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep?
I think that falls into the realm of protocol design; if you're doing it in your program at runtime, you're probably doing it wrong. (The fixed size version makes sense for marshaling; it's *dynamic* sizes that need to be thought out beforehand.)
If you're doing, say, cryptography, then thousand-bit random integers that need to be serialised are fairly common...
Sure, and there are encodings that let you do this without needing bitSize (BER). You need a word count, but that's usually part of the structure holding the integer.
OK. But since there's no way of getting a byte count for an Integer either...

On Mon, Aug 29, 2011 at 04:32, Andrew Coppin
OK. But since there's no way of getting a byte count for an Integer either...
The count depends on how you're serializing it; unless you are literally serializing to individual bits and sending them over a synchronous serial link, the correct answer is an integer logarithm by your word size + the current definition of bitSize applied to said word.
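A sketch of that calculation (editorial, with hypothetical names; assumes the sign is handled separately and uses the integerLogBase mentioned earlier in the thread):

    import Data.Bits (bitSize)
    import Data.Word (Word)
    import GHC.Float (integerLogBase)

    -- machine words needed to hold the magnitude of n
    wordsNeeded :: Integer -> Int
    wordsNeeded 0 = 1
    wordsNeeded n = integerLogBase 2 (abs n) `div` wordBits + 1
      where wordBits = bitSize (undefined :: Word)

    -- likewise in bytes
    bytesNeeded :: Integer -> Int
    bytesNeeded 0 = 1
    bytesNeeded n = integerLogBase 2 (abs n) `div` 8 + 1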

On Mon, 29 Aug 2011 08:40:45 +0100, you wrote:
If you're doing, say, cryptography, then thousand-bit random integers that need to be serialised are fairly common...
This is the part that makes no sense to me. Yes, you are absolutely correct that large, multiple-byte integers play a big role in cryptography. But those are invariably FIXED LENGTH multiple-byte integers. As I mentioned before, to the best of my knowledge, no one uses variable-size representations in those kinds of computationally-intensive applications. -Steve Schafer

On 27 August 2011 21:57, Brandon Allbery
On Sat, Aug 27, 2011 at 06:57, Andrew Coppin
On 26/08/2011 10:51 PM, Steve Schafer wrote:
On Fri, 26 Aug 2011 20:30:02 +0100, you wrote:
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value?
I can't say that I have cared about that sort of thing in a very long time. Bits are rather cheap these days. I store data on disk, and the space it occupies is whatever it is; I don't worry about it.
I meant if you're trying to *implement* serialisation. The Bits class allows you to access bits one by one, but surely you'd want some way to know how many bits you need to keep?
I think that falls into the realm of protocol design; if you're doing it in your program at runtime, you're probably doing it wrong. (The fixed size version makes sense for marshaling; it's *dynamic* sizes that need to be thought out beforehand.)
All search engines deal with compressed integers, all compressors do, and so do most people doing bit manipulation. Golomb, Gamma, Elias, and Rice coding all need this. Heck, even the Intel engineers chose to optimize this function by including the BSR instruction in the 386 architecture. This is a basic building block. Don't underestimate the bit, it is coming back with a vengeance. Bit-coding is everywhere now, because of the memory hierarchy. No haskeller should be left behind. Alexander
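As an editorial illustration of why a bit-length primitive matters for these codes, here is Elias gamma coding of a positive Integer, using a deliberately naive bit length (a sketch, not a tuned implementation):

    import Data.Bits (testBit)

    -- Elias gamma: (bitLength n - 1) zeros, then the binary digits of n, MSB first
    eliasGamma :: Integer -> [Bool]
    eliasGamma n
      | n < 1     = error "eliasGamma: argument must be positive"
      | otherwise = replicate (len - 1) False
                 ++ [testBit n i | i <- [len - 1, len - 2 .. 0]]
      where
        len = length (takeWhile (> 0) (iterate (`div` 2) n))  -- naive bit length

    -- eliasGamma 5 == [False, False, True, False, True]   i.e. "00101"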

On Mon, Aug 29, 2011 at 04:08, Alexander Kjeldaas < alexander.kjeldaas@gmail.com> wrote:
All search engines deal with compressed integers, all compressors do, and so do most people doing bit manipulation. Golomb, Gamma, Elias, and Rice coding all need this. Heck, even the Intel engineers chose to optimize this function by including the BSR instruction in the 386 architecture. This is a basic building block.
Don't underestimate the bit, it is coming back with a vengeance. Bit-coding is everywhere now, because of the memory hierarchy. No haskeller should be left behind.
Is it so basic that we must throw out the current meaning of bitSize, with all *its* uses, to satisfy yours?

On Fri, 2011-08-26 at 20:30 +0100, Andrew Coppin wrote:
On 26/08/2011 07:36 PM, Steve Schafer wrote:
On Fri, 26 Aug 2011 18:24:37 +0100, you wrote:
I would usually want #3 or #4.
Out of curiosity, what for? While I do occasionally need to get a "logarithmic size estimate" of a number (which is basically what #3 and #4 are), the specific requirements in each case tend to vary, enough so that it's unlikely that a single function (other than simply taking the logarithm) can handle the majority of applications.
You wouldn't want to know how many bits you need to store on disk to reliably recreate the value? Or how many bits of randomness you need to compute a value less than or equal to this one?
I suppose I could use a binary logarithm. I'm just concerned that it would be rather slow. After all, I'm not interested in the exact logarithm (which is fractional), just the number of bits (which is a small integer)...
According to a random site (http://gruntthepeon.free.fr/ssemath/), not-so-new computers can compute 15.5 million serial logarithms per second (62 million in total). I'd say that the overhead of Integer might be much bigger than the cost of the logarithm. Best regards

On Monday 29 August 2011, 12:32:51, Maciej Marcin Piechotka wrote:
On Fri, 2011-08-26 at 20:30 +0100, Andrew Coppin wrote:
I suppose I could use a binary logarithm. I'm just concerned that it would be rather slow. After all, I'm not interested in the exact logarithm (which is fractional), just the number of bits (which is a small integer)...
According to a random site (http://gruntthepeon.free.fr/ssemath/), not-so-new computers can compute 15.5 million serial logarithms per second (62 million in total). I'd say that the overhead of Integer might be much bigger than the cost of the logarithm.
Well, converting the Integer to Double can easily take longer than calculating the logarithm. The main problem with this approach, however, is that only smallish (cryptographically speaking) Integers can be converted to Double with something resembling adequate precision (above 2^1024-2^972, you'll get Infinity from the conversion, log is Infinity: boom).
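A quick GHCi check of the overflow described above (the exact cutoff depends on Double's representation, but anything much above 2^1024 overflows):

    > fromInteger (2 ^ 1100) :: Double
    Infinity
    > logBase 2 (fromInteger (2 ^ 1100) :: Double)
    Infinity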

On 29/08/2011, at 10:32 PM, Maciej Marcin Piechotka wrote:
According to a random site (http://gruntthepeon.free.fr/ssemath/), not-so-new computers can compute 15.5 million serial logarithms per second (62 million in total). I'd say that the overhead of Integer might be much bigger than the cost of the logarithm.
That's floating-point logarithms, not Integer logarithms. Single-precision floats, at that. The code in question does not link at optimisation level 4. At least some of the benchmark results are impossible to believe:

    benching cephes_sinf ..
    -> 12762.3 millions of vector evaluations/second
    -> 0 cycles/value on a 2000MHz computer

That's reasonably believable – streaming units on current CPUs can execute multiple floating point operations per cycle.

On 30 Aug 2011, at 02:30, Richard O'Keefe wrote:
On 29/08/2011, at 10:32 PM, Maciej Marcin Piechotka wrote:
According to a random site (http://gruntthepeon.free.fr/ssemath/), not-so-new computers can compute 15.5 million serial logarithms per second (62 million in total). I'd say that the overhead of Integer might be much bigger than the cost of the logarithm.
That's floating-point logarithms, not Integer logarithms. Single-precision floats, at that.
The code in question does not link at optimisation level 4. At least some of the benchmark results are impossible to believe:

    benching cephes_sinf ..
    -> 12762.3 millions of vector evaluations/second
    -> 0 cycles/value on a 2000MHz computer

On 30/08/2011, at 7:45 PM, Thomas Davie wrote:
That's reasonably believable – streaming units on current CPUs can execute multiple floating point operations per cycle.
The figures for cephes_{sinf,cosf} are difficult to believe because they are so extremely at variance with the figures that come with the software. First off, the program as delivered, when compiled, reported that the computer was a "2000 MHz" one. It is in fact a 2.66 GHz one. That figure turns out to be a #define in the code. Fix that, and the report includes

    benching cephes_sinf ..
    -> 12779.4 millions of vector evaluations/second
    -> 0 cycles/value on a 2660MHz computer
    benching cephes_cosf ..
    -> 12756.8 millions of vector evaluations/second
    -> 0 cycles/value on a 2660MHz computer
    benching cephes_expf ..
    -> 7.7 millions of vector evaluations/second
    -> 86 cycles/value on a 2660MHz computer

The internal documentation in the program claims the following results on a 2.4 GHz machine:

    benching cephes_sinf ..
    -> 11.6 millions of vector evaluations/second
    -> 56 cycles/value on a 2600MHz computer
    benching cephes_cosf ..
    -> 8.7 millions of vector evaluations/second
    -> 74 cycles/value on a 2600MHz computer
    benching cephes_expf ..
    -> 3.7 millions of vector evaluations/second
    -> 172 cycles/value on a 2600MHz computer

It seems surpassing strange that code compiled by gcc 4.2 on a 2.66 GHz machine should run more than a thousand times faster than code compiled by gcc 4.2 on a 2.60 GHz machine with essentially the same architecture. Especially as those particular functions are *NOT* vectorised. They are foils for comparison with the functions that *ARE* vectorised, for which entirely credible 26.5 million vector evaluations per second (or 25 cycles per value) are reported. Considering that sin and cos are not single floating point operations, 25 cycles per value is well done. 28 cycles per single-precision logarithm is not too shabby either, IF one can trust a benchmark that has blown its credibility as badly as this one has. But it's still not an Integer logarithm, just a single-precision floating-point one.
participants (11)

- Albert Y. C. Lai
- Alexander Kjeldaas
- Andrew Coppin
- Brandon Allbery
- Daniel Fischer
- Daniel Peebles
- Gabor Greif
- Maciej Marcin Piechotka
- Richard O'Keefe
- Steve Schafer
- Thomas Davie