
Simon PJ, Simon, Esa and John,

Here is an update on what I have been doing so far in making a grand attempt to replace GMP.

(1) evaluate replacement libraries

LibTomMath:
Pros-
* has all the operators GMP offered
Cons-
* slow; about half as fast as OpenSSL in my tests for simple mathematical operations, much slower if you account for time to write or resize memory. (There is another MP library, which LibTomMath is based on, that is also very slow--student work)
* not even reentrant; needs significant modification
* beta-release; needs a bit of work to get it to production level
* configuration needs to be written (not really portable, messy)

ARPREC:
Pros-
* very fast (essentially does everything through the Floating Point Unit of a CPU)
* complex mathematical operations
* very precise
* already thread safe (through C++ thread-safe statics)
Cons-
* no bitwise operations (not even left and right-shifts)
* difficult configuration (everything runs by setting a precision level; "precision level" ~= number of words (doubles) in the array; it does not automatically resize memory; conversion from MP Real to Integer relies on careful handling of the precision level)
* memory inefficient (underestimates the number of real digits you can fit into a double, i.e., a 64-bit double has 48 bits of precision, holding about 9.6 digits per byte, resulting in an 848-byte array to hold an MP number with 1000 digits).

OpenSSL:
Crypto++ (http://www.eskimo.com/~weidai/cryptlib.html):
Botan (http://botan.randombit.net/):
Pros-
* all of these are fast, since all use Integers to support cryptography (Crypto++ and Botan are C++ crypto-libraries; all licenses good)
* all of these provide most basic mathematical operations
Cons-
* none of these provide the &, |, ^ (xor) bitwise operators
* Botan has the fewest mathematical operators of the three
* none provide an lcm operator
* all would realistically have to have the Integer libraries stripped out of the distribution and repackaged for GHC

Summary: I finally settled on modifying OpenSSL, since that would be the easiest to work with under GHC's hood (plain C code, not C++). I have to:
a. make OpenSSL's BN thread-safe (add thread local storage, at least)
b. optimally add configure-conditional parallelism to BN (through PVM)
c. add &, |, ^, lcm and a few other operations
d. separate the BN from the rest of OpenSSL and rename the symbols to avoid conflicts (necessary because I have to modify the library anyway)

(2) work on GHC:
* finally understand C--; I know what I need to modify
* work through Makefiles: touch and go; I haven't mapped out all the variable settings from configure.in on down when it comes to DLLs

Comment: for the Makefile in ghc/rts, in lines 300-346,
GC_HC_OPTS += -optc-O3
--isn't this problematic? gcc, from -O2 on, includes -fgcse, which may *reduce* runtime performance in programs using computed gotos; -fgcse is actually run twice, because -frerun-cse-after-loop is also set at -O2. Would it be better to pass individual flags, such as -funroll-loops and -falign-loops=16 (ppc, Intel setting)?

(3) I have been looking at how to implement a dual-constructor-in-a-pointer for Integer (i.e., merge constructors of small Integers and big Integers into the Int#). Would that solution be workable or might it break current Haskell programs? Just a thought.

-Peter

Sounds good! Remember that the memory-allocation mechanism is crucial. How does BN do that?

| (3) I have been looking at how to implement a dual-constructor-in-a-
| pointer for Integer (i.e., merge constructors of small Integers and
| big Integers into the Int#). Would that solution be workable or
| might it break current Haskell programs? Just a thought.

Making a single word contain either a pointer or a non-pointer (depending on the setting of some bit) would have some quite global implications, as would losing 32-bit precision from Int#. I suggest that you do not couple these two projects together! Do the GMP thing first, then investigate this later. We have quite a few bit-twiddling ideas that we have never followed up.

Simon

Remember that the memory-allocation mechanism is crucial. How does BN do that?
BN uses a structure called "CTX"--OpenSSL calls all such structures "CTX"--to hold the local static variables for reentrancy. CTX structures do not affect memory allocation, though they *do* require malloc'd memory. For my purposes, the BN_CTX structure does give me an easy way to handle thread local storage. Otherwise, BN uses standard malloc'd memory. To create a BN-MP (called a BIGNUM; really a struct), you either do:

  BIGNUM x;
  BN_init(&x);  // defined in bn_lib.c; uses memset

or,

  // uses OpenSSL-named checked malloc: OPENSSL_malloc == CRYPTO_malloc
  BIGNUM* y = BN_new();

It would be easy to change the definition of OPENSSL_malloc to call RTS memory as necessary. It would be more efficient for BN to be garbage collected (these bignum libraries allocate and delete a lot of small memory blocks, ~1-2KB for large integers). Since ForeignPtr is simply too slow, I am bringing BN value allocations into the RTS as close as possible to how you did it with GMP.
Making a single word contain either a pointer or a non-pointer (depending on the setting of some bit) would have some quite global implications, as would losing 32 bit precision from Int#. I suggest that you do not couple these two projects together! Do the GMP thing first, then investigate this later. We have quite a few bit-twidding ideas that we have never followed up.
The original idea was to create a specialized Int30# for that purpose. In any case implementing it would certainly make my intended job--getting this done in time for the next release--a bit more difficult. Best regards, Peter

On 10 August 2006 06:32, Peter Tanski wrote:
for the Makefile in ghc/rts, in lines 300-346,
GC_HC_OPTS += -optc-O3
--isn't this problematic? gcc, from -O2 on, includes -fgcse, which may *reduce* runtime performance in programs using computed gotos; -fgcse is actually run twice, because -frerun-cse-after-loop is also set at -O2. Would it be better to pass individual flags, such as -funroll-loops and -falign-loops=16 (ppc, Intel setting)?
Possibly, yes. IIRC, -O3 was mainly to get some loop unrolling. This is a performance-critical part of the system though, and when making any changes we like to measure things to make sure we haven't pessimised performance.
(3) I have been looking at how to implement a dual-constructor-in-a- pointer for Integer (i.e., merge constructors of small Integers and big Integers into the Int#). Would that solution be workable or might it break current Haskell programs? Just a thought.
Which representation in particular are you talking about here? Cheers, Simon

Simon,
Possibly, yes. IIRC, -O3 was mainly to get some loop unrolling. This is a performance-critical part of the system though, and when making any changes we like to measure things to make sure we haven't pessimised performance.
I will try to test it both ways, then. Also, -optc-O3 is turned on even for debug builds since GC_HC_OPTS is set unconditionally. I could change it to add the flag only when -DDEBUG is not in SRC_HC_OPTS:

ifeq (,$(findstring -DDEBUG,$(SRC_HC_OPTS)))
GC_HC_OPTS += -optc-O3
endif
(3) I have been looking at how to implement a dual-constructor-in-a- pointer for Integer (i.e., merge constructors of small Integers and big Integers into the Int#). Would that solution be workable or might it break current Haskell programs? Just a thought.
Which representation in particular are you talking about here?
I was talking about the definition of Integer in packages/base/GHC/Num.lhs, particularly the implementation in PrimOps.cmm, which returns Integers as:

  /* returns (# size :: Int#, data :: ByteArray# #) */
  RET_NP(s,p);

If the ByteArray were indicated through a pointer, the representation in Haskell would be

  J# IntegerT#  {- new type, depending on pointer size, either holding 30 or 62 bits of precision -}

and a Cmm procedure returning an Integer would be:

  RET_NP(p);

I hope this doesn't confuse you.

Best regards,
Peter
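For readers following the representation discussion, here is a minimal sketch that puts the two layouts side by side. The first type mirrors the current definition in GHC/Num.lhs; the second is only an illustration of the proposal, and IntegerT#/ProposedInteger are hypothetical names that do not exist in GHC.

{-# LANGUAGE MagicHash #-}
module IntegerRepSketch where

import GHC.Exts   -- Int#, ByteArray#

-- What GHC uses today: two constructors, so a big Integer is a heap cell
-- holding a sign/size word and a pointer to the limb array.
data CurrentInteger
  = S# Int#               -- small values kept unboxed in the constructor
  | J# Int# ByteArray#    -- sign/size word plus the ByteArray# of limbs

-- The proposal, as described above: one constructor whose single word is
-- either an immediate small integer or a (tagged) reference to a big-integer
-- object. There is no IntegerT# primitive; the commented line only shows the
-- intended shape.
--
--   data ProposedInteger = P# IntegerT#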

Hello, On Thursday 10 August 2006 07:31, Peter Tanski wrote: ...
Summary: I finally settled on modifying OpenSSL, since that would be ...
Being a heavy user of Haskell Integers, I have followed this development with great interest. Although your decision has its drawbacks, it could very well be the best, all things considered.

I would like to mention a few things that I have not seen discussed: Clearly, using an existing library unmodified would be preferable: New developments, error corrections, documentation, wide exposure, all of these things would be available.

I have looked briefly at the OpenSSL Bignum library and in the areas of memory management, but also error handling, it seems clearly intertwined to some extent with OpenSSL in ways which would appear to rule out the direct use of this library for GHC Integers.

However, considering the advantages of using an existing library unchanged, we might consider another possibility: Working with the OpenSSL people to modify their library to allow the sort of interfacing that is needed for its direct and efficient use in GHC. While, of course, retaining its value as part of OpenSSL.

(And way further back: Have we tried to discuss the LGPL licence of GMP with the authors? I am not really into all these matters, sorry if this doesn't make sense.)

Failing that, I would suggest considering the development of the modified library to a form that would allow independent use, apart from its use in GHC. This would add valuable possibilities to your options when choosing the precise mixture of Haskell and, perhaps, raw C code that best balances your performance desires and needs for convenience.

I wish you the best of luck with your work.

Regards
Thorkil

Thorkil,
I would like to mention a few things that I have not seen discussed: Clearly, using an existing library unmodified would be preferable: New developments, error corrections, documentation, wide exposure, all of these things would be available.
Agreed. Unfortunately for us it seems that, apart from GMP, none of the fully developed and fast libraries available have all the functionality Haskell requires: ARPREC (and its predecessor, MPFUN) lack all binary operators (AND, OR, XOR, left-shift, right-shift, bit-flip, word-flip); OpenSSL's BN library lacks AND, OR, XOR and the least common multiple operation (lcm), as do the libraries in Botan and Crypto++. I did not mention this before, but they also lack conversion functions to and from floating point representations. So it would not be possible to use these in GHC and still maintain the same functionality without writing those transformation functions either separately (slow) or in the libraries themselves (modification).

There are libraries that have the functionality we require: MIRACL (at http://indigo.ie/~mscott/) is free for educational use but requires a premium for commercial use; LibTomMath is essentially beta (version 0.38), slow (less than half as fast as OpenSSL, which is equivalent in speed to GMP) and, from my experience when putting it together, a little hacked (after all, it is a beta version); MPI (from http://www.cs.dartmouth.edu/~sting/mpi/) is a good library and has all the math functions but lacks the binary operations and has not been updated since 2003; maybe it has not required updating (it was originally written in 1998). I have read that MPI is not as fast as LibTomMath (I will do a real comparison if anyone is interested). I do know from timing tests I did run in the MPI distribution that MPI tends to operate better on small integers (128-256 bits). LibTomMath and MPI are both essentially public domain. I have heard that LibTomMath is used as the Tcl bignum library; I do not know whether that library supports Perl's bignum (Math::BigInt) but I could look into it.

Bignum libraries are actually fairly widely available but few are portable across a wide variety of domains: Microsoft's .NET bignum library and Apple's vecLib BigNum library, for example. (Apple's vecLib vBigNum is also limited since it offers big integers only at 256, 512 and 1024 bits and it does not provide all the mathematical operations we require.)

You may have noticed from reading the discussions so far that an internal library *is* a possibility.
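One note on the missing lcm operation specifically: only gcd has to come from the underlying library, because lcm can be layered on top of it at the Haskell level, essentially as the Haskell 98 Prelude already does. A sketch (the function name here is arbitrary):

-- lcm expressed through gcd; only gcd needs to be a bignum-library primitive.
lcmInteger :: Integer -> Integer -> Integer
lcmInteger _ 0 = 0
lcmInteger 0 _ = 0
lcmInteger x y = abs ((x `quot` gcd x y) * y)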
I have looked briefly at the OpenSSL Bignum library and in the areas of memory management, but also error handling, it seems clearly intertwined to some extent with OpenSSL in ways which would appear to rule out the direct use of this library for GHC Integers.
OpenSSL's BN library is primarily tuned to support cryptography, particularly the generation of very large primes for public key cryptosystems. It is possible to separate the BN library out (I am having some success there already). It is also possible to use the library separately from Haskell using ForeignPtr, essentially doing everything through Haskell's FFI.

I have honestly not benchmarked an FFI-ForeignPtr interface against the current internal-GMP implementation, partly because the overhead required to use ForeignPtr and the availability of garbage-collected memory for GMP indicate that an internal GHC Bignum library would clearly be faster. Many people have given good arguments for using an FFI interface to such a library (for one, GMP supports dll's and posix dynamic libraries, which would eliminate the licensing issue), so I think before I go on I will set up a benchmark to try the two out. The one thing I cannot do without taking GMP out of the GHC compiler is a side-by-side comparison of GMP-FFI and GMP-internal, because GMP can use only one memory allocator at a time: either GHC's runtime system Garbage Collector or malloc.
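To make the FFI-plus-ForeignPtr alternative concrete, here is a minimal sketch of what such a binding to BN could look like. BN_new, BN_free and BN_add are real OpenSSL BN entry points (link against -lcrypto); the opaque BN type and the Haskell wrapper names are invented for illustration and this is not the code under discussion.

{-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
module BNBinding where

import Foreign.C.Types (CInt)
import Foreign.ForeignPtr
import Foreign.Ptr

data BN   -- opaque stand-in for OpenSSL's BIGNUM struct

foreign import ccall unsafe "BN_new"   c_BN_new  :: IO (Ptr BN)
foreign import ccall unsafe "&BN_free" p_BN_free :: FunPtr (Ptr BN -> IO ())
foreign import ccall unsafe "BN_add"   c_BN_add  :: Ptr BN -> Ptr BN -> Ptr BN -> IO CInt

-- Allocate a BIGNUM whose lifetime is managed by the Haskell garbage
-- collector: when the ForeignPtr becomes unreachable, BN_free is run.
newBN :: IO (ForeignPtr BN)
newBN = c_BN_new >>= newForeignPtr p_BN_free

-- r = a + b, with every operand reached through withForeignPtr; this
-- per-operation plumbing is exactly the overhead a benchmark would measure.
addBN :: ForeignPtr BN -> ForeignPtr BN -> IO (ForeignPtr BN)
addBN a b = do
  r <- newBN
  withForeignPtr r $ \rp ->
    withForeignPtr a $ \ap ->
      withForeignPtr b $ \bp -> do
        _ <- c_BN_add rp ap bp
        return ()
  return r

Whether the cost of the finalizers and the withForeignPtr wrapping is acceptable is precisely the question such a benchmark would answer.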
However, considering the advantages of using an existing library unchanged, we might consider another possibility: Working with the OpenSSL people to modify their library to allow the sort of interfacing that is needed for its direct and efficient use in GHC. While, of course, retaining its value as part of OpenSSL.
I haven't looked at that for reasons of time--perhaps this is my personal target and perhaps it is for the benefit of the next GHC release on 26 August: it may take a long time to work with OpenSSL to modify their library. It might be worth looking into, just to cover all bases. The problem for me is that I seem to be doing all the work, albeit rather slowly; I simply do not have the time to follow every possibility.
(And way further back: Have we tried to discuss the LGPL licence of GMP with the authors? I am not really into all these matters, sorry if this doesn't make sense.)
That is actually a good idea; it has been suggested before for an individual developer. I doubt very much the Free Software Foundation would want to give an exception for one of their flagship products but I admit, that kind of thinking is, well, cowardly. As for whether I am the person to ask them, I could float the question but I could not represent the GHC Team.
Failing that, I would suggest considering the development of the modified library to a form that would allow independent use, apart from its use in GHC. This would add valuable possibilities to your options when choosing the precise mixture of Haskell and, perhaps, raw C code that best balances your performance desires and needs for convenience.
Before I actually bind any library with GHC, I will thoroughly test it as a stand-alone library and I will develop some good benchmarks for it as an FFI-based library (using ForeignPtr). The Creators of GHC, however, have given good arguments why it is Very Good to use GHC's runtime system Garbage Collector to handle the Integer memory for bignum libraries: it is almost certainly faster, it allows the runtime system's Scheduler to operate on Integers without resorting to the creation or management of Operating System (OS) threads, and in general the evaluation of Integers is more a "native" than a "foreign" operation in Haskell programs.

I hope I have answered a few of your points; some of them have given me more work but that is all right, I guess. You did say that you frequently use Integers in Haskell. If you have the time, would you be able to find any programs you might have that displayed poor Integer performance or possibly bottlenecks in the current system? It might be helpful for benchmarking the replacement--this is really supposed to be an "upgrade," after all.

Best regards,
Peter
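For anyone who wants to send numbers, a micro-benchmark along the following lines would already be useful. It is only an illustration (the sizes are arbitrary); it stresses multiplication, division and allocation of progressively larger Integers:

-- A small Integer stress test: a big factorial plus a divMod-heavy digit sum.
-- Timing something like this under the old and the new bignum backend is the
-- kind of comparison being asked for.

factorial :: Integer -> Integer
factorial n = product [1 .. n]

digitSum :: Integer -> Integer
digitSum 0 = 0
digitSum n = let (q, r) = n `divMod` 10 in r + digitSum q

main :: IO ()
main = do
  let x = factorial 10000              -- roughly 35,000 decimal digits
  print (digitSum x)                   -- forces the whole computation
  print (x `mod` (2 ^ (61 :: Int) - 1))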

There's one thing I don't entirely understand about the GMP problem. I understand that there are some limitations on GHC's ability to generate relocatable (and therefore dynamically linkable) code on x86 (a register allocation problem related to the mangler if I recall the comments in the code correctly). But this shouldn't prohibit linking GMP in dynamically, should it? It's just a C library and GCC should happily generate relocatable code. As a dynamically linked library, there should be no tainting issues to worry about even if the dynamically linked code is shipped with the executable. Am I missing something?
Reilly Hayes rfh@reillyhayes.com

Reilly Hayes on 2006-08-10 18:36:49 -0700:
There's one thing I don't entirely understand about the GMP problem. I understand that there are some limitations on GHC's ability to generate relocatable (and therefore dynamically linkable) code on x86 (a register allocation problem related to the mangler if I recall the comments in the code correctly). But this shouldn't prohibit linking GMP in dynamically, should it? It's just a C library and GCC should happily generate relocatable code. As a dynamically linked library, there should be no tainting issues to worry about even if the dynamically linked code is shipped with the executable.
Am I missing something?
No, the LGPL doesn't mandate source redistribution when you redistribute a binary that is dynamically linked to a LGPL-licensed library. If GHC did support dynamically linking programs, it wouldn't be an issue, but GHC only supports that on OS X. I was wondering something similar - is it really easier to replace the functionality and speed of GMP than to extend GHC's dynamic library support to other platforms?

I have a Mac as well, but it is an intel one. GHC does NOT support dynamic linking of Haskell libraries on intel macs, but C libraries like readline & GMP dynamically link just fine. For example:

$ otool -L /usr/local/ghc/lib/ghc-6.5/ghc-6.5
/usr/local/ghc/lib/ghc-6.5/ghc-6.5:
        /usr/local/lib/libreadline.5.0.dylib (compatibility version 5.0.0, current version 5.0.0)
        /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.1.3)
        GMP.framework/Versions/A/GMP (compatibility version 7.0.0, current version 7.0.0)
Reilly Hayes rfh@reillyhayes.com

Hello Peter, Friday, August 11, 2006, 4:00:40 AM, you wrote:
OpenSSL's BN library is primarily tuned to support cryptography, particularly the generation of very large primes for public key cryptosystems. It is possible to separate the BN library out (I am having some success there already). It is also possible to use the library separately from Haskell using ForeignPtr; essentially doing everything through Haskell's FFI. I have honestly not benchmarked a FFI-ForeignPtr interface against the current internal-GMP implementation, partly because the overhead required to use ForeignPtr and the availability of garbage-collected memory for GMP indicate that an internal GHC Bignum library would clearly be faster.
why do you say that ForeignPtr is slow? afaik, malloc/free is slow, but starting from ghc 6.6 the speed of _using_ ForeignPtr is the same as for Ptr -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Fri, 2006-08-11 at 09:34 +0400, Bulat Ziganshin wrote:
why you say that ForeignPtr is slow? afaik, malloc/free is slow, but starting from ghc 6.6 speed of _using_ ForeignPtr is the same as for Ptr
Er, malloc/free is not slow .. it's very fast. I implemented an arena based allocator (one which just increments a pointer) and it was slower than malloc .. ok so my implementation was naive, but still, that does make malloc pretty good.

After all, on the average call where an object of that size is free already, it is a single array lookup; we have:

(a) fetch pointer (one read)
(b) fetch next (one read)
(c) store next as current (one write)

It is very hard to beat that. Indeed, the whole cost of this technique is probably in the mutex based locking that wraps it on a multi-processor.

A purely functional system -- one which does NOT convert self tail calls into jumps and reuse storage -- can perhaps be faster, since each thread can have its own local arena to allocate from (without need of any write barrier) .. however it isn't clear to me what the trade off is between cache misses and parallelism.

Doesn't Haskell do that optimisation?

It is of course hard to test this today without a fairly expensive box with enough CPUs on the one bus.

-- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

The issue isn't that malloc is slow, but rather that we don't know when to call 'free'. To make sure that we do call free, we must make GHC's garbage collector remember when it drops the reference to the object, and then call free. That is what Peter refers to as the ForeignPtr solution. It's gotten a lot cheaper in GHC 6.6 than it used to be, so it's worth trying. A merit of the approach is that it avoids fiddling with the bignum allocator at all.

Simon

Simon PJ and Bulat,
[the] ForeignPtr solution [has] gotten a lot cheaper in GHC 6.6 than it used to be, so it's worth trying. A merit of the approach is that it avoids fiddling with the bignum allocator at all.
I actually did not know that until today; I have tried to keep up with the rapid changes going on but until Simon Marlow posted the FFI syntax patch on cvs-ghc-request I had not read into it that much. It won't be too much trouble for me to do a bare FFI binding to GMP or another library (people seem to be having problems with OpenSSL's license) but what I have been doing still applies: writing bitwise operators and cleaning things up. I don't know how much the indirection of an FFI binding would degrade the speed compared to a direct C-- binding (you have an extra function call with FFI); it should not be any more costly than indirection through two pointers. This will be quick: I will set up a GMP-FFI binding as a speed reference, for starters.

Best regards,
Peter

John,
After all on the average call where an object of that size is free already it is a single array lookup, we have:
(a) fetch pointer (one read) (b) fetch next (one read) (c) store next as current (one write)
This is true for memory access; it is not true for memory allocation. I do not know how malloc allocates memory on Windows but on general POSIX systems the kernel uses a linked list and lots of other management things to reduce fragmentation, such as KMEMSTAT. Malloc may also block, which is something that you have more control over in your own garbage collected heap. A really good explanation for the problem of rapidly allocating and deallocating temporary blocks of memory under 35kb is here: http://ridiculousfish.com/blog/archives/2006/05/16/36/ .

In any case, Simon Marlow had previously mentioned that alloc (from GHC's heap) is faster than malloc. He is almost certainly correct, although I hope the difference will not be that great and the only thing I have to worry about is ForeignPtr. We shall see whether malloc-memory makes a difference in the benchmarks.
A purely functional system -- one which does NOT convert self tail calls into jumps and reuse storage -- can perhaps be faster, since each thread can have its own local arena to allocate from (without need of any write barrier) .. however it isn't clear to me what the trade off is between cache misses and parallelism.
That is interesting but I do not understand whether your mention of self tail calls turned into jumps was low or high level. From the context it seems as if you are talking about a high level implementation: each function running in a separate thread. GHC's RTS does use many separate threads (the RTS is threaded by default for the latest version, 6.6). As for turning self tail calls into jumps at the low level, GHC does do this through C-- (the GHC implementation of C-- is called Cmm). I believe that is both faster and more memory efficient than a high level threaded system.

Philosophically speaking, even if Simon Peyton-Jones developed Cmm to solve the problem of efficient functional computations, Cmm has turned the research of Haskell from research on a computer language to research on a system of computation. (Maybe that is what he meant when some time ago he wrote John Meacham and said that they (the GHC researchers) considered compiling via C a dead end.) A language can be anything: all it requires is a means of being translated into machine code; pseudo-intelligent and advanced compiler systems such as gcc, JHC for Haskell or OCaml for the Caml version of ML may translate programming languages into machine code, but the underlying computation remains largely sequential. The curious thing about GHC-Haskell is that through the prism of Cmm, which enforces such things as immutable variables and recursion right at the machine level, Haskell is less a language of translation to sequential machine code and more a description of a computational model. If you still think I am wrong about this, consider the possibility that Haskell with Cmm is a modern research project in the same concept that motivated Lisp: a different model of computation.

Best regards,
Peter
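To make the low-level point concrete: a self tail call like the accumulator loop below is compiled by GHC, via Cmm, into a jump back to the top of the function with its arguments overwritten, rather than into a new stack frame per step. The example is only an illustration, not taken from GHC's sources.

-- The recursive call is in tail position, so the generated code is a loop;
-- the seq keeps the accumulator evaluated so the loop runs in constant space
-- even without optimisation.
sumTo :: Int -> Int -> Int -> Int
sumTo acc i n
  | i > n     = acc
  | otherwise = let acc' = acc + i
                in acc' `seq` sumTo acc' (i + 1) n

main :: IO ()
main = print (sumTo 0 1 10000000)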

On Fri, 2006-08-11 at 18:56 -0400, Peter Tanski wrote:
John,
After all on the average call where an object of that size is free already it is a single array lookup, we have:
(a) fetch pointer (one read) (b) fetch next (one read) (c) store next as current (one write)
This is true for memory access; it is not true for memory allocation.
That's an *allocation* obtained by popping a suitable block off a linked list (sorry should have explained better)
I do not know how malloc allocates memory on Windows but on general POSIX systems the kernel uses a linked list and lots of other management things to reduce fragmentation, such KMEMSTAT.
Yes, but most allocations don't require a kernel call: malloc is implemented in libc and suballocates a block obtained from the kernel. Free'd blocks go on a free list indexed by block size .. so the typical program will get to a state where most allocations reduce to popping a list.
Malloc may also block, which is something that you have more control over in your own garbage collected heap.
Yes.
A really good explanation for the problem of rapidly allocating and deallocating temporary blocks of memory under 35kb is here: http://ridiculousfish.com/blog/archives/2006/05/16/36/ .
Eh? That explains why OSX is slower with 35K blocks because it switches to kernel calls for larger allocations at that size .. the article doesn't say anything about smaller blocks. I assume we're considering small blocks here: list nodes and so on are small, and these would be the majority case in a functional programming language.
A purely functional system -- one which does NOT convert self tail calls into jumps and reuse storage -- can perhaps be faster, since each thread can have its own local arena to allocate from (without need of any write barrier) .. however it isn't clear to me what the trade off is between cache misses and parallelism.
That is interesting but I do not understand whether your mention of self tail calls turned into jumps was low or high level.
What I mean is the optimisation (which Felix does for functional code) whereby you convert:

int f(int x) {
  int y = x + x;
  if (x > 100) return x;
  return f(y);
}

to

int f(int x) {
  int y;
start:
  y = x + x;
  if (x > 100) return x;
  x = y;      // assign argument
  goto start;
}

This optimisation eliminates the need to spawn a new stack frame, and will eliminate cache misses. So it looks very efficient. My point here is that actually this is a disastrous optimisation in a multi-processing environment, because in general, the assignment of a pointer means the store isn't write once.

With write once store -- ignoring registers -- you can collect lazily in a background thread. Although you're not scanning all of memory .. new memory is allocated as time goes on .. it doesn't matter. Any garbage you find is certain to be garbage. This is because you cannot miss a heap block in your sweep. Even though new blocks are being allocated in parallel, they can only obtain a pointer from the old picture of the heap (implying the block is reachable), and the pointer in the old block from which it was obtained cannot be erased, since constructed blocks are read only.

I haven't figured out all the details. I am imagining a generational copying collector with a separate young heap for every thread, whilst the great grand mother heap is collected in the background. This means (a) mother collection and (b) young heap compaction both work on a per thread basis independently of other threads and without any synchronisation or locking.

What requires synchronisation is aging. Moving objects out of the child heaps into a parent needs to be serialised, and the start of a new sweep of the parent heap must twiddle a lock or something to ensure it has a coherent snapshot of the heap at some point of this serialisation (however the picture of the heap only needs to be synchronised once). For example, if you have a linked list of the mother heap blocks, you have to serialise pushing stuff onto it during ageing .. and you have to get a valid pointer to start the sweep from.

The point is, this scheme reduces the need for serialisation to the infrequent need to age objects or start a new sweep of the mother heap: the actual sweep, raw allocations, and even young heap compaction all occur in separate threads and don't interfere with each other. Thus, it seems a lot more scalable to a multi-processing environment to actually have a LOW LEVEL purely functional paradigm: no store can be written more than once. More precisely, pointers are initialised to NULL and can be assigned a non-NULL heap pointer once, but a non-NULL pointer can never be erased.

Which means: converting tail calls to a loop and using mutable store is a no-no: it is faster on a single CPU but will actually be slower on a multi-processor because you'd be forced to use an inferior garbage collection scheme.

Felix uses an inferior scheme because it uses the C/C++ object model by specification. I have to stop ALL threads to collect. This is REALLY BAD: if you have N processors, N-1 of them can wait for a long time for the last one to stop! [Felix cannot signal the threads to stop. It has to wait until they reach a yield point]
(Maybe that is what he meant when some time ago he wrote John Meacham and said that they (the GHC researchers) considered compiling via C a dead end.)
Yes. I tend to agree. C is really very dead, it lacks really basic serial primitives, and it uses a stack model: it isn't really any good even for C :) Unfortunately, hardware design tends to supply stuff that runs C code fast ;(
The curious thing about GHC- Haskell is that through the prism of Cmm, which enforces such things as immutable variables and recursion right at the machine level, Haskell is less a language of translation to sequential machine code and more a description of a computational model. If you still think I am wrong about this,
I don't. I think you're right. Purely functional storage model seems to scale better to multiple threads. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

Hello skaller, Saturday, August 12, 2006, 7:06:15 AM, you wrote:
My point here is that actually this is a disastrous optimisation in a multi-processing environment, because in general, the assignment of a pointer means the store isn't write once.
:) all the variables rewritten are local to the function. _without_ tail call optimization, we create a new stack frame for each recursive call. _with_ optimization, we just update vars in the first stack frame created, because we know that these vars will not be reused after return from the call
I haven't figured all the details. I am imagining a generational copying collector with a separate young heap for every thread, whilst the great grand mother heap is collected in the background.
it's close to the GHC scheme - a 2-generational GC with the 1st generation local to the thread and having a default size of 256kb (tied to the size of L2 caches of modern CPUs), while the 2nd generation is global for all threads and collected at moments when the memory currently allocated from the OS is exhausted. but currently 2nd-generation GCs don't run in the background and can take as long as several seconds. it's definitely hard to write games (or even web servers) with such an architecture! :)

so memory allocation is very fast. 1st-level GC is also fast because it runs inside the L2 cache. access to "local variables" (i.e. data structures allocated in the near past) is fast because they are all inside the L2 cache. although Haskell doesn't support "auto" variables which don't need GC and are freed automatically, this scheme allows to significantly lower the cost of managing short-lived data structures

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, 2006-08-12 at 10:58 +0400, Bulat Ziganshin wrote:
Hello skaller,
Saturday, August 12, 2006, 7:06:15 AM, you wrote:
My point here is that actually this is a disastrous optimisation in a multi-processing environment, because in general, the assignment of a pointer means the store isn't write once.
:) all the variables rewritten is local to the function. _without_ tail call optimization, we create new stack frame for each recursive call. _with_ optimization, we just update vars in the first stack frame created because we know that these vars will not be reused after return from call
Yes, but this defeats the use of the kind of collector I attempted to describe. The problem will occur if the 'stack' is aged: in that case the sweep can miss the mutation and reap a live object. -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

Hello skaller, Saturday, August 12, 2006, 12:34:54 PM, you wrote:
My point here is that actually this is a disastrous optimisation in a multi-processing environment, because in general, the assignment of a pointer means the store isn't write once.
:) all the variables rewritten is local to the function. _without_ tail call optimization, we create new stack frame for each recursive call. _with_ optimization, we just update vars in the first stack frame created because we know that these vars will not be reused after return from call
Yes, but this defeats the use of the kind of collector I attempted to describe.
The problem will occur if the 'stack' is aged: in that case the sweep can miss the mutation and reap a live object.
well, i don't know how updatable vars in the stack interoperate with GC. let someone else answer this question. but afaik GHC 6.5 supports multi-cpu execution of multiple Haskell threads (which makes it a better language for modern multi-core cpus), tail-call optimization and 2-generation GC (actually, it supports any number of generations). also, GHC supports mutable arrays. you can search the GHC bug tickets for one that optimized GC with mutable vars: in ghc64 _each_ GC, including minor ones, needs to scan _every_ updatable reference (IORef/STRef) and _each_ element of _every_ updatable array (IOArray/STArray), and this significantly slows down many programs, including GHC itself. in ghc65 a better scheme is now used

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On Sat, 2006-08-12 at 16:58 +0400, Bulat Ziganshin wrote:
Hello skaller,
hi .. :)
The problem will occur if the 'stack' is aged: in that case the sweep can miss the mutation and reap a live object.
well, i don't know how updatable vars in the stack interoperate with GC. let someone else answer this question.
I know very little about Haskell, let alone GHC internals (I'm mainly using Ocaml at the moment, to implement my own programming language). Haskell seems to be on the right track in many ways: I view the desirable upgrade path for programmers as roughly

  C --> C++ --> Felix --> ML --> Haskell

which represents a gradual move towards a more functional and declarative style. It is not just that this is 'easier to reason about' for the human programmer, and thus provides better ways of ensuring correctness, but also that it is 'easier to reason about' for machine algorithms, and thus provides for better optimisation and much faster code.

The move to multi-processing on desktop computers seems to accelerate the need for better languages than C++ (don't even mention that J*** language .. :)
but afaik GHC 6.5 supports multi-cpu execution of multiple Haskell threads (which makes it a better language for modern multi-core cpus), tail-call optimization and 2-generation GC (actually, it supports any number of generations). also, GHC supports mutable arrays. you search GHC bug tickets for one that optimized GC with mutable vars: in ghc64 _each_ GC, including minor ones need to scan _every_ updatable reference (IORef/STRef) and _each_ element of _every_ updatable array (IOArray/STArray) and this significantly slowdowns many programs, including GHC itself. in ghc65 better scheme now used
Cool. I guess what I'm saying is: there seems to be a contradiction between multi-processing and mutation, perhaps more specifically shared state.

The tail-rec and array optimisation stuff may map functional code to a procedural model and obtain good performance on a uni-processor .. but perhaps it does so at the cost of bad performance on a multi-processor? It may turn out that the underlying reason for preferring a lazy purely functional high level programming language is also a reason for a similar low level execution model. In particular .. it seems to me it defeats high performance garbage collection of the very kind Haskell already uses. The issue isn't making the collector faster .. but being able to run it asynchronously.

We all surely conceive the functional model as constructive: you make new objects as you need them, but never change old ones. This is basically how 'mathematics' works. You don't worry about objects you don't need any more. Since we don't have infinite memory on real computers, we use a garbage collector to recycle the unreachable store. And we all 'conceive' that as a transparent background process.

Clearly, we need not just an asynchronous reaper -- we also need to be able to execute multiple reapers in parallel, otherwise the system will not scale to a large number of CPUs. But the state of the art is then two stages behind the requirement: Haskell still has to 'world stop' threads to do a major collection.

My suggestion is that the cause of this is precisely that it is possible to age the independent per thread young heaps so that the aged objects, now in the shared major heap, are mutable. To prevent the mutable parts changing during collection, all the threads have to be stopped. Do I understand correctly that this is what actually happens at the moment?

To fix the problem, we have to prevent the mutable store ever getting into the major heap. That can be done by throwing out the tail-rec, array, and other optimisations that 'cheat' the collector by manually reusing store, and bypassing the collection. That works for a single thread only because you can't collect and spew objects at the same time, ensuring synchronisation.

So I'm bringing into question whether these nice 'optimisations' are actually worthwhile. They actually seem to *degrade* performance, not improve it, when we're running with a large number of CPUs. Stopping the world if you have 20,000 CPUs will happen so often, all the CPUs will be idle 99.99% of the time :) Removing the optimisation will slow down each CPU, but remove any need to stop the world for long periods. The world WILL have to pause to serialise aging the young heaps. It would be even better if the collection space could somehow be partitioned so you could dedicate say 1% of the CPUs to collection, that is, support multiple parallel background collection: that is mandatory for scalability.

I note in passing that because the problem is with *aging* mutable store .. the problem also goes away if it is not aged. In particular, if all mutable objects are forcibly retained in the young heap, there's no problem. For tail-rec optimisation this is probably viable, and the implementation is possibly quite simple, at least in principle: allow the machine stack as an 'additional' generation, with the special property that it only permits FILO allocation in the usual way. This means (a) no copying into the major heap and (b) no compaction (unnecessary).
If the system supported this, there should be no problem using the stack for a couple of mutable variables needed for a tail-rec optimisation. [Array might be a different story, don't know] -- John Skaller <skaller at users dot sf dot net> Felix, successor to C++: http://felix.sf.net

Hello skaller, Sunday, August 13, 2006, 4:34:14 AM, you wrote:
I know very little about Haskell, let alone GHC internals
me too. so it's better to wait for comments about your thoughts from the GHC team than from me. but at least i can say that
But the state of the art is then two stages behind the requirement: Haskell still has to 'world stop' threads to do a major collection.
is not exactly true. look at "Non-stop Haskell" (http://www.haskell.org/~simonmar/papers/nonstop.pdf) i don't know why it is not included in 6.6 or previous version
So I'm bringing into question whether these nice 'optimisations' are actually worthwhile. They actually seem to *degrade* performance, not improve it, when we're running with a large number of CPUs. Stopping the world if you have 20,000 CPU's will happen so often, all the CPU's will be idle 99.99% of the time :)
btw, one GHC intern worked on a multi-processor GC and i hope that it will be included in 6.6. so, the GC will also use all these 20k cpus :) or Intel/IBM/Sun will start making some FP chips. they already do this actually, just these chips still emulate x86/sparc/... ones :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello skaller,
Sunday, August 13, 2006, 4:34:14 AM, you wrote:
But the state of the art is then two stages behind the requirement: Haskell still has to 'world stop' threads to do a major collection.
is not exactly true. look at "Non-stop Haskell" (http://www.haskell.org/~simonmar/papers/nonstop.pdf)
i don't know why it is not included in 6.6 or previous version
Simply because it adds overhead (both in runtime and code complexity), and the benefits are relatively modest. To comment on another point in this thread: currently the generation 0 GCs (minor GCs) also stop-the-world, even on a multiprocessor. This is a significant problem, and we have some ideas for fixing it, but no code yet (it's a possible intern project, if anyone is interested). Parallel major GC was worked on by Roshan James during his internship here this summer, and we have some working code, but it needs a lot more testing and tuning, and it may be some time before we can incorporate this into the mainline. Cheers, Simon

Hello Simon, Monday, August 21, 2006, 6:55:47 PM, you wrote:
is not exactly true. look at "Non-stop Haskell"
Simply because it adds overhead (both in runtime and code complexity), and the benefits are relatively modest.
I think that a GC that doesn't stop the world is just a different product, even if it makes the program, say, 2 times slower overall. Just imagine a game like Quake written in GHC ;) Of course, I understand that you don't want to support one more GC architecture; I just mentioned why it may matter for some people.
Parallel major GC was worked on by Roshan James during his internship here this summer, and we have some working code, but it needs a lot more testing and tuning, and it may be some time before we can incorporate this into the mainline.
so it seems that it will not be included in 6.6.* ? -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 21 August 2006 16:20, Bulat Ziganshin wrote:
Hello Simon,
Monday, August 21, 2006, 6:55:47 PM, you wrote:
Parallel major GC was worked on by Roshan James during his internship here this summer, and we have some working code, but it needs a lot more testing and tuning, and it may be some time before we can incorporate this into the mainline.
so it seems that it will not be included in 6.6.* ?
Unfortunately, no. I hope to work on it after the 6.6 release. Cheers, Simon

Hello Peter, Sorry for the late reply. From your latest communication, which seems to be the one from Peter Tanski dated Sat, 12 Aug 2006 21:12:05 -0400:
I hope I have answered a few of your points; some of them have given me more work but that is all right, I guess. ...
And you most certainly have, in much more detail than I could have hoped for. And hopefully, this additional work that you mention is not something that you considered a waste of time. I am also most happy to read that
Before I actually bind any library with GHC, I will thoroughly test it as a stand-alone library ...
since "correctness" of Integer computation is by far its most important property. (I would say that "performance" comes second, then what could be termed "interoperability" as third.) The support for Integer computation is a messy subject: On the one hand, even children out of the first few years of school are capable of understanding the basic functionality involved. On the other hand, providing efficient support of this functionality, especially if required in a wide selection of circumstances, requires a surprising amount of hard work and insight. I would say, if only really large integers were routinely required in actual, real-world computations, our CPUs would support these computations and there would be no need to consider the question in the, admittedly, fairly limited context of GHC. The situation seems to be that the functionality is perhaps considered obviously available, but the reality is that it can only be attained at significant cost (either real money or honest sweat). I am truly unable to tell what I would consider the ideal situation. On the one hand, I value greatly the freedom of choosing my circumstances without restraints. And I appreciate the desire of noble people like the GHC developers to be equally free of such restraints. This would point in the direction of basing Integer computations on a fairly generic implementation. Which insightful consideration would then force us to admit would probably cost a factor of, say, 2 or more in performance in some cases. On the other hand, the existence of libraries like GMP seems to offer an economic path out of this jungle. However, as the discussion over the past couple of weeks illustrates, the path may be unacceptably thorny. Let me quote some additional statements from this debate: On Thursday 10 August 2006 19:00, Simon Peyton-Jones wrote: ...
OK so you are talking of taking a copy of the BN code, changing the function names, and statically linking its allocator to GHC's RTS.
If someone wants to make a program that uses BN for some other purpose, they get the original BN, whose function names don't clash, and which uses malloc.
Sounds ok to me.
Simon ...
Two problems seem to be lurking here: The first is that, although taking a copy of some existing library and massaging it to become working under some presently available circumstances may be fine, what is really needed is something that keeps on working, even under changing and future circumstances. The second is that if I wish to mix Haskell and non-Haskell code that processes the same data, I may find myself in need of a way to convert between whatever is required by this copy of some existing library that has been created and the presently available, but potentially quite different, version of that same library. No offence is meant here: I know that I am simply re-iterating what I have said earlier. However, I believe that this is sufficiently important to risk being ridiculed for pointing out the obvious. Then you ask
... If you have the time, would you be able to find any programs you might have that displayed poor Integer performance or possibly bottlenecks in the current system?
Let me say first that I have not experienced any really surprisingly poor performance of the current system. To be sure, I have seen programs similar to my own and not written in Haskell that performed better on comparable tasks. But I have not been able to say specifically that this poorer performance of my program was caused by some inadequacy in the Haskell (with GMP) implementation.

I better be specific here: What I do is integer factorization (splitting integers into their prime factors). And, for example, I have implemented a version of the ECM (Elliptic Curve Method) in Haskell. Having done that (independently and mainly to learn what it was about), I eventually compared it in performance to GMP-ECM which Paul Zimmermann wrote. And I was initially disappointed to find that GMP-ECM was about 7 times faster than my ECM code. Fortunately, various sessions of "hand-waving" brought this factor down to about 1.5. So, whenever GMP-ECM used 2 minutes, my program would use 3 minutes.

I must admit that I am neither satisfied nor done with this result. At the outset, when I started using Haskell for these number theoretic computations, I had hoped that the overhead of using Haskell, compared to the use of C with GMP calls, would be insignificant, something like 10 or at worst 20 percent. So this 50 percent of overhead seems a bit excessive. On the other hand, there may be additional hands to wave that I haven't thought of. And again: I am not done working with this, just holding an extended break. Please indicate if you wish to see some detailed and hard results.

Some additional experience is this: To gain some potentially easier-to-compare measurements between Haskell-with-GMP and C-with-GMP, I implemented the GMP function mpz_powm (raising a number to a power modulo a third number) as a primitive operation in GHC. And compared it to my own power-raising operation that was implemented in Haskell. Overall, the GMP implementation of the power operation was 2 times faster than my Haskell implementation. But, to make sure, using some techniques that my code certainly didn't use.

An interesting thing happened, however, for very large numbers: The GMP power operation started using a lot of memory, thrashing, and then, eventually, with sufficiently large operands, crashed for lack of memory. Where my Haskell-implemented version continued to provide usable results.

The explanation of this should undoubtedly be found in the method (some FFT-thing) used for the power operation by GMP for very large numbers: My guess is that GMP when using this method issues a large number of allocate/free requests. And, again I am guessing, assuming that each allocate request causes the GHC runtime to allocate, whereas each free only causes a "we'll handle this at the next GC", then such large computations may very well display such behaviour. An example where the "Very Good Thing" of using the GHC RTS memory management could be questioned.

I will not bother you further, except for expressing my sincere respect for the work that you are doing. Best regards Thorkil
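[Editor's note: Thorkil's powering experiment above compares GHC's GMP-based mpz_powm primitive with a power-raising operation written directly in Haskell. As a rough illustration of the kind of Haskell-level code involved, here is a minimal square-and-multiply sketch on Integer; it is not Thorkil's actual program and needs nothing beyond multiplication and mod.

    -- Minimal square-and-multiply modular exponentiation on Integer.
    -- Assumes m > 0 and e >= 0.  A sketch only, not Thorkil's code.
    powMod :: Integer -> Integer -> Integer -> Integer
    powMod base e m = go (base `mod` m) e 1
      where
        go _ 0 acc = acc
        go b k acc
          | odd k     = go b' k' ((acc * b) `mod` m)
          | otherwise = go b' k' acc
          where
            b' = (b * b) `mod` m      -- square
            k' = k `div` 2            -- halve the exponent

For example, powMod 2 (10^6) (10^9 + 7) finishes in O(log e) Integer multiplications, each of which lands in whichever multiprecision library backs Integer.]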

Hello Thorkil, I am very sorry for the late reply. I have been extremely busy and I wanted to give you a coherent answer.

For a brief overview of the speed of the libraries I looked at carefully, see http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes (I added a few charts to show the speed of ARPREC, OpenSSL's BN, GMP and LibTomMath. I did not add speed tests for Crypto++ and Botan because they don't measure up. The original timings I obtained for them were based on their own benchmarks, which are inadequate and (for Crypto++) based on tuned assembly code only available on Pentium4 processors with SSE.) I tested GMP and OpenSSL's BN for reference.

Over the past few weeks I tore Crypto++ apart and modified a few things, only to find out that it has a persistent bug: woven throughout the library is a conversion from 32-bit to 64-bit ints using unions. This kind of transformation breaks C's (and C++'s) aliasing rules (thanks to John Skaller for pointing out the problem), so compiling Crypto++ with optimisations turned on (g++ -O3) introduces failures, especially in the Division algorithms. I could change the unions to bitwise transformations with masks but I would have to really dig out all the references. After some more rigorous timing tests I found that I would have to essentially rewrite the algorithms anyway. What a mess.

After some more research I found that there really are no other good public domain or BSD-compatible licensed libraries available. I tested two other free arbitrary precision integer libraries, MPI and MAPM, but they were also too slow, sometimes as much as 4 times slower. MAPM uses a Fast Fourier Transform (FFT) from Takuya Ooura (see http://momonga.t.u-tokyo.ac.jp/~ooura/fft.html) and should have been fast but turned out to be even slower than MPI.

If you look at the ReplacingGMPNotes page I mentioned at the top of this email, the charts show that LibTomMath is weak in multiplication--at larger integer sizes (2048-4096 bits) it is half as fast as GMP, or worse. On the other hand ARPREC, which also uses a FFT algorithm, is slow at lower precisions (256-512 bits) for two reasons: (1) at relatively low precisions, ARPREC defaults to "faster" standard algorithms instead of its FFT and (2) when using its fast FFT at medium levels of precision (512 bits) the FFT is too ponderous to keep up with the relatively lighter and faster algorithms of the int-based libraries (GMP, OpenSSL's BN and LibTomMath). (As a little history, ARPREC used to be fairly bad relative to other FFT programs available but was redone after 2003, so by 2004 it was fairly good; it is up to version 2.1.94 now and it is much better. If you are looking at ARPREC benchmarks online prior to 2003, they are too old to be good indicators of its present capability.)

I keep talking about ARPREC--why? For three reasons: (1) I trust its level of precision--this has been tested, see FFTW's benchmark page for accuracy: http://www.fftw.org/accuracy/G4-1.06GHz-gcc4/ (2) if you look at the charts, although ARPREC is bad in division it simply blows GMP, OpenSSL and LibTomMath away: at 4096 bits (85 doubles--each double has conservatively only 48.3 or so bits of integer precision), ARPREC can take a full random number to pow(n,7) in .98 seconds, compared to 77.61 or so seconds for the leader of the Integer-based libraries, GMP. (I ran the tests many times to make sure readings weren't flukes.)
(3) of all the FFT-based arbitrary precision libraries available, ARPREC is the only BSD-licensed one--Takuya Ooura's library (used in MAPM) is only a FFT algorithm and not necessarily either fast or accurate. The rest of the FFT libraries available are essentially GMP-licensed. So I am in an unenviable position: I intend to fulfill my promise and get a replacement for GHC (and maybe more), but I have to essentially build better functionality into the front-runner, ARPREC. At present I have been working with vector-based algorithms that would enable me to use hardware-tuned code for Single Instruction Multiple Data (SIMD) chipsets. Currently I am researching algorithms based on operating-system supplied vector libraries. Part of this modification involves a fast transformation between a vector of large integers and an array of doubles, without loss of precision (although vectors of doubles are also working well, they do not have the same library support.) I am also working on enhancing ARPREC's division algorithm. This is the problem I face: GHC unfortunately does not use Integer as a mathematical operation but as a primitive type, complete with bitwise operations. From my experience, GHC users typically use Integers at lower precisions (typically under 256 bits) and they do not use Integer for higher math. (Integer math operations are basic primitives, as you already know.) I would like to build a library that is all-around fast, both under 256 bits and beyond 4096 bits (good FFT libraries operate over 100,000 bits). On Aug 24, 2006, at 2:39 PM, Thorkil Naur wrote:
I am truly unable to tell what I would consider the ideal situation. On the one hand, I value greatly the freedom of choosing my circumstances without restraints. And I appreciate the desire of noble people like the GHC developers to be equally free of such restraints. This would point in the direction of basing Integer computations on a fairly generic implementation. Which insightful consideration would then force us to admit would probably cost a factor of, say, 2 or more in performance in some cases.
The problem I face is: performance at which level? The low-bit-size (low precision) range or the high precision range?
Two problems seem to be lurking here: The first is that, although taking a copy of some existing library and massaging it to become working under some presently available circumstances may be fine, what is really needed is something that keeps on working, even under changing and future circumstances. The second is that if I wish to mix Haskell and non- Haskell code that processes the same data, I may find myself in need of a way to convert between whatever is required by this copy of some existing library that has been created and the presently available, but potentially quite different, version of that same library.
Absolutely correct. As you may have observed from your own experience, there are advantages to having GHC's RunTime System (RTS) handle the operations, either because garbage collected memory is faster to allocate or because the natural laziness between normally eager mathematical operations (mathematical operators are strict in their arguments but Haskell functions wrapping those operators may be lazy) means you can use GHC to strategically cache results of higher level polynomial operations. What I am working on is intended to be an independent library, designed to be stable for the long-term and interoperable with non-Haskell programs. Modifying ARPREC will result in a dynamic library (or DLL) that outputs simple arrays of doubles--this is how ARPREC currently functions; there is no need to wrap a C++ class in an opaque C pointer you cannot operate on independently. The huge downside to ARPREC and one of the least trivial of my problems is how to efficiently convert an array of doubles to precise integers, especially for cryptographic purposes.
I better be specific here: What I do is integer factorization (splitting integers into their prime factors). And, for example, I have implemented a version of the ECM (Elliptic Curve Method) in Haskell. Having done that (independently and mainly to learn what it was about), I eventually compared it in performance to GMP-ECM which Paul Zimmermann wrote. And I was initially disappointed to find that GMP-ECM was about 7 times faster than my ECM code. Fortunately, various sessions of "hand-waving" brought this factor down to about 1.5. So, whenever GMP-ECM used 2 minutes, my program would use 3 minutes.
One of the reasons I am working on a library written in C++ is that Haskell may be too slow for such things. GMP is also tuned code: if you look at the GMP source code, they have incorporated processor- specific assembly code in some places. On the other hand, GMP still has some rough corners--in some functions I have looked at carefully, an essentially unused function may be based on a header file that reads "not yet implemented."
And again: I am not done working with this, just holding an extended break. Please indicate if you wish to see some detailed and hard results.
My best question is: are you using pure mathematical formulas for this? GMP, LibTomMath and of course OpenSSL's BN, the pure-integer based libraries, at the polynomial level mostly use modular arithmetic algorithms, Karatsuba multiplication, Montgomery or Barrett reduction. On the low-level they use bitwise operations (say, right shift and add for multiplication--the gcc compiler seems to use this as the default). If you are using pure mathematical functions, ARPREC is fine (at least it is precise). Mathematically, there are alternative algorithms for floating point numbers that are not available for integers. Floating point numbers may use division by inverse (converting division into a multiplication operation by taking the inverse of the divisor (1/d) at a low precision and using addition, multiplication and subtraction until you reach the desired precision). Integer-based numbers may use Montgomery reduction (which you are no doubt already familiar with).
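[Editor's note: to make the division-by-inverse idea concrete (an illustrative sketch only, not ARPREC's actual routine), the classic Newton step x' = x * (2 - d * x) refines an approximation x of 1/d using nothing but multiplication and subtraction, doubling the number of correct digits each time. It is shown here with Double for clarity; an MP library would repeat the step at increasing precision.

    -- Division via a reciprocal refined by Newton iteration.  Assumes d > 0.
    recipNewton :: Double -> Double
    recipNewton d = iterate step seed !! 6   -- six steps are plenty for Double
      where
        -- the seed 2^(-k) with 2^k >= d lies in (0, 2/d), which guarantees convergence
        seed   = 2 ** negate (fromIntegral (ceiling (logBase 2 d) :: Integer))
        step x = x * (2 - d * x)             -- only multiply and subtract

    divViaRecip :: Double -> Double -> Double
    divViaRecip n d = n * recipNewton d]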
Some additional experience is this: To gain some potentially easier-to-compare measurements between Haskell-with-GMP and C-with-GMP, I implemented the GMP function mpz_powm (raising a number to a power modulo a third number) as a primitive operation in GHC. And compared it to my own power-raising operation that was implemented in Haskell.
Overall, the GMP implementation of the power operation was 2 times faster than my Haskell implementation. But, to make sure, using some techniques that my code certainly didn't use.
An interesting thing happened, however, for very large numbers: The GMP power operation started using a lot of memory, thrashing, and then, eventually, with sufficiently large operands, crashed for lack of memory. Where my Haskell-implemented version continued to provide usable results.
The explanation of this should undoubtedly be found in the method (some FFT-thing) used for the power operation by GMP for very large numbers: My guess is that GMP when using this method issues a large number of allocate/free requests. And, again I am guessing, assuming that each allocate request causes the GHC runtime to allocate, whereas each free only causes a "we'll handle this at the next GC", then such large computations may very well display such behaviour.
Agreed. This is one of those cases where I think Haskell's laziness may have cached your intermediate results. I couldn't tell without looking at the source code. I do not think GMP uses an FFT for multiplication; I have not investigated this extensively. -Pete

Hello, Thanks a lot for your reply. Here are some comments on it. As is customary, I must apologise for the late reply (the response time for this conversation seems to increase exponentially with time), which also may very well make some of the comments totally outdated. On Friday 01 September 2006 06:43, Peter Tanski wrote:
...
For a brief overview of the speed of the libraries I looked at carefully, see http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes (I added a few charts to show the speed of ARPREC, OpenSSL's BN, GMP and LibTomMath. ....) I tested GMP and OpenSSL's BN for reference.
I must confess that it took me a while to understand what you were doing here and I am still uncertain: For example, how many multiplications were actually performed for bit size 4096? In addition, for "Powers", the markings on the horizontal axis ("256 pow(n,3)", "512 pow(n,4)", "1024 pow(n5)" (missing a "," here?), ...) on your graph seem to indicate that you are changing two different variables (the bit size and the exponent) at the same time. I would suggest that you also quote the original measurements directly. And perhaps (especially in the case of the "Powers" and "Division") some more details about what was actually measured. It is distinctly odd that squaring seems to be more expensive than multiplication for some bit sizes in 3 of the 4 measured cases. Also, I wonder what divisions these libraries are capable of carrying out so much faster than comparable multiplications. For the division measurements, I may again simply be presuming something about your experiments that simply isn't true.
... I keep talking about ARPREC--why? For three reasons: (1) I trust its level of precision--this has been tested, see FFTW's benchmark page for accuracy: http://www.fftw.org/accuracy/G4-1.06GHz-gcc4/
I am not sure I understand here: With respect to Haskell Integers, I would not say that there is any concept of a "level of precision": If Integer computations are not accurate, they are wrong.
(2) if you look at the charts, although ARPREC is bad in division it simply blows GMP, OpenSSL and LibTomMath away: at 4096 bits (85 doubles--each double has conservatively only 48.3 or so bits of integer precision), ARPREC can take a full random number to pow(n,7) in .98 seconds, compared to 77.61 or so seconds for the leader of the Integer-based libraries, GMP. (I ran the tests many times to make sure readings weren't flukes.)
Again, some additional details about these measurements would be most welcome.
(3) of all the FFT-based arbitrary precision libraries available, ARPREC is the only BSD-licensed one--Takuya Ooura's library (used in MAPM) is only a FFT algorithm and not necessarily either fast or accurate. The rest of the FFT libraries available are essentially GMP-licensed.
So I am in an unenviable position: I intend to fulfill my promise and get a replacement for GHC (and maybe more), but I have to essentially build better functionality into the front-runner, ARPREC. At present I have been working with vector-based algorithms that would enable me to use hardware-tuned code for Single Instruction Multiple Data (SIMD) chipsets. Currently I am researching algorithms based on operating-system supplied vector libraries. Part of this modification involves a fast transformation between a vector of large integers and an array of doubles, without loss of precision (although vectors of doubles are also working well, they do not have the same library support.) I am also working on enhancing ARPREC's division algorithm.
I looked briefly at ARPREC: It seems that it works with an explicitly set maximum bound on the number of digits desired. Although it also appears that it is possible to change this bound as computations proceed, such additional administration seems difficult to embrace.
This is the problem I face: GHC unfortunately does not use Integer as a mathematical operation but as a primitive type, complete with bitwise operations.
I do not understand what problem you are referring to here.
... On Aug 24, 2006, at 2:39 PM, Thorkil Naur wrote:
I am truly unable to tell what I would consider the ideal situation. On the one hand, I value greatly the freedom of choosing my circumstances without restraints. And I appreciate the desire of noble people like the GHC developers to be equally free of such restraints. This would point in the direction of basing Integer computations on a fairly generic implementation. Which insightful consideration would then force us to admit would probably cost a factor of, say, 2 or more in performance in some cases.
The problem I face is: performance at which level? The low-bit-size (low precision) range or the high precision range?
Two problems seem to be lurking here: The first is that, although taking a copy of some existing library and massaging it to become working under some presently available circumstances may be fine, what is really needed is something that keeps on working, even under changing and future circumstances. The second is that if I wish to mix Haskell and non- Haskell code that processes the same data, I may find myself in need of a way to convert between whatever is required by this copy of some existing library that has been created and the presently available, but potentially quite different, version of that same library.
Absolutely correct. As you may have observed from your own experience, there are advantages to having GHC's RunTime System (RTS) handle the operations, either because garbage collected memory is faster to allocate or because the natural laziness between normally eager mathematical operations (mathematical operators are strict in their arguments but Haskell functions wrapping those operators may be lazy) means you can use GHC to strategically cache results of higher level polynomial operations. What I am working on is intended to be an independent library, designed to be stable for the long-term and interoperable with non-Haskell programs. Modifying ARPREC will result in a dynamic library (or DLL) that outputs simple arrays of doubles--this is how ARPREC currently functions; there is no need to wrap a C++ class in an opaque C pointer you cannot operate on independently. The huge downside to ARPREC and one of the least trivial of my problems is how to efficiently convert an array of doubles to precise integers, especially for cryptographic purposes.
Could you explain exactly what this problem is? If it still makes sense to spend time on it, of course.
I better be specific here: What I do is integer factorization (splitting integers into their prime factors). And, for example, I have implemented a version of the ECM (Elliptic Curve Method) in Haskell. Having done that (independently and mainly to learn what it was about), I eventually compared it in performance to GMP-ECM which Paul Zimmermann wrote. And I was initially disappointed to find that GMP-ECM was about 7 times faster than my ECM code. Fortunately, various sessions of "hand-waving" brought this factor down to about 1.5. So, whenever GMP-ECM used 2 minutes, my program would use 3 minutes.
One of the reasons I am working on a library written in C++ is that Haskell may be too slow for such things.
I am not sure what you mean here. I would happily accept a price of, say, 10 or 20 percent reduced performance, if I got the ease of programming (mostly) in Haskell in return. And I am still not convinced that, in a computation that involves heavy Integer work, multiplications, divisions, greatest common divisor computations with, say, 200-1000 digit numbers, that the Haskell reduced performance price could not be kept as low as this.
... My best question is: are you using pure mathematical formulas for this?
I am not sure what this question means. Please elaborate.
GMP, LibTomMath and of course OpenSSL's BN, the pure-integer based libraries, at the polynomial level mostly use modular arithmetic algorithms, Karatsuba multiplication, Montgomery or Barrett reduction. On the low-level they use bitwise operations (say, right shift and add for multiplication--the gcc compiler seems to use this as the default). If you are using pure mathematical functions, ARPREC is fine (at least it is precise). Mathematically, there are alternative algorithms for floating point numbers that are not available for integers. Floating point numbers may use division by inverse (converting division into a multiplication operation by taking the inverse of the divisor (1/d) at a low precision and using addition, multiplication and subtraction until you reach the desired precision).
Again, I am not sure what you mean by this. It is certainly possible to compute useful inverses of integers without using a floating point representation. To be sure, the inverse (reciprocal) of an integer is not an integer, but it could still be computed in a suitable fixed point representation. See, for example, Knuth's "Seminumerical Algorithms" ("The art of computer programming", vol 2), end of section 4.3.1 and algorithm R in section 4.3.3.
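[Editor's note: for a concrete sense of the fixed-point route Thorkil mentions (a sketch in the spirit of, but not identical to, the Knuth material cited above): precompute r = floor(2^k / d) once, after which later divisions by the same d need only a multiplication, a shift and a small correction, all in Integer arithmetic.

    import Data.Bits (shiftL, shiftR)

    -- Fixed-point reciprocal sketch: r = floor(2^k / d) is computed once;
    -- n `div` d is then recovered from (n * r) >> k plus a small upward
    -- correction.  Assumes d > 0 and n >= 0; choose k at least the bit
    -- length of n so only a step or two of correction is ever needed.
    divByRecip :: Int -> Integer -> Integer -> Integer
    divByRecip k d n = correct ((n * r) `shiftR` k)
      where
        r = (1 `shiftL` k) `div` d            -- precomputed scaled reciprocal
        correct q
          | n - q * d >= d = correct (q + 1)  -- the estimate never overshoots
          | otherwise      = q]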
Integer-based numbers may use Montgomery reduction (which you are no doubt already familiar with).
Some additional experience is this: To gain some potentially easier-to-compare measurements between Haskell-with-GMP and C-with-GMP, I implemented the GMP function mpz_powm (raising a number to a power modulo a third number) as a primitive operation in GHC. And compared it to my own power-raising operation that was implemented in Haskell.
Overall, the GMP implementation of the power operation was 2 times faster than my Haskell implementation. But, to make sure, using some techniques that my code certainly didn't use.
An interesting thing happened, however, for very large numbers: The GMP power operation started using a lot of memory, thrashing, and then, eventually, with sufficiently large operands, crashed for lack of memory. Where my Haskell-implemented version continued to provide usable results.
The explanation of this should undoubtedly be found in the method (some FFT-thing) used for the power operation by GMP for very large numbers: My guess is that GMP when using this method issues a large number of allocate/free requests. And, again I am guessing, assuming that each allocate request causes the GHC runtime to allocate, whereas each free only causes a "we'll handle this at the next GC", then such large computations may very well display such behaviour.
Agreed. This is one of those cases where I think Haskell's laziness may have cached your intermediate results. I couldn't tell without looking at the source code. I do not think GMP uses an FFT for multiplication; I have not investigated this extensively.
I still believe that the excessive memory usage was caused by GMP allocating a lot of storage using the GHC allocator during a single mpz_powm call, so that the GHC garbage collector never got the opportunity to actually free the allocated storage. The mpz_powm function of gmp-4.2 that I am looking at calls mpn_sqr_n inside its main loop and that function itself allocates storage in some cases. In other cases, it calls mpn_mul_fft_full which, again, allocates storage.
-Pete ...
Elsewhere (http://www.haskell.org/pipermail/cvs-ghc/2006-October/032345.html), you wrote:
... I have been working on an arbitrary precision integer library from scratch but I have a version of GHC without any GMP in it right now ...
so it may very well be that some of the above remarks are not relevant any more. In addition, you wrote (http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601....):
... Integers are mostly a convenience. ...
I fully agree with that. Therefore, as I outline what I now consider the ideal situation with respect to Haskell and Integer support, keep in mind that I have very specific desires that may not be shared by any other Haskell users and which should be prioritized accordingly.

My highest priority is still correctness. The thought that there might, in some special case, have crept an error into my basic arithmetic operators makes me shudder. In the present case of multiprecision integer operations, what worries me is the complications and the many special cases that implementors deal with. And not least special cases that address some particular architecture: This opens the possibility that code working on one machine stops working on another. It seems that my level of tolerance of such risks is not as high as most other people's. Personally, if I were to write a multiprecision library myself, I would build in all sorts of possibilities for verifying that everything remains in place and results computed were indeed correct.

For example, I understand that memory errors occur, not often, but often enough to be significant for such longwinding computations (days, months, years) that I envision. Other people are doing long computations (e.g. http://www.mersenne.org/) and they exercise some care in the form of test programs that are intended to verify that a particular machine is sufficiently stable. I would probably carry this even further, by including, even in "production" runs, various checks (individual arithmetic operations verified by checks modulo some number, say).

Another thing that bothers me is the unpredictability of the performance of these multiprecision packages. When I look at the GMP code, there are countless special cases that are taken into account. Each of them, of course, represents the implementor's desire to provide me, the user, with the best service (performance) possible. But it makes measuring performance a nightmare, because you never know, for sure, whether some special case has been taken into account in a particular situation.

This was brought home to me most forcefully some months back: I have GHC-compiled programs running on a couple of Linux machines (not the same Linux distribution), but had used some precompiled GMP package on both of these machines. After looking at the GMP code, I realised (as I should have much earlier, of course) that I should use a package that was compiled for the specific machine I was working on to ensure that all the various special performance tricks could come into play. So I did. And performance improved on one machine, but degraded on another. I am still not sure what the explanation of this behaviour really is. But one thing I am certainly concerned about is whether, in fact, the GMP build process is actually able to guess some machine architecture correctly and ensure that its potential is used to the fullest extent.

One thing that seems missing in many packages is the possibility to experiment with the tuning parameters more dynamically. For example, GMP includes a tuning program that is able to suggest values for certain parameters. But these parameters then have to be compiled into the package for final use. Not particularly elegant and something which, surely, could be changed, without significant overhead.

The conclusion, for me at least, is this: If you are a naive user of Integers, you are much better off with a completely simple, no-special-cases, solid and working, multiprecision package. Which might even be fairly portable.

For heavy users (such as me) with special requirements, it seems that they can never get the ultimate performance without having considerable insight into the details of the matter anyway. And, hence, they need to do some complex work themselves to get the desired performance.

For Haskell, this would mean ensuring that the interface between GHC-compiled code and some multiprecision package was reasonably comprehensible and allowed the multiprecision support to be replaced. And here I try to avoid the usual dream of "easy replacement", that seems entirely unrealistic. But something reasonable that the rest of GHC (and not least: The GHC developers) can live with without too much work. And that heavy users, with sufficient insight and probably a significant dose of hard work (such as that you have put into this) would be able to use for their own purposes.

Best regards Thorkil

On Dec 29, 2006, at 8:32 AM, Thorkil Naur wrote:
On Friday 01 September 2006 06:43, Peter Tanski wrote:
... For a brief overview of the speed of the libraries I looked at carefully, see http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes (I added a few charts to show the speed of ARPREC, OpenSSL's BN, GMP and LibTomMath. ....) I tested GMP and OpenSSL's BN for reference.
I must confess that it took me a while to understand what you were doing here and I am still uncertain: For example, how many multiplications were actually performed for bit size 4096?
4096 = size 4 (counting from zero: sizes[5] = {256,512,1024,2048,4096}) 1,000,000 / ( 4 * 3 ) = 83,333 rounds My reason for doing this was simple: as a basic comparison, rather than an absolute measurement, the number of rounds doesn't matter as long as the results are measurable. Besides, more rounds means more time running each test. I did a lot of tweaking, especially with ARPREC, to get each library to (1) a generally available speed and (2) a configuration similar to what it would be when used with GHC. I could have measured the speed in nanoseconds, with one iteration for each calculation using random numbers of a specified size and posting the mean for a number of trials but that would have required me to use OS-X specific profiling software like Shark in order to get reliable measurements--a bit more labour intensive as it would require me to manually perform each test. (I did use random numbers of a specified size.)
In addition, for "Powers", the markings on the horizontal axis ("256 pow(n,3)", "512 pow(n,4)", "1024 pow (n5)" (missing a "," here?), ...) on your graph seem to indicate that you are changing two different variables (the bit size and the exponent) at the same time.
Yes, the testing is a bit sloppy there (so is the graph; ugly typo). The graph shows a general trend more than anything else. I actually tested the Exponentials (Powers) individually for each size and timing but posting a 3-D graph or making the graph (time = exponent/size) seemed like overkill or would conflict with the previous measurements. Not a bad idea, though, just for clarity.
I would suggest that you also quote the original measurements directly. And perhaps (especially in the case of the "Powers" and "Division") some more details about what was actually measured.
I did quote the original measurements directly. There wasn't much variance overall and I took what seemed like median results from a number of tests. What matters is the relative time to initialise and perform each computation since in a GHC-implementation each computation would require some minimal initialisation. ARPREC was built for this and in ARPREC-only tests the major factor in speed of initialisation was the time to compute the architecture and precision-specific constants for PI, Log_2 and Log_10 (the Epsilon constant doesn't require much time). Log_2 and Log_10 are necessary for printing Integers because computations in ARPREC are performed as floating-point values and must be converted to decimal digits by multiplying by Log_10. (Note that printing Integers also requires a size increase as the mantissa holds the Integer value, requiring further multiplication by the float-exponent.) Details on differences between algorithms used in each library would be fairly complex: as you already know, each library (ARPREC, GMP, LibTomMath, etc.) uses a different algorithm based on the size of the number-array *and* each may have a different implementation of an algorithm--LibTomMath uses a simpler Karatsuba algorithm, for example.
It is distinctly odd that squaring seems to be more expensive than multiplication for some bit sizes in 3 of the 4 measured cases.
This is also an implementation matter: the particular algorithms change with size and squaring may require some extra initialisation for, say, computing the size of the result and the number of operations. All of the libraries provide special squaring functions and I used those. LibTomMath is a good example: it uses its "baseline" squaring algorithm for small sizes and Comba-optimised Toom-Cook or Karatsuba algorithms for larger sizes. (I purposely did not tune LibTomMath or any of the libraries because I wanted a more-or-less average-case comparison, so the Karatsuba algorithm was used for size=512 bytes (128 digits) and Toom-Cook was used for size=2048 bytes (512 digits).) So where you see LibTomMath's time dip in the middle of the 'Squaring' graph it is using the Karatsuba algorithm. ARPREC uses a FFT for everything above size=256 and calculates with fewer actual digits (it stores extra "size" as an exponent, just like ordinary floating point numbers). The trends in the graphs for ARPREC versus GMP generally hold true until GMP's FFT kicks in, at size > 32768.
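[Editor's note: as a concrete picture of the size-based algorithm switching described above, here is a minimal Karatsuba sketch on non-negative Haskell Integers. It is an illustration only; LibTomMath's Comba-optimised version works on digit arrays, and the cut-over value here is arbitrary.

    import Data.Bits (shiftL, shiftR, (.&.))

    -- Karatsuba multiplication: split each operand at bit position b and
    -- replace one of the four sub-products with additions/subtractions.
    -- Falls back to the built-in (*) below an arbitrary cut-over size.
    karatsuba :: Integer -> Integer -> Integer
    karatsuba x y
      | x < cutover || y < cutover = x * y
      | otherwise =
          let b        = max (bitLen x) (bitLen y) `div` 2
              (x1, x0) = (x `shiftR` b, x .&. mask b)
              (y1, y0) = (y `shiftR` b, y .&. mask b)
              z2 = karatsuba x1 y1
              z0 = karatsuba x0 y0
              z1 = karatsuba (x1 + x0) (y1 + y0) - z2 - z0   -- x1*y0 + x0*y1
          in (z2 `shiftL` (2 * b)) + (z1 `shiftL` b) + z0
      where
        cutover  = 2 ^ (256 :: Int)
        mask b   = (1 `shiftL` b) - 1
        bitLen n = length (takeWhile (> 0) (iterate (`shiftR` 1) n))]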
Also, I wonder what divisions these libraries are capable of carrying out so much faster than comparable multiplications. For the division measurements, I may again simply be presuming something about your experiments that simply isn't true.
As for Division, here is the short answer why ARPREC was so much slower than the rest and why LibTomMath was so fast where a corresponding LibTomMath multiplication was relatively slower. ARPREC computes division as long division for small operands (size=256) and multiplication by the reciprocal (Newton-Raphson iteration followed by Karp's trick to eliminate trailing errors in the mantissa) for larger operations (>=512): that is why ARPREC is so slow at size=256 and grows progressively faster for greater sizes. LibTomMath division relies on subtraction at first to compute the approximate number of iterations--it computes the *entire* number as an approximation rather than GMP's recursive division from the paper, C. Burnikel and J. Ziegler, "Fast Recursive Division" (1998).
... I keep talking about ARPREC--why? For three reasons: (1) I trust its level of precision--this has been tested, see FFTW's benchmark page for accuracy: http://www.fftw.org/accuracy/G4-1.06GHz-gcc4/
I am not sure I understand here: With respect to Haskell Integers, I would not say that there is any concept of a "level of precision": If Integer computations are not accurate, they are wrong.
FFTs must be performed with floating point or fixed point numbers (unless you want to deal with some mad algorithm involving continued fractions): there will always be some approximation, so I always consider precision. The same might be said for division with remainder for some algorithms. All this holds especially true when you must round the approximation to what users would expect to be an exact integer.
Again, some additional details about these measurements would be most welcome.
I hope the above explanations help a little.
I looked briefly at ARPREC: It seems that it works with an explicitly set maximum bound on the number of digits desired. Although it also appears that it is possible to change this bound as computations proceed, such additional administration seems difficult to embrace.
The size relation for operations corresponds to magnitude, so addition and subtraction operations grow or shrink linearly while multiplication, division and powers (exponents) change logarithmically (roughly, log_2). It would add an extra check on initialisation for every operation (or a string of operations if they were chained together) but it would not be a large price. This is actually done internally by all the other libraries, anyway. The simple solution is to put the check into each function in the ARPREC file c_mp.cpp.
This is the problem I face: GHC unfortunately does not use Integer as a mathematical operation but as a primitive type, complete with bitwise operations.
I do not understand what problem you are referring to here.
It is both an aesthetic and an implementation problem. If Integer were a mathematical library the primary consideration would be mathematical operations and the user would choose operations to perform a more complex calculation or series of calculations. For GMP or LibTomMath you may know the general size of the operands in your program and use different functions for multiplication, division, squaring, powers, roots or other things like modular arithmetic and number theory. As a pedestrian primitive, Integer must support the basic operations (only the basic operations) of a primitive type. All multiplication goes through one function and the user relies on the library to choose the appropriate one. The user would not be able to choose a special multiplication algorithm because he knows the numbers will be, say, larger than 256 bits in size or modularly related. This puts the burden on the compiler to optimise the library for the user, if possible. Meanwhile, as a primitive, Integers in Haskell only support a basic set of operations: any extra operations would have to be provided in a special library. (Though I may add extra primitive operations to GHC, this would be nonstandard). As an aesthetic consideration, I mentioned bitwise operations because they are atomic, primitive operations typically associated with machine types (int32_t, uint64_t, etc.).
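[Editor's note: for instance, treating Integer as a primitive means the full Data.Bits interface has to work at arbitrary sizes alongside the arithmetic. A small illustrative snippet, not GHC internals:

    import Data.Bits ((.&.), (.|.), xor, shiftL, shiftR)

    -- Integer is an instance of Data.Bits, so shifts and masks must behave
    -- like machine-word operations but at any width.
    scramble :: Integer -> Integer
    scramble n = ((n `shiftL` 300) .|. 0xff) `xor` (n `shiftR` 3 .&. mask)
      where mask = (1 `shiftL` 128) - 1]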
Modifying ARPREC will result in a dynamic library (or DLL) that outputs simple arrays of doubles--this is how ARPREC currently functions; there is no need to wrap a C++ class in an opaque C pointer you cannot operate on independently. The huge downside to ARPREC and one of the least trivial of my problems is how to efficiently convert an array of doubles to precise integers, especially for cryptographic purposes.
Could you explain exactly what this problem is? If it still makes sense to spend time on it, of course.
This point is somewhat moot since I am no longer using ARPREC :) The first problem is opaque data types from C++/C. If ARPREC stored the resulting number as a Class object with members of different types (ints and doubles), like this:

    // apint.hpp
    class APInt {
    public:
        // ...
    private:
        int size;
        double exponent;
        double* mantissa;
    };

Then, in a separate .cpp file I would create functions callable from C:

    // c_apint.h
    extern "C" {
        APInt* newAPInt (int size, ...);
        void destroyAPInt (APInt* x);
        int addAPInt (APInt* x, APInt* y, APInt* res);
        // ...
    } // end extern "C"

The way to work with this in C (and Haskell) would be an opaque pointer:

    typedef struct APInt* pAPInt;

From C, I could never access the members of the struct directly, so any special Haskell functions for modifying those members would have to be special functions in the C-interface file, 'c_apint.h'.

An additional problem, somewhat related to the opaque-object problem, is converting an array of doubles to a (precise) array of uint32_t. Individual conversions between doubles and ints are expensive and you have to be very careful about rounding. Alternatives that avoid rounding are: (1) normalise the arbitrary precision double (and all the doubles in the array) and use a union to access the mantissa of each double in the array directly, and convert each to a uint32_t while carrying the overflow from each mantissa to the next uint32_t; or, (2) convert the array of doubles to a string and convert the string to an array of uint32_t. Both methods are expensive--you would have to multiply the arbitrary precision double by log_10 and normalise it, to say nothing of the conversion. Worse, for Haskell Integers some conversion would be necessary, unless you want to use arithmetic means to perform the bitwise primitive operations.
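[Editor's note: returning to the opaque-pointer side of this, from Haskell such a pointer would normally be handled through the FFI with a ForeignPtr. A minimal sketch, assuming the hypothetical c_apint.h functions above (the varargs of newAPInt are dropped for simplicity; none of this is GHC's actual Integer binding):

    {-# LANGUAGE ForeignFunctionInterface, EmptyDataDecls #-}
    import Foreign.Ptr (Ptr, FunPtr)
    import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
    import Foreign.C.Types (CInt(..))

    data APInt   -- opaque: the C++ members are never visible from Haskell

    foreign import ccall "newAPInt"
      c_newAPInt :: CInt -> IO (Ptr APInt)
    foreign import ccall "&destroyAPInt"
      p_destroyAPInt :: FunPtr (Ptr APInt -> IO ())
    foreign import ccall "addAPInt"
      c_addAPInt :: Ptr APInt -> Ptr APInt -> Ptr APInt -> IO CInt

    -- Tie the C object's lifetime to the Haskell heap: destroyAPInt runs
    -- when the ForeignPtr becomes unreachable.
    newInt :: Int -> IO (ForeignPtr APInt)
    newInt size = newForeignPtr p_destroyAPInt =<< c_newAPInt (fromIntegral size)

    addInt :: ForeignPtr APInt -> ForeignPtr APInt -> ForeignPtr APInt -> IO CInt
    addInt x y res =
      withForeignPtr x   $ \px  ->
      withForeignPtr y   $ \py  ->
      withForeignPtr res $ \pres -> c_addAPInt px py pres]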
I better be specific here: What I do is integer factorization (splitting integers into their prime factors). And, for example, I have implemented a version of the ECM (Elliptic Curve Method) in Haskell. Having done that (independently and mainly to learn what it was about), I eventually compared it in performance to GMP-ECM which Paul Zimmermann wrote. And I was initially disappointed to find that GMP-ECM was about 7 times faster than my ECM code. Fortunately, various sessions of "hand-waving" brought this factor down to about 1.5. So, whenever GMP-ECM used 2 minutes, my program would use 3 minutes.
One of the reasons I am working on a library written in C++ is that Haskell may be too slow for such things.
I am not sure what you mean here. I would happily accept a price of, say, 10 or 20 percent reduced performance, if I got the ease of programming (mostly) in Haskell in return. And I am still not convinced that, in a computation that involves heavy Integer work, multiplications, divisions, greatest common divisor computations with, say, 200-1000 digit numbers, that the Haskell reduced performance price could not be kept as low as this.
You could be right--at present I am fairly pessimistic about the various overheads imposed by Haskell for such things as arrays but that does not mean I am correct. You may have noticed my mention of the BigFloat library in a somewhat related thread, "bignums, gmp, bytestring ... ?," at http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601.html . BigFloat
My best question is: are you using pure mathematical formulas for this?
I am not sure what this question means. Please elaborate.
It was a question out of curiosity: for your ECM work, are you able to implement the mathematical formulae directly in Haskell? (Yes, I believe "formulae" is the correct plural of "formula.") C programs require many special operations for dealing with the data and a fast Haskell program would probably have to use a few (maybe I am a bad Haskell programmer?).
If you are using pure mathematical functions, ARPREC is fine (at least it is precise). Mathematically, there are alternative algorithms for floating point numbers that are not available for integers...
Again, I am not sure what you mean by this. It is certainly possible to compute useful inverses of integers without using a floating point representation. To be sure, the inverse (reciprocal) of an integer is not an integer, but it could still be computed in a suitable fixed point representation. See, for example, Knuth's "Seminumerical Algorithms" ("The art of computer programming", vol 2), end of section 4.3.1 and algorithm R in section 4.3.3.
Fixed point is certainly an option but it is not a "natural" operation over ints and may add more overhead than necessary considering the relative speed advantages of modern floating point processors. (By "not available," I was talking about hardware support though I'll admit my typo-riddled prose may have thrown you off--"Mathematically," may have been an artifact.) All of the fastest integer division algorithms I am aware of do not use fixed point for division by inverse multiplication.
I still believe that the excessive memory usage was caused by GMP allocating a lot of storage using the GHC allocator during a single mpz_powm call, so that the GHC garbage collector never got the opportunity to actually free the allocated storage. The mpz_powm function of gmp-4.2 that I am looking at calls mpn_sqr_n inside its main loop and that function itself allocates storage in some cases. In other cases, it calls mpn_mul_fft_full which, again, allocates storage.
O.K. I misunderstood your program: I was under the impression that you wrote the GMP program in C. That is an interesting insight concerning the GHC garbage collector.
so it may very well be that some of the above remarks are not relevant any more. In addition, you wrote (http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601.html):
... Integers are mostly a convenience. ...
...
My highest priority is still correctness. The thought that there might, in some special case, have crept an error into my basic arithmetic operators makes me shudder. In the present case of multiprecision integer operations, what worries me is the complications and the many special cases that implementors deal with. And not least special cases that address some particular architecture: This opens the possibility that code working on one machine stops working on another.
The general trend for floating or fixed point libraries that seek precision is low to medium speed; there are many libraries out there which are faster but not as precise. One example is the BOOST uBLAS library: it is slower than BLAS but more precise.
It seems that my level of tolerance of such risks is not as high as most other people's. Personally, if I were to write a multiprecision library myself, I would build in all sorts of possibilities for verifying that everything remains in place and results computed were indeed correct.
Ideally, these should be handled by the compiler-system rather than by run-time checks in code.
For example, I understand that memory errors occur, not often, but often enough to be significant for such longwinding computations (days, months, years) that I envision. Other people are doing long computations (e.g. http://www.mersenne.org/) and they exercise some care in the form of test programs that are intended to verify that a particular machine is sufficiently stable. I would probably carry this even further, by including, even in "production" runs, various checks (individual arithmetic operations verified by checks modulo some number, say).
That is an excellent point. No arbitrary (multi) precision library I know of provides a facility for performing such checks. It would be a nice thing to add.
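[Editor's note: as a toy illustration of the kind of check Thorkil describes (a sketch only, not a feature of any existing library), one can verify each multiprecision product against its residues modulo a couple of small primes. This is cheap and catches many memory and logic errors, though of course it is not a proof of correctness.

    -- Double-check an Integer multiplication by comparing residues modulo
    -- a few small primes.
    checkedMul :: Integer -> Integer -> Integer
    checkedMul x y
      | all ok primes = z
      | otherwise     = error "integer multiplication failed residue check"
      where
        z      = x * y
        primes = [2147483647, 2305843009213693951]   -- 2^31 - 1, 2^61 - 1
        ok p   = ((x `mod` p) * (y `mod` p)) `mod` p == z `mod` p]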
Another thing that bothers me is the unpredictability of the performance of these multiprecision packages. When I look at the GMP code, there are countless special cases that are taken into account. Each of them, of course, represents the implementor's desire to provide me, the user, with the best service (performance) possible. But it makes measuring performance a nightmare, because you never know, for sure, whether some special case has been taken into account in a particular situation.
That is one reason I kept the comparison of different libraries very basic and general. The best performance comes from special tuning tied to a particular equation, as with FFTW.
This was brought home to me most forcefully some months back: I have GHC-compiled programs running on a couple of Linux machines (not the same Linux distribution), but had used some precompiled GMP package on both of these machines. After looking at the GMP code, I realised (as I should have much earlier, of course) that I should use a package that was compiled for the specific machine I was working on to ensure that all the various special performance tricks could come into play. So I did. And performance improved on one machine, but degraded on another. I am still not sure what the explanation of this behaviour really is. But one thing I am certainly concerned about is whether, in fact, the GMP build process is actually able to guess some machine architecture correctly and ensure that its potential is used to the fullest extent.
This is a very big problem with every fast mathematical library. The GMP build system is primitive (it tends to rely on assembler blocks written for particular types of machines but does not perform further testing). You know how ATLAS uses a specialised build process to automatically optimise code for a particular machine and architecture. It is very delicate, and once you have built the library as a developer you cannot expect the same performance if you package the library with a program designed for one operating system and many different (but related) machines: a program for OS X 10.4 built on a ppc7450 (G4) will run differently on a G3 or G5. My own problem involves the different variations for Altivec and SSE on different PPC and Intel/AMD processors, especially if I include any assembler code but also if the library is compiled by a different compiler or compiler version. I should probably make precision the highest priority, portability second and speed third.
One thing that seems missing in many packages is the possibility to experiment with the tuning parameters more dynamically. For example, GMP includes a tuning program that is able to suggest values for certain parameters. But these parameters then have to be compiled into the package for final use. Not particularly elegant and something which, surely, could be changed, without significant overhead.
GMP is unfortunately an open-source project rather than an academic or commercial one. Some developer would have to spend the time to do the tedious work of weighing the different parameter combinations so the tuning program could determine and input the best performance choices and report back.
The conclusion, for me at least, is this: If you are a naive user of Integers, you are much better off with a completely simple, no-special-cases, solid and working, multiprecision package. Which might even be fairly portable.
For heavy users (such as me) with special requirements, it seems that they can never get the ultimate performance without having considerable insight into the details of the matter anyway. And, hence, they need to do some complex work themselves to get the desired performance.
A hybrid solution that gave basic Integer functionality but allowed for variations in a special library would be better. Integer may be a Primitive but it should really be a separate Haskell library. Claus Reinke had a good comment about not automatically importing Integer when a user doesn't need it.
For Haskell, this would mean ensuring that the interface between GHC-compiled code and some multiprecision package was reasonably comprehensible and allowed the multiprecision support to be replaced. And here I try to avoid the usual dream of "easy replacement", that seems entirely unrealistic. But something reasonable that the rest of GHC (and not least: The GHC developers) can live with without too much work. And that heavy users, with sufficient insight and probably a significant dose of hard work (such as that you have put into this) would be able to use for their own purposes.
GMP *would* have been easy to replace but, like so much else in GHC, it was hacked in and then hacked in again. No offense to those who have spent more than a decade working on GHC--there is still working code from the early 90's, at least in the Haskell libraries--but I guess it suffers from how fun and easy it is to write in Haskell and the comparative drudgery of cleaning things up. Modularity is key. I really have to stop playing around and post a prototype so people (like you) can have something to work with and criticise :) Cheers, Pete

I think the benchmarks are flawed in an important way. I believe (but am not positive) that ARPREC uses a special factorized form of representing numbers, which makes multiplicative-type operations extremely fast, but simple things like addition/subtraction quite slow. You are only benchmarking multiplicative or similar routines, giving ARPREC a huge lead, when in practice it might end up being slowest, as addition/subtraction are far more common than multiplication. Also, how are you choosing the numbers to test with? It is possible that some packages are using 'sparse' representations or other specialized forms if all your numbers happen to be powers of two or something. Also, pretty much all uses of integers will be for very small integers; we should be careful not to lose sight of speed for the common case due to pretty asymptotic bounds. I mean, it would be great to be proven wrong and find out ARPREC was great across the board. :) John -- John Meacham - ⑆repetae.net⑆john⑈

On 2006-08-10, Peter Tanski
Summary: I finally settled on modifying OpenSSL, since that would be the easiest to work with under GHC's hood (plain C code, not C++). I have to:
Have you carefully investigated the OpenSSL license? We in Debian have had repeated problems since the OpenSSL license is, as written, incompatible with the GPL (even linking to OpenSSL is incompatible with the GPL). I would hate to have a situation where all GHC-compiled programs can't be under the GPL. -- John
participants (11): Alec Berryman, Bulat Ziganshin, John Goerzen, John Meacham, Peter Tanski, Reilly Hayes, Simon Marlow, Simon Marlow, Simon Peyton-Jones, skaller, Thorkil Naur