
Hello Thorkil,

I am very sorry for the late reply. I have been extremely busy and I wanted to give you a coherent answer.

For a brief overview of the speed of the libraries I looked at carefully, see http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes (I added a few charts to show the speed of ARPREC, OpenSSL's BN, GMP and LibTomMath. I did not add speed tests for Crypto++ and Botan because they don't measure up. The original timings I obtained for them were based on their own benchmarks, which are inadequate and (for Crypto++) based on tuned assembly code only available on Pentium 4 processors with SSE.) I tested GMP and OpenSSL's BN for reference.

Over the past few weeks I tore Crypto++ apart and modified a few things, only to find out that it has a persistent bug: woven throughout the library is a conversion from 32-bit to 64-bit ints using unions. This kind of transformation breaks C's (and C++'s) aliasing rules (thanks to John Skaller for pointing out the problem), so compiling Crypto++ with optimisations turned on (g++ -O3) introduces failures, especially in the division algorithms. I could change the unions to bitwise transformations with masks, but I would have to really dig out all the references, and after some more rigorous timing tests I found that I would have to essentially rewrite the algorithms anyway. What a mess.

After some more research I found that there really are no other good public domain or BSD-compatible licensed libraries available. I tested two other free arbitrary precision integer libraries, MPI and MAPM, but they were also too slow, sometimes as much as 4 times slower. MAPM uses a Fast Fourier Transform (FFT) from Takuya Ooura (see http://momonga.t.u-tokyo.ac.jp/~ooura/fft.html) and should have been fast, but it turned out to be even slower than MPI.

If you look at the ReplacingGMPNotes page I mentioned at the top of this email, the charts show that LibTomMath is weak in multiplication: at larger integer sizes (2048-4096 bits) it is half as fast as GMP, or worse. ARPREC, on the other hand, which also uses an FFT algorithm, is slow at lower precisions (256-512 bits) for two reasons: (1) at relatively low precisions, ARPREC defaults to "faster" standard algorithms instead of its FFT, and (2) when it does use its fast FFT at medium levels of precision (512 bits), the FFT is too ponderous to keep up with the relatively lighter and faster algorithms of the int-based libraries (GMP, OpenSSL's BN and LibTomMath). (As a little history, ARPREC used to be fairly bad relative to other FFT programs available, but it was redone after 2003, so by 2004 it was fairly good; it is up to version 2.1.94 now and is much better. If you are looking at ARPREC benchmarks online from before 2003, they are too old to be good indicators of its present capability.)

I keep talking about ARPREC--why? For three reasons:

(1) I trust its level of precision--this has been tested; see FFTW's benchmark page for accuracy: http://www.fftw.org/accuracy/G4-1.06GHz-gcc4/

(2) if you look at the charts, although ARPREC is bad in division it simply blows GMP, OpenSSL and LibTomMath away: at 4096 bits (85 doubles--each double conservatively holds only about 48.3 bits of integer precision), ARPREC can take a full random number to pow(n,7) in 0.98 seconds, compared to roughly 77.61 seconds for the leader of the integer-based libraries, GMP. (I ran the tests many times to make sure the readings weren't flukes.)
(3) of all the FFT-based arbitrary precision libraries available, ARPREC is the only BSD-licensed one--Takuya Ooura's library (used in MAPM) is only an FFT algorithm and not necessarily either fast or accurate. The rest of the FFT libraries available essentially carry the same kind of license as GMP.

So I am in an unenviable position: I intend to fulfill my promise and get a replacement for GHC (and maybe more), but I have to essentially build better functionality into the front-runner, ARPREC. At present I have been working with vector-based algorithms that would let me use hardware-tuned code for Single Instruction Multiple Data (SIMD) chipsets, and I am researching algorithms based on operating-system-supplied vector libraries. Part of this modification involves a fast transformation between a vector of large integers and an array of doubles, without loss of precision (although vectors of doubles are also working well, they do not have the same library support). I am also working on enhancing ARPREC's division algorithm.

This is the problem I face: GHC unfortunately does not treat Integer as a purely mathematical type but as a primitive type, complete with bitwise operations. In my experience, GHC users typically use Integers at lower precisions (typically under 256 bits) and do not use Integer for higher math. (Integer math operations are basic primitives, as you already know.) I would like to build a library that is all-around fast, both under 256 bits and beyond 4096 bits (good FFT libraries operate over 100,000 bits).
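To make that transformation concrete, here is a minimal Haskell sketch of the packing direction--splitting an Integer into 48-bit limbs, each of which is exactly representable in a Double. The limb width and the names are only illustrative; they are not ARPREC's actual interface.

    import Data.Bits (shiftL, shiftR, (.&.))

    -- Bits of integer value stored per double; 48 fits well inside the 53-bit
    -- significand of an IEEE double, so every limb is represented exactly.
    limbBits :: Int
    limbBits = 48

    -- Split a non-negative Integer into limbs of 'limbBits' bits, least
    -- significant limb first, each limb stored exactly in a Double.
    toDoubleLimbs :: Integer -> [Double]
    toDoubleLimbs 0 = [0]
    toDoubleLimbs n = go n
      where
        mask = (1 `shiftL` limbBits) - 1 :: Integer
        go 0 = []
        go m = fromIntegral (m .&. mask) : go (m `shiftR` limbBits)

The other direction--an array of doubles back to a precise integer--is sketched further below, where that conversion comes up.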
On Aug 24, 2006, at 2:39 PM, Thorkil Naur wrote:

I am truly unable to tell what I would consider the ideal situation. On the one hand, I value greatly the freedom of choosing my circumstances without restraints. And I appreciate the desire of noble people like the GHC developers to be equally free of such restraints. This would point in the direction of basing Integer computations on a fairly generic implementation, which, insightful consideration would then force us to admit, would probably cost a factor of, say, 2 or more in performance in some cases.
The problem I face is: performance at which level? The low-bit-size (low precision) range or the high precision range?
Two problems seem to be lurking here: The first is that, although taking a copy of some existing library and massaging it to work under some presently available circumstances may be fine, what is really needed is something that keeps on working, even under changing and future circumstances. The second is that if I wish to mix Haskell and non-Haskell code that processes the same data, I may find myself in need of a way to convert between whatever is required by this copy of some existing library that has been created and the presently available, but potentially quite different, version of that same library.
Absolutely correct. As you may have observed from your own experience, there are advantages to having GHC's runtime system (RTS) handle the operations, either because garbage-collected memory is faster to allocate or because the natural laziness between normally eager mathematical operations (mathematical operators are strict in their arguments, but Haskell functions wrapping those operators may be lazy) means you can use GHC to strategically cache the results of higher-level polynomial operations. What I am working on is intended to be an independent library, designed to be stable for the long term and interoperable with non-Haskell programs. Modifying ARPREC will result in a dynamic library (or DLL) that outputs simple arrays of doubles--this is how ARPREC currently functions; there is no need to wrap a C++ class in an opaque C pointer you cannot operate on independently. The huge downside to ARPREC, and one of the least trivial of my problems, is how to efficiently convert an array of doubles to precise integers, especially for cryptographic purposes.
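As a sketch of that conversion (with an illustrative 48-bit limb size and made-up names--this is not ARPREC's actual interface), rebuilding a precise Integer from an array of doubles could look like this in Haskell:

    import Data.Bits (shiftL, (.|.))
    import Data.List (foldl')

    -- Rebuild a non-negative Integer from 48-bit limbs stored exactly in
    -- Doubles, least significant limb first. Each Double is assumed to hold
    -- an exact integer in [0, 2^48), so 'truncate' loses nothing.
    fromDoubleLimbs :: [Double] -> Integer
    fromDoubleLimbs = foldl' step 0 . reverse
      where
        step acc d = (acc `shiftL` 48) .|. (truncate d :: Integer)

This limb-at-a-time version only shows the idea; the efficiency problem above is precisely about avoiding this kind of element-by-element walk at cryptographic sizes.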
I had better be specific here: What I do is integer factorization (splitting integers into their prime factors). And, for example, I have implemented a version of the ECM (Elliptic Curve Method) in Haskell. Having done that (independently and mainly to learn what it was about), I eventually compared it in performance to GMP-ECM, which Paul Zimmermann wrote. And I was initially disappointed to find that GMP-ECM was about 7 times faster than my ECM code. Fortunately, various sessions of "hand-waving" brought this factor down to about 1.5. So, whenever GMP-ECM used 2 minutes, my program would use 3 minutes.
One of the reasons I am working on a library written in C++ is that Haskell may be too slow for such things. GMP is also tuned code: if you look at the GMP source code, they have incorporated processor-specific assembly code in some places. On the other hand, GMP still has some rough corners--among the functions I have looked at carefully, an essentially unused function may be based on a header file that reads "not yet implemented."
And again: I am not done working with this, just taking an extended break. Please indicate if you wish to see some detailed and hard results.
My best question is: are you using pure mathematical formulas for this? GMP, LibTomMath and of course OpenSSL's BN--the pure integer-based libraries--mostly use modular arithmetic algorithms, Karatsuba multiplication, and Montgomery or Barrett reduction at the polynomial level. At the low level they use bitwise operations (say, shift-and-add for multiplication--the gcc compiler seems to use this as the default). If you are using pure mathematical functions, ARPREC is fine (at least it is precise). Mathematically, there are alternative algorithms for floating-point numbers that are not available for integers. Floating-point numbers may use division by inverse: converting division into a multiplication by taking the inverse of the divisor (1/d) at low precision and then using only addition, multiplication and subtraction until you reach the desired precision. Integer-based numbers may use Montgomery reduction (which you are no doubt already familiar with).
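To illustrate the division-by-inverse idea (a minimal sketch only--the names are mine and this is not ARPREC's code; it assumes n >= 0 and d > 0), here is a Haskell version that seeds a low-precision inverse with one small division and then refines it using only multiplication, subtraction and shifts, roughly doubling the number of correct bits at each step; the final quotient is then corrected by at most a few units:

    import Data.Bits (shiftL, shiftR)

    -- Bit length of a non-negative Integer.
    bitLen :: Integer -> Int
    bitLen = length . takeWhile (> 0) . iterate (`shiftR` 1)

    -- Scaled reciprocal: an Integer x with x ~ 2^k / d.  The seed x0 is a
    -- genuine low-precision inverse (one small division); each Newton step
    -- x' = x * (2^(k+1) - d*x) / 2^k uses only multiplication, subtraction
    -- and shifts, and roughly doubles the number of correct bits.
    scaledRecip :: Int -> Integer -> Integer
    scaledRecip k d = iterate refine x0 !! steps
      where
        topBits  = 16
        s        = max 0 (bitLen d - topBits)
        x0       = (1 `shiftL` (k - s)) `div` (d `shiftR` s)
        refine x = (x * ((1 `shiftL` (k + 1)) - d * x)) `shiftR` k
        steps    = max 1 (1 + ceiling (logBase 2
                     (fromIntegral k / fromIntegral topBits :: Double)))

    -- Division by inverse: estimate n/d with the scaled reciprocal, then
    -- nudge the estimate into place (it is off by at most a few units).
    divByInverse :: Integer -> Integer -> Integer
    divByInverse n d = fixup ((n * scaledRecip k d) `shiftR` k)
      where
        k = 2 * max (bitLen n) (bitLen d)
        fixup q | q * d > n        = fixup (q - 1)
                | (q + 1) * d <= n = fixup (q + 1)
                | otherwise        = q

This is the floating-point-flavoured trick expressed over Integers; the integer libraries get the same effect by avoiding divisions with Montgomery or Barrett reduction instead.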
Some additional experience is this: To gain some potentially easier-to-compare measurements between Haskell-with-GMP and C-with-GMP, I implemented the GMP function mpz_powm (raising a number to a power modulo a third number) as a primitive operation in GHC. And compared it to my own power-raising operation that was implemented in Haskell.
Overall, the GMP implementation of the power operation was 2 times faster than my Haskell implementation. But, to be sure, it was using some techniques that my code certainly didn't use.
An interesting thing happened, however, for very large numbers: the GMP power operation started using a lot of memory, thrashing, and then, eventually, with sufficiently large operands, crashed for lack of memory, whereas my Haskell-implemented version continued to provide usable results.
The explanation of this should undoubtedly be found in the method (some FFT thing) used by GMP for the power operation on very large numbers: my guess is that GMP, when using this method, issues a large number of allocate/free requests. And, again I am guessing: assuming that each allocate request causes the GHC runtime to allocate, whereas each free only causes a "we'll handle this at the next GC", then such large computations may very well display such behaviour.
Agreed. This is one of those cases where I think Haskell's laziness may have cached your intermediate results; I couldn't tell without looking at the source code. I do not think GMP uses an FFT for multiplication, but I have not investigated this extensively.
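To show what I mean by laziness caching an intermediate result (a made-up sketch, not your actual ECM code):

    -- Under GHC's call-by-need evaluation, a let-bound value is computed at
    -- most once. Both uses of 'bigPower' below share the same cached Integer
    -- rather than recomputing the exponentiation. (Names are made up purely
    -- for illustration.)
    reuseTwice :: Integer -> Integer -> (Integer, Integer)
    reuseTwice n k =
      let bigPower = n ^ k                         -- a thunk, forced on first use
      in  (bigPower `mod` 97, bigPower `mod` 101)  -- second use reuses the result

-Pete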