Re: FFI Bindings to Libraries using GMP

18 Sep 2007


      On Sep 14, 2007, at 9:14 AM, Benedikt Huber wrote:
...
...
| I've been struggling using FFI bindings to libraries which rely  
on the
| GNU Mp Bignum library (gmp).
It's an issue that bites very few users, but it bites them hard.   
It's also tricky, but not impossible, to fix.  The combination  
keeps meaning that at GHC HQ we work on things that affect more  
people. I doubt we can spare effort to design and implement a fix  
in the near future -- we keep hoping someone else step up and  
tackle it!
Peter Tanski did exactly that (he's the author of the  
ReplacingGMPNotes above), but he's been very quiet recently.   I  
don't know where he is up to.  Perhaps someone else would like to  
join in?
I apologise for being away.  The company I work for has been ramping  
up for a launch and I have been working very long hours (nights and  
weekends, too).
...
Thank you for the information - I'm also willing to help, though  
I'm not too familiar with the GHC internals (yet).
I do like the idea of optionally linking with a pure-haskell  
library, but I'm interested in a solution with comparable  
performance. Commenting solutions to ticket #311:
It goes beyond mere familiarity with the internals: the GMP functions  
are threaded throughout the RTS and the PrimOps files.  Of all the  
primitive operations, they are the most ubiquitous for interfering in  
other code.  The rough list I put on the ReplacingGMP page is a start  
but the more I work with the RTS the more little things keep turning  
up.  At the least I should update the page.
...
(2) Using the standard allocation functions for the gmp memory  
managment (maybe as compile flag) as suggested in http:// 
www.haskell.org/pipermail/glasgow-haskell-users/2006-July/ 
010660.html would also resolve ticket #311.
In this case at least the dynamic part of gmp integers has to be  
resized using external allocation functions, and a finalizer  
(mpz_clear) has to be called when an Integer is garbage collected.
It seems that the performance loss by using malloc is significant  
[1], as lots of allocations and reallocations of very small chunks  
occur in a functional setting; some kind of (non garbage  
collected !) memory pool allocation would certainly help. I'm not  
sure what overhead is associated with calling a finalizer ?
The problem of lots of small allocations affects the garbage  
collector, as well.  In the current implementation, each GMP  
operation calls doYouWantToGC()--I'm sure you have seen the note in  
PrimOps.cmm, for example--which may act as a stop-world garbage  
collection.  The byte arrays for GMP are also pinned.  Compared to  
this, a FFI implementation using finalizers, which have horrible but  
practical guarantees on when they are called, would work much  
better.  The best solution would be to revamp the way Integer types  
are implemented, so when possible they are mutable under the hood,  
much like using the binary += instead of the ternary +.  Enumerations  
like the test in [1], below, would not be mutable unless there were  
some information such as a good consumer function that indicated the  
intermediate values were only temporarily necessary.
...
(3) So when replacing GMP with the BN library of OpenSSL (see  
http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes/ 
PerformanceMeasurements), it would propably be neccessary to  
refactor the library, so custom allocation can be used as well.  
This does not seem too difficult at a first glance though.
The OpenSSL library is not GPL compatible, so there would be  
licensing problems for GPL'd system distributions; it is also  
relatively slow, though it does have a noticeably constant curve for  
exponential functions.  The one problem you will find with _all_  
potential replacement libraries is incompatible behaviour for bitwise  
functions: they are implemented arithmetically in GMP but logically  
elsewhere (when they are implemented at all).  (Note: if you are  
looking for the left-shift and right-shift operations in GMP, they  
are hidden in mpz_mul_2exp and mpz_t_div_q_2exp.)  LibTomMath, for  
example uses pure logical shifts which do not produce correct  
results.  I could go on about many other small differences but the  
end result is that you would have to do a lot of hacking to get a  
library that would replace all the functionality GMP provides.  That  
is why I started a replacement from scratch.
...
So I'd like to investigate the second or third option, as far as my  
knowledge and time permits it.
Of course it would be wise to check first if Peter Tanski is  
already/still working on a GMP replacement.
I left off working on it for some time, but things may slow down a  
little for now so I will (hopefully) have time to package it up.  I  
meant to do that more than a month ago for Thorkil, who has written a  
multi-precision integer library before and wanted to help.
...
[1]
Simple Performance Test on (ghc-darwin-i386-6.6.1):
...
test k = (iterateT k (fromIntegral (maxBound ::Int))) :: Integer  
where
  iterateT 0 v = v; iterateT k v = v `seq` iterateT (k-1) (v+10000)
The haskell function (k was taken as 10M)
triggers around k allocations and k reallocations by the gmp library.
The rough C equivalent, calling sequences of
...
malloc(3), mpz_init_set(gmp), mpz_add_ui(gmp), mpz_clear(gmp) and  
free(3),
takes more than 2 times as long, with 25% of the time spend in  
allocating and freeing pointers to gmp integers (mpz_ptr) and 50%   
of the time spend in gmp allocator functions (i.e. resizing gmp  
integers = (re)allocating limbs).
Malloc is fast but not nearly as fast as the RTS alloc functions; one  
thing I have not started is integrating the replacement library with  
GHC, mostly because the replacement library (on par or faster than  
GMP) uses SIMD functions whenever possible and they require proper  
alignment.
...
I also performed the test  with the datatype suggested by John  
Meacham  (using a gmp library with renamed symbols),
...
data FInteger = FInteger Int# (!ForeignPtr Mpz)
but it was around 8x slower, maybe due to the ForeignPtr and FFI  
overhead, or due to missing optimizations in the code.
That is quite an interesting result.  Are these "safe" foreign imports?

Cheers,
Pete