
I'm trying to do a calculation for Gauss' circle problem, which counts the
integer lattice points with distance to the origin <= r. It's sequence
A000328 on the AT&T integer sequence database. I can't figure out a way to
do it quickly in Haskell for r around 10^9.
Here's my attempt, which takes about 75s for r=10^8.
circ2 r = (1+4*r) + 4 * (circ2' (rs+1) r 1 0)
where
rs = r^2
-- circ2' :: Int64 -> Int64 -> Int64 -> Int64 -> Int64
circ2' rad x y sum
| x

On Wed, 2009-01-28 at 16:42 -0800, drblanco wrote:
I do already have the number I wanted, but was wondering how this could be made faster, or even why it's so slow. This is all on GHC 6.8.3 under OS X Intel, using ghc -O2.
I'm not exactly sure what's different, but for me it works pretty well. I put back in the Int64 type signature.
For comparison, the C code below runs in <1 second.
You've got a faster machine than me :-) I compiled both the Haskell and C versions to standalone executables with ghc/gcc -O2 and ran them with time. C version: $ time ./circ 3141592649589764829 real 0m2.430s user 0m2.428s sys 0m0.000s Haskell version: time ./circ2 3141592653589764829 real 0m2.753s user 0m2.756s sys 0m0.000s Not too bad I'd say! :-) I was using ghc-6.10 for this test. It would appear that ghc-6.8 is a bit slower, I get: 3141592653589764829 real 0m5.767s user 0m5.768s sys 0m0.000s Now the other difference is that I'm using a 64bit machine so perhaps ghc just produces terrible code for Int64 on 32bit machines. Duncan

Did you print it? I'm using same code with ghc --make -O2 and it
takes forever to finish.
On Wed, Jan 28, 2009 at 8:06 PM, Duncan Coutts
On Wed, 2009-01-28 at 16:42 -0800, drblanco wrote:
I do already have the number I wanted, but was wondering how this could be made faster, or even why it's so slow. This is all on GHC 6.8.3 under OS X Intel, using ghc -O2.
I'm not exactly sure what's different, but for me it works pretty well. I put back in the Int64 type signature.
For comparison, the C code below runs in <1 second.
You've got a faster machine than me :-)
I compiled both the Haskell and C versions to standalone executables with ghc/gcc -O2 and ran them with time.
C version: $ time ./circ 3141592649589764829
real 0m2.430s user 0m2.428s sys 0m0.000s
Haskell version: time ./circ2 3141592653589764829
real 0m2.753s user 0m2.756s sys 0m0.000s
Not too bad I'd say! :-)
I was using ghc-6.10 for this test. It would appear that ghc-6.8 is a bit slower, I get:
3141592653589764829
real 0m5.767s user 0m5.768s sys 0m0.000s
Now the other difference is that I'm using a 64bit machine so perhaps ghc just produces terrible code for Int64 on 32bit machines.
Duncan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Wed, 2009-01-28 at 20:11 -0500, sam lee wrote:
Did you print it? I'm using same code with ghc --make -O2 and it takes forever to finish.
Yes, you can see in the output that it prints the same answer in each case. I was using r = 10^9 as you suggested.
C version: $ time ./circ 3141592649589764829
Haskell version: time ./circ2 3141592653589764829
Duncan

Duncan, I think you must have some magics -- on my machine the
original code also takes forever.
Running with +RTS -S indicates it's allocating several gig of memory
or more.
Applying some bang patterns gives me ~8s for 10^8 and somewhat more
than a minute for 10^9:
{-# LANGUAGE BangPatterns #-}
module Main where
import Data.Int
main = putStrLn $ show $ circ2 (10^8)
circ2 :: Int64 -> Int64
circ2 r = ((1+4*r) + 4 * (go (rs+1) r 1 0))
where
rs = r^2
go :: Int64 -> Int64 -> Int64 -> Int64 -> Int64
go !rad !x !y !sum
| x < y = sum
| rad <= rs = go (rad+1+2*y) x (y+1) (sum+1+2*(x-y))
| otherwise = go (rad+1-2*x) (x-1) y sum
10^8:
rmm@Hugo:~$ time ./circ-bangpatterns +RTS -t
./circ-bangpatterns +RTS -t
31415926535867961
<
On Wed, 2009-01-28 at 16:42 -0800, drblanco wrote:
I do already have the number I wanted, but was wondering how this could be made faster, or even why it's so slow. This is all on GHC 6.8.3 under OS X Intel, using ghc -O2.
I'm not exactly sure what's different, but for me it works pretty well. I put back in the Int64 type signature.
For comparison, the C code below runs in <1 second.
You've got a faster machine than me :-)
I compiled both the Haskell and C versions to standalone executables with ghc/gcc -O2 and ran them with time.
C version: $ time ./circ 3141592649589764829
real 0m2.430s user 0m2.428s sys 0m0.000s
Haskell version: time ./circ2 3141592653589764829
real 0m2.753s user 0m2.756s sys 0m0.000s
Not too bad I'd say! :-)
I was using ghc-6.10 for this test. It would appear that ghc-6.8 is a bit slower, I get:
3141592653589764829
real 0m5.767s user 0m5.768s sys 0m0.000s
Now the other difference is that I'm using a 64bit machine so perhaps ghc just produces terrible code for Int64 on 32bit machines.
Duncan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Ross Mellgren schrieb:
Duncan, I think you must have some magics -- on my machine the original code also takes forever. Running with +RTS -S indicates it's allocating several gig of memory or more.
Applying some bang patterns gives me ~8s for 10^8 and somewhat more than a minute for 10^9: Hi, same here, with bang patterns ~100s / 1Mb but
The C program is quite fast:
rmm@Hugo:~$ time ./circ-orig 1302219321 looks wrong to me
real 0m1.073s
cafe(0) $ time ./gauss 3141592649589764829 real 0m17.894s So my 32-bit machine is really slow ... benedikt
On Wed, 2009-01-28 at 16:42 -0800, drblanco wrote:
I do already have the number I wanted, but was wondering how this could be made faster, or even why it's so slow. This is all on GHC 6.8.3 under OS X Intel, using ghc -O2.
I'm not exactly sure what's different, but for me it works pretty well. I put back in the Int64 type signature.
For comparison, the C code below runs in <1 second.
You've got a faster machine than me :-)
I compiled both the Haskell and C versions to standalone executables with ghc/gcc -O2 and ran them with time.
C version: $ time ./circ 3141592649589764829
real 0m2.430s user 0m2.428s sys 0m0.000s
Haskell version: time ./circ2 3141592653589764829
real 0m2.753s user 0m2.756s sys 0m0.000s
Not too bad I'd say! :-)
I was using ghc-6.10 for this test. It would appear that ghc-6.8 is a bit slower, I get:
3141592653589764829
real 0m5.767s user 0m5.768s sys 0m0.000s
Now the other difference is that I'm using a 64bit machine so perhaps ghc just produces terrible code for Int64 on 32bit machines.
Duncan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

Yeah, you know after sending the email (never a better time) I noticed that the C version wasn't spitting out the right answer. I'm not really sure why, I just replaced bigint with int64_t from stdint.h. -Ross On Jan 28, 2009, at 8:32 PM, Benedikt Huber wrote:
Ross Mellgren schrieb:
Duncan, I think you must have some magics -- on my machine the original code also takes forever. Running with +RTS -S indicates it's allocating several gig of memory or more. Applying some bang patterns gives me ~8s for 10^8 and somewhat more than a minute for 10^9: Hi, same here, with bang patterns ~100s / 1Mb but
The C program is quite fast: rmm@Hugo:~$ time ./circ-orig 1302219321 looks wrong to me real 0m1.073s
cafe(0) $ time ./gauss 3141592649589764829 real 0m17.894s
So my 32-bit machine is really slow ...
benedikt
On Wed, 2009-01-28 at 16:42 -0800, drblanco wrote:
I do already have the number I wanted, but was wondering how this could be made faster, or even why it's so slow. This is all on GHC 6.8.3 under OS X Intel, using ghc -O2.
I'm not exactly sure what's different, but for me it works pretty well. I put back in the Int64 type signature.
For comparison, the C code below runs in <1 second.
You've got a faster machine than me :-)
I compiled both the Haskell and C versions to standalone executables with ghc/gcc -O2 and ran them with time.
C version: $ time ./circ 3141592649589764829
real 0m2.430s user 0m2.428s sys 0m0.000s
Haskell version: time ./circ2 3141592653589764829
real 0m2.753s user 0m2.756s sys 0m0.000s
Not too bad I'd say! :-)
I was using ghc-6.10 for this test. It would appear that ghc-6.8 is a bit slower, I get:
3141592653589764829
real 0m5.767s user 0m5.768s sys 0m0.000s
Now the other difference is that I'm using a 64bit machine so perhaps ghc just produces terrible code for Int64 on 32bit machines.
Duncan
_______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe

On Wed, 2009-01-28 at 20:23 -0500, Ross Mellgren wrote:
Duncan, I think you must have some magics -- on my machine the original code also takes forever. Running with +RTS -S indicates it's allocating several gig of memory or more.
It runs in a tiny heap for me: ./circ2 +RTS -A10k -M20k -RTS 3141592653589764829 +RTS -S indicates it allocated 8,512 bytes.
Applying some bang patterns gives me ~8s for 10^8 and somewhat more than a minute for 10^9:
I wonder if it is the 64 vs 32 bit machine difference. I recall some ghc ticket about Int64 having rather poor performance on 32bit x86 machines. I don't have a 32bit x86 machine to hand so I cannot easily test it. Duncan

Thanks for the help. It's clear in retrospect that it was being too
lazy, but not why changing to Int64 did it. The bang patterns made
the difference between runnable and not.
GHC 6.10 didn't make much of a difference, but there's no 64-bit build
for the Mac. If this seems to come up again I'll have to set up
64-bit Linux in a virtual machine, which seems a bit convoluted.
David
On Wed, Jan 28, 2009 at 7:37 PM, Duncan Coutts
On Wed, 2009-01-28 at 20:23 -0500, Ross Mellgren wrote:
Duncan, I think you must have some magics -- on my machine the original code also takes forever. Running with +RTS -S indicates it's allocating several gig of memory or more.
It runs in a tiny heap for me:
./circ2 +RTS -A10k -M20k -RTS 3141592653589764829
+RTS -S indicates it allocated 8,512 bytes.
Applying some bang patterns gives me ~8s for 10^8 and somewhat more than a minute for 10^9:
I wonder if it is the 64 vs 32 bit machine difference. I recall some ghc ticket about Int64 having rather poor performance on 32bit x86 machines. I don't have a 32bit x86 machine to hand so I cannot easily test it.
Duncan

Ross Mellgren wrote:
Duncan, I think you must have some magics -- on my machine the original code also takes forever. Running with +RTS -S indicates it's allocating several gig of memory or more.
Applying some bang patterns gives me ~8s for 10^8 and somewhat more than a minute for 10^9
It works great for me. 64 bit, GHC 6.10.1, no bang patterns or other magic. Works about the same with both Int and Int64. % time ./ctest 3141592649589764829 real 0m2.614s user 0m2.610s sys 0m0.003s % time ./hstest 3141592653589764829 real 0m3.878s user 0m3.870s sys 0m0.003s % ./hstest +RTS -S ./hstest +RTS -S Alloc Copied Live GC GC TOT TOT Page Flts bytes bytes bytes user elap user elap 3141592653589764829 8512 688 17136 0.00 0.00 3.94 3.94 0 0 (Gen: 1) 0 0.00 0.00 8,512 bytes allocated in the heap 688 bytes copied during GC 17,136 bytes maximum residency (1 sample(s)) 19,728 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation) Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed INIT time 0.00s ( 0.00s elapsed) MUT time 3.94s ( 3.94s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.94s ( 3.94s elapsed) %GC time 0.0% (0.0% elapsed) Alloc rate 2,158 bytes per MUT second Productivity 99.9% of total user, 100.0% of total elapsed

Apparently 64-bit GHC is sufficiently advanced to be indistinguishable from magic. Now, if only there was a 64-bit binary for Mac OS X :-/ -Ross On Jan 28, 2009, at 9:06 PM, Jake McArthur wrote:
Ross Mellgren wrote:
Duncan, I think you must have some magics -- on my machine the original code also takes forever. Running with +RTS -S indicates it's allocating several gig of memory or more. Applying some bang patterns gives me ~8s for 10^8 and somewhat more than a minute for 10^9
It works great for me. 64 bit, GHC 6.10.1, no bang patterns or other magic. Works about the same with both Int and Int64.
% time ./ctest 3141592649589764829
real 0m2.614s user 0m2.610s sys 0m0.003s
% time ./hstest 3141592653589764829
real 0m3.878s user 0m3.870s sys 0m0.003s
% ./hstest +RTS -S ./hstest +RTS -S Alloc Copied Live GC GC TOT TOT Page Flts bytes bytes bytes user elap user elap 3141592653589764829 8512 688 17136 0.00 0.00 3.94 3.94 0 0 (Gen: 1) 0 0.00 0.00
8,512 bytes allocated in the heap 688 bytes copied during GC 17,136 bytes maximum residency (1 sample(s)) 19,728 bytes maximum slop 1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 0 collections, 0 parallel, 0.00s, 0.00s elapsed Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed) MUT time 3.94s ( 3.94s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 3.94s ( 3.94s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 2,158 bytes per MUT second
Productivity 99.9% of total user, 100.0% of total elapsed

On Wed, Jan 28, 2009 at 04:42:49PM -0800, drblanco wrote:
Here's my attempt, which takes about 75s for r=10^8.
circ2 r = (1+4*r) + 4 * (circ2' (rs+1) r 1 0) where rs = r^2 circ2' rad x y sum | x
For comparison, the C code below runs in <1 second.
typedef unsigned long long bigint;
bigint gausscount(bigint r) { bigint sum=0; bigint x, y; bigint rs=r*r; bigint rad=rs+1; x=r; y=1;
while (y
rs) { rad-=2*x; rad++; x--; } sum+=1+2*(x-y); rad+=2*y+1; y++; } sum*=4; sum++;
return sum; }
for the curious, here is the C output of jhc when fed circ2 annotated with the obvious bang-patterns (the strictness analyzer still needs some help): other than giving every intermediate value a name, and only performing one operation a line (which doesn't bother gcc at all), it is pretty darn close to the hand optimized C code... variable guide: x -> v86 y -> v88 rs -> v4146 rad -> v84 and sum is accumulated in v90
static uint64_t circ2(uint64_t v12) { uint64_t v212; uint64_t v4146 = (v12 * v12); uint64_t v4758 = (4 * v12); uint64_t v4786 = (1 + v4758); uint64_t v4160 = (1 + v4146); uint64_t v84 = v4160; uint64_t v86 = v12; uint64_t v88 = 1; uint64_t v90 = 0; fW_a__fMain__3_ucirc2_t_u2:; { uint16_t v100000 = (((int64_t)v86) < ((int64_t)v88)); if (0 == v100000) { uint16_t v100002 = (((int64_t)v84) <= ((int64_t)v4146)); if (0 == v100002) { uint64_t v4338 = (1 + v84); uint64_t v4352 = (2 * v86); uint64_t v4366 = (v4338 - v4352); uint64_t v4380 = (v86 - 1); v84 = v4366; v86 = v4380; v88 = v88; v90 = v90; goto fW_a__fMain__3_ucirc2_t_u2; } else { /* 1 */ uint64_t v4226 = (1 + v90); uint64_t v4772 = (v86 - v88); uint64_t v4282 = (2 * v4772); uint64_t v4296 = (v4226 + v4282); uint64_t v4310 = (1 + v88); uint64_t v4240 = (1 + v84); uint64_t v4254 = (2 * v88); uint64_t v4324 = (v4240 + v4254); v84 = v4324; v86 = v86; v88 = v4310; v90 = v4296; goto fW_a__fMain__3_ucirc2_t_u2; } } else { /* 1 */ v212 = v90; } } uint64_t v4174 = (4 * v212); return v4786 + v4174; }
John -- John Meacham - ⑆repetae.net⑆john⑈
participants (8)
-
Benedikt Huber
-
David White
-
drblanco
-
Duncan Coutts
-
Jake McArthur
-
John Meacham
-
Ross Mellgren
-
sam lee