Performance of Prime Generator

Hi,

I have bungled the Prime Generator problem on SPOJ many times. Although the program keeps getting faster on my machine, it still gets "time limit exceeded" on SPOJ (6-second limit). My current approach: for every number in the range, check whether it is prime by trial-dividing it by all the primes smaller than or equal to its square root. For the numbers between 999900000 and 1000000000, it takes 0.64 seconds on my laptop. Could anyone give me some hints to enhance it? Thanks.

=== Test Data ===
1
999900000 1000000000
=== Test Data ===

=== Code ===
{-# OPTIONS_GHC -O2 -fno-cse #-}
import System.IO

-- A wheel of circumference n with the surviving residues rs.
data Wheel a = Wheel a [a]

roll :: Integral a => Wheel a -> [a]
roll (Wheel n rs) = [n*k+r | k <- [0..], r <- rs]

nextSize :: Integral a => Wheel a -> a -> Wheel a
nextSize (Wheel n rs) p = Wheel (p*n) [r' | k <- [0..(p-1)], r <- rs, let r' = n*k+r, r' `mod` p /= 0]

mkWheel :: Integral a => [a] -> Wheel a
mkWheel ds = foldl nextSize (Wheel 1 [1]) ds

-- Lazy list of all primes, built from a 2-3-5-7 wheel plus trial division.
primes :: Integral a => [a]
primes = small ++ large
  where
    1:p:candidates = roll $ mkWheel small
    small = [2,3,5,7]
    large = p : filter isPrime candidates
    isPrime n = all (not . divides n) $ takeWhile (\p -> p*p <= n) large

divides :: Integral a => a -> a -> Bool
divides n p = n `mod` p == 0

sqrt' :: Integral a => a -> a
sqrt' = floor . sqrt . fromIntegral

primesFromTo :: [Int] -> [Int]
primesFromTo [x, y] = filter isPrime [x..y]

-- Trial division by the primes up to sqrt n.
isPrime :: Integral a => a -> Bool
isPrime 1 = False
isPrime 2 = True
isPrime n = all (not . divides n) $ takeWhile (<= sqrt' n) primes

main :: IO ()
main = do
    count <- fmap read getLine
    inputLines <- fmap (take count . lines) getContents
    let answers = map (primesFromTo . map read . words) inputLines
    putStr . unlines . map (unlines . map show) $ answers
=== Code ===

Best regards,
Zhi-Qiang Lei
zhiqiang.lei@gmail.com

On Sat, Jan 21, 2012 at 4:27 PM, Zhi-Qiang Lei
Could anyone give me some hints to enhance it? Thanks.
You could use a faster primality test algorithm such as Miller-Rabin [1]. [1] http://en.wikipedia.org/wiki/Miller–Rabin_primality_test
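For concreteness, here is a minimal sketch of such a test in Haskell over Integer (the function names, layout and fixed witness set are illustrative choices, not code from this thread). Writing n-1 = 2^s * d with d odd, n passes a witness a if a^d = 1 or a^(2^r * d) = n-1 (mod n) for some r < s; the first twelve primes as witnesses are reported to decide primality correctly far beyond 10^9, so a fixed set is enough for the SPOJ range.
========================
isPrimeMR :: Integer -> Bool
isPrimeMR n
  | n < 2       = False
  | n `elem` ws = True
  | even n      = False
  | otherwise   = all passes ws
  where
    ws = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]

    -- n - 1 = 2^s * d with d odd
    (s, d) = split 0 (n - 1)
      where split k m | even m    = split (k + 1) (m `div` 2)
                      | otherwise = (k, m)

    passes a = x == 1 || x == n - 1 || any (== n - 1) squarings
      where
        x         = powMod a d n
        squarings = take (s - 1) (tail (iterate (\y -> y * y `mod` n) x))

    -- modular exponentiation by repeated squaring
    powMod b e m = go (b `mod` m) e 1
      where
        go _ 0 acc = acc
        go x k acc
          | odd k     = go (x * x `mod` m) (k `div` 2) (acc * x `mod` m)
          | otherwise = go (x * x `mod` m) (k `div` 2) acc
========================
A fast primality test only replaces the per-number trial division; the range still has to be scanned number by number.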

Thanks, I'll look into it. On Jan 21, 2012, at 4:50 PM, Yucheng Zhang wrote:
On Sat, Jan 21, 2012 at 4:27 PM, Zhi-Qiang Lei
wrote: Could anyone give me some hints to enhance it? Thanks.
You could use a faster primality test algorithm such as Miller-Rabin [1].
[1] http://en.wikipedia.org/wiki/Miller–Rabin_primality_test
Best regards, Zhi-Qiang Lei zhiqiang.lei@gmail.com

Zhi-Qiang Lei
I have bungled the Prime Generator problem on SPOJ many times. Although the program keeps getting faster on my machine, it still gets "time limit exceeded" on SPOJ (6-second limit). My current approach: for every number in the range, check whether it is prime by trial-dividing it by all the primes smaller than or equal to its square root. For the numbers between 999900000 and 1000000000, it takes 0.64 seconds on my laptop. Could anyone give me some hints to enhance it? Thanks.
Although intuitively that sounds like a great idea, in practice it's not. To quickly generate a large number of small primes you should either use a sieve or use the more naive wheel trial division. This is for small numbers up to perhaps 2^18 or 2^20, where this method is feasible and usually fastest. Greets, Ertugrul -- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/

Hi, what do you mean "more naive wheel trial division"? On Jan 21, 2012, at 5:44 PM, Ertugrul Söylemez wrote:
Zhi-Qiang Lei
wrote: I have bungled the Prime Generator problem on SPOJ many times. Although the program keeps getting faster on my machine, it still gets "time limit exceeded" on SPOJ (6-second limit). My current approach: for every number in the range, check whether it is prime by trial-dividing it by all the primes smaller than or equal to its square root. For the numbers between 999900000 and 1000000000, it takes 0.64 seconds on my laptop. Could anyone give me some hints to enhance it? Thanks.
Although intuitively that sounds like a great idea, in practice it's not. To quickly generate a large number of small primes you should either use a sieve or use the more naive wheel trial division. This is for small numbers up to perhaps 2^18 or 2^20, where this method is feasible and usually fastest.
Greets, Ertugrul
-- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/

Zhi-Qiang Lei
Hi, what do you mean "more naive wheel trial division"?
Just use a wheel like 2*3*5 and don't bother only dividing by primes. Trying to divide only by primes takes more time than just living with the few percent nonprimes that slip through the wheel. Or better yet, use a sieve like the Sieve of Eratosthenes or the Sieve of Atkin. Those are fairly easy to implement in Haskell. Greets, Ertugrul -- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/
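As an illustration of that suggestion (a sketch with made-up names, not code from this thread): after 2, 3 and 5 the trial divisors are simply every number coprime to 30, so a few composites such as 49 and 77 slip through as divisors, which costs a handful of useless divisions but no correctness.
========================
wheelCandidates :: [Int]
wheelCandidates =
    2 : 3 : 5 : [30 * k + r | k <- [0 ..], r <- [7, 11, 13, 17, 19, 23, 29, 31]]

isPrimeWheel :: Int -> Bool
isPrimeWheel n
  | n < 2     = False
  | otherwise = all (\d -> n `mod` d /= 0) (takeWhile (\d -> d * d <= n) wheelCandidates)
========================
This is still O(sqrt n) divisions per candidate, so as noted above it only makes sense for smallish limits.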

Well, thanks. So far I have tried a wheel-only approach and a Sieve-of-Eratosthenes-only approach. They both failed: when the numbers are between 999900000 and 1000000000, they can take more than an hour on my laptop. On Jan 21, 2012, at 10:38 PM, Ertugrul Söylemez wrote:
Zhi-Qiang Lei
wrote: Hi, what do you mean "more naive wheel trial division"?
Just use a wheel like 2*3*5 and don't bother only dividing by primes. Trying to divide only by primes takes more time than just living with the few percent nonprimes that slip through the wheel.
Or better yet, use a sieve like the Sieve of Eratosthenes or the Sieve of Atkin. Those are fairly easy to implement in Haskell.
Greets, Ertugrul
-- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/
Best regards, Zhi-Qiang Lei zhiqiang.lei@gmail.com

Zhi-Qiang Lei
Well, thanks. So far I have tried a wheel-only approach and a Sieve-of-Eratosthenes-only approach. They both failed: when the numbers are between 999900000 and 1000000000, they can take more than an hour on my laptop.
Indeed, you're right. The method is too slow for numbers of that scale. I suggest trying the Sieve of Atkin, which performs quadratically faster than the Sieve of Eratosthenes. I haven't tried it, but an equivalent C implementation can easily compute the first 10^9 prime numbers within seconds. Greets, Ertugrul -- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/

On Saturday 21 January 2012, 23:18:23, Ertugrul Söylemez wrote:
Zhi-Qiang Lei
wrote: Well, thanks. So far I have tried a wheel-only approach and a Sieve-of-Eratosthenes-only approach. They both failed: when the numbers are between 999900000 and 1000000000, they can take more than an hour on my laptop.
They shouldn't. But you have to use the right data structures. For a prime sieve, you need unboxed mutable arrays if you want it to be fast. You can use STUArrays from Data.Array.ST or mutable unboxed vectors from Data.Vector.Unboxed.Mutable. In principle you could also do the sieving in C and use the FFI, but afair SPOJ only accepts single source files. A decent sieve should sieve up to 10^9 in less than 30 seconds, a good sieve in less than 10. But that's still too slow; there is a trick you have to use. To find the primes in the range from m to n, you don't need to find all the primes below m. (No more hints, to not completely spoil the problem.)
Indeed, you're right. The method is too slow for numbers of that scale.
No, works fine with the right trick.
I suggest trying the Sieve of Atkin, which performs quadratically faster than the Sieve of Eratosthenes.
No, the complexities differ by a factor of about log (log n) (details depend on the implementations). Without the hinted-at trick, an Atkin sieve would also need to be heavily optimised to stand a chance of meeting the time limit. Don't forget that the machines used by SPOJ are slow (and on top of that, SPOJ uses ghc-6.10, which is far less good at optimising such computations than 6.12 and the 7.x series).
I haven't tried it, but an equivalent C implementation can easily compute the first 10^9 prime numbers within seconds.
You mean the primes up to 10^9 here? For the heavily optimised primegen by D. Bernstein, the first 10^9 primes take 31.6 seconds here when customised to exploit the full L1 cache; 46 seconds with the vanilla block size. My segmented Eratosthenes sieve takes 83.5 seconds for that (but it overtakes primegen for somewhat higher limits). For the primes up to 10^9, the times are 0.675/1.378s for primegen, 3.29s for Eratosthenes. Based on my (limited) experience with the SPOJ machines, I'd estimate that primegen would take about 10 seconds there to sieve up to 10^9. Cheers, Daniel

Daniel Fischer
Well, thanks. So far I have tried a wheel-only approach and a Sieve-of-Eratosthenes-only approach. They both failed: when the numbers are between 999900000 and 1000000000, they can take more than an hour on my laptop.
They shouldn't. But you have to use the right data structures.
For a prime sieve, you need unboxed mutable arrays if you want it to be fast. You can use STUArrays from Data.Array.ST or mutable unboxed vectors from Data.Vector.Unboxed.Mutable.
That's what I've used. You find the code on hpaste [1]. It's a carefully optimized Sieve of Eratosthenes and needs around 20 seconds for the primes up to 10^9. See the refinement in the annotation, which I've added just now. Before that it took around 35 seconds. I considered that to be "too slow". [1] http://hpaste.org/56866
I haven't tried it, but an equivalent C implementation can easily compute the first 10^9 prime numbers within seconds.
You mean the primes up to 10^9 here?
Yes, sorry. And I was referring to the Sieve of Atkin there, but you are right. That one is only slightly faster. Greets, Ertugrul -- nightmare = unsafePerformIO (getWrongWife >>= sex) http://ertes.de/

Sorry for the late reply, haven't checked my mail for a while. On Tuesday 24 January 2012, 13:50:32, Ertugrul Söylemez wrote:
Daniel Fischer
wrote: Well, thanks. So far I have tried a wheel-only approach and a Sieve-of-Eratosthenes-only approach. They both failed: when the numbers are between 999900000 and 1000000000, they can take more than an hour on my laptop.
They shouldn't. But you have to use the right data structures.
For a prime sieve, you need unboxed mutable arrays if you want it to be fast. You can use STUArrays from Data.Array.ST or mutable unboxed vectors from Data.Vector.Unboxed.Mutable.
That's what I've used. You find the code on hpaste [1]. It's a carefully optimized Sieve of Eratosthenes and needs around 20 seconds for the primes up to 10^9. See the refinement in the annotation, which I've added just now. Before that it took around 35 seconds.
I considered that to be "too slow".
Right you are. Unless you need the primes to perform some time-consuming calculations with them after sieving, that is too slow. But let us compare it with a similar C implementation.
The interesting parts are:
========================
soeST :: forall s. Int -> ST s (STUArray s Int Bool)
soeST n = do
    arr <- newArray (0, n) True
    mapM_ (\i -> writeArray arr i False) [0, 1]
    let n2 = floor (sqrt (fromIntegral n))
    let loop :: Int -> ST s ()
        loop i | i > n2 = return ()
        loop i = do
            b <- readArray arr i
            let reset :: Int -> ST s ()
                reset j | j > n = return ()
                reset j = writeArray arr j False >> reset (j + i)
            when b (reset (i*i))
            loop (succ i)
    loop 2
    return arr

soeCount :: Int -> Int
soeCount = length . filter id . elems . soeA
========================
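(The excerpt leaves out its boilerplate; presumably something like the following sits around it. This is a guess at the omitted glue, not part of the original code: ScopedTypeVariables for the inner "ST s ()" signatures, the imports, and an soeA that runs the sieve and freezes the array.)
========================
{-# LANGUAGE ScopedTypeVariables #-}
import Control.Monad (when)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)

-- Assumed wrapper (not shown in the excerpt): run the ST sieve, freeze the result.
soeA :: Int -> UArray Int Bool
soeA n = runSTUArray (soeST n)
========================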
With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534
primes up to 10^9. That's in the same ballpark as your time, maybe a bit
faster, depends on what 'around 20 seconds' means exactly.
I haven't tried it, but an equivalent C implementation can easily compute the first 10^9 prime numbers within seconds.
You mean the primes up to 10^9 here?
Yes, sorry. And I was referring to the Sieve of Atkin there, but you are right. That one is only slightly faster.
Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted to my L1 cache, it counts the primes to 10^9 in 0.68 seconds. But it's only heavily optimised for primes to 2^32. Above that, my sieve catches up (and overtakes it around 5*10^11, yay).
Greets, Ertugrul
Cheers, Daniel

Hi guys, the STUArray you are using is unfamiliar to me. Is there any tutorial for it? Google didn't help. Thanks. On Feb 5, 2012, at 10:30 AM, Daniel Fischer wrote:
Sorry for the late reply, haven't checked my mail for a while.
On Tuesday 24 January 2012, 13:50:32, Ertugrul Söylemez wrote:
Daniel Fischer
wrote: Well, thanks. So far I have tried a wheel-only approach and a Sieve-of-Eratosthenes-only approach. They both failed: when the numbers are between 999900000 and 1000000000, they can take more than an hour on my laptop.
They shouldn't. But you have to use the right data structures.
For a prime sieve, you need unboxed mutable arrays if you want it to be fast. You can use STUArrays from Data.Array.ST or mutable unboxed vectors from Data.Vector.Unboxed.Mutable.
That's what I've used. You find the code on hpaste [1]. It's a carefully optimized Sieve of Eratosthenes and needs around 20 seconds for the primes up to 10^9. See the refinement in the annotation, which I've added just now. Before that it took around 35 seconds.
I considered that to be "too slow".
Right you are. Unless you need the primes to perform some time-consuming calculations with them after sieving, that is too slow.
But let us compare it with a similar C implementation.
The interesting parts are:
========================
soeST :: forall s. Int -> ST s (STUArray s Int Bool)
soeST n = do
    arr <- newArray (0, n) True
    mapM_ (\i -> writeArray arr i False) [0, 1]
    let n2 = floor (sqrt (fromIntegral n))
    let loop :: Int -> ST s ()
        loop i | i > n2 = return ()
        loop i = do
            b <- readArray arr i
            let reset :: Int -> ST s ()
                reset j | j > n = return ()
                reset j = writeArray arr j False >> reset (j + i)
            when b (reset (i*i))
            loop (succ i)
    loop 2
    return arr

soeCount :: Int -> Int
soeCount = length . filter id . elems . soeA
========================
With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534 primes up to 10^9. That's in the same ballpark as your time, maybe a bit faster, depends on what 'around 20 seconds' means exactly.
Let's be unfriendly first:
Arracc.h:
============
#include <stdint.h>

inline int readArray(uint64_t *arr, int n, int i);
inline void writeArray(uint64_t *arr, int n, int i, int b);
============

Arracc.c:
============
#include <stdio.h>
#include <stdlib.h>
#include "Arracc.h"

inline int readArray(uint64_t *arr, int n, int i){
    if (i < 0 || i > n){
        perror("Error in array index\n");
        exit(EXIT_FAILURE);
    }
    return (arr[i >> 6] >> (i & 63)) & 1;
}

inline void writeArray(uint64_t *arr, int n, int i, int b){
    if (i < 0 || i > n){
        perror("Error in array index\n");
        exit(EXIT_FAILURE);
    }
    if (b) {
        arr[i >> 6] |= (1ull << (i&63));
    } else {
        arr[i >> 6] &= ~(1ull << (i&63));
    }
}
============

main.c:
============
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include "Arracc.h"

uint64_t *soeA(int n);
int pCount(uint64_t *arr, int n);

int main(int argc, char *argv[]){
    int limit = (argc > 1) ? strtoul(argv[1],NULL,0) : 100000000;
    uint64_t *sieve = soeA(limit);
    printf("%d\n",pCount(sieve,limit));
    free(sieve);
    return EXIT_SUCCESS;
}

uint64_t *soeA(int n){
    int s = (n >> 6)+1;
    uint64_t *arr = malloc(s*sizeof *arr);
    if (arr == NULL){
        perror("Allocation failed\n");
        exit(EXIT_FAILURE);
    }
    int i, j, n2 = (int)sqrt(n);
    for(i = 0; i < s; ++i){
        arr[i] = -1;
    }
    writeArray(arr,n,0,0);
    writeArray(arr,n,1,0);
    for(i = 2; i <= n2; ++i){
        if (readArray(arr,n,i)){
            for(j = i*i; j <= n; j += i){
                writeArray(arr,n,j,0);
            }
        }
    }
    return arr;
}

int pCount(uint64_t *arr, int n){
    int i, count = 0;
    for(i = 0; i <= n; ++i){
        if (readArray(arr,n,i)){
            ++count;
        }
    }
    return count;
}
============

$ gcc -c -O3 Arracc.c
$ gcc -c -O3 main.c
$ gcc -O3 main.o Arracc.o -lm
Takes 18.3 seconds. Oops, slower than the Haskell programme.
Okay, ghc does some cross-module inlining and that enables further optimisations which we denied gcc here, so recompile everything with -flto additionally. That brings the C version down to 12.8 seconds.
Much better, but still not overwhelming. gcc can do a little better with everything in one file, 12.72 seconds.
We can squeeze out a little more by omitting the bounds checks for array indexing, 12.63 seconds.
But if we do the same for our Haskell programme, replace readArray/writeArray with unsafeRead/unsafeWrite, that becomes faster too, 14.3 seconds.
I have to admit that I don't know why the bounds-checking costs very little in C and a substantial amount in Haskell, but hey, who refuses free optimisations.
Lesson 1: Omit pointless array bounds checks. An array bounds check is pointless if you have just checked the validity of the index on the line above.
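To make that concrete on the Haskell side, here is a sketch of the sieve loop from above with the checked accesses swapped for unsafeRead/unsafeWrite from Data.Array.Base (the name soeST' and the exact shape are illustrative; the point is only that the loop guards already keep every index within 0..n, which is what makes dropping the per-access check safe):
========================
import Control.Monad (when)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray)
import Data.Array.Base (unsafeRead, unsafeWrite)

soeST' :: Int -> ST s (STUArray s Int Bool)
soeST' n = do
    arr <- newArray (0, n) True
    mapM_ (\i -> unsafeWrite arr i False) [0, 1]
    let n2 = floor (sqrt (fromIntegral n) :: Double)
        loop i
          | i > n2    = return ()
          | otherwise = do
              b <- unsafeRead arr i
              let reset j
                    | j > n     = return ()
                    | otherwise = unsafeWrite arr j False >> reset (j + i)
              when b (reset (i * i))
              loop (i + 1)
    loop 2
    return arr
========================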
Where are we?
We have two not overwhelmingly fast programmes, Haskell (ghc-7.2.2) about 13% slower than C (gcc-4.5.1).
Not a bad result for ghc, but the algorithm is less than optimal.
What's the deal?
We cross off multiples of primes 2549489372 times, and we need (10^9+1)/8 bytes of memory for the sieve.
Both numbers are higher than is good for us. Since the sieve is so large and we cross off the multiples of each prime sequentially until we reach the sieve limit, we have lots of cache misses. Bad for performance. And each crossing-off takes a couple of clock cycles,
arr[i >> 6] &= ~(1ull << (i&63));
shift index, fetch value, (index & 63), shift 1, complement, and with value, store value.
I'm no CPU expert, so I don't know how many cycles that takes, but even though some of the operations can be done in parallel on modern CPUs, it's still several cycles.
Lesson 2: Avoid duplicate work and redundant data. It's easy to separate even and odd numbers, there aren't many even primes, and it's unnecessary to mark even numbers as multiples of odd primes.
Separating the crossing-off of even and odd numbers, crossing off only odd multiples of odd primes, reduces the number of crossings-off by almost 40% and the running time for the C programme to 8.63 seconds (I haven't done that for the Haskell programme).
If we go the small step further and eliminate the even numbers from the sieve completely, we reduce the required memory by half and the crossings- off by almost 60% (the fraction of eliminated crossings-off slowly decreases to 50% for higher limits, but as far as today's RAM takes us, it remains close to 60%).
At the cost of very slightly more complicated code, we can thus reduce the running time to 6.06 seconds (C) resp. 6.49 seconds (Haskell) (about 7% slower than C, not bad at all).
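For the Haskell side, a sketch of what eliminating the even numbers from the sieve can look like (an illustration of the idea, not the code that produced the 6.49 s figure): index i of the array stands for the odd number 2*i+1, so the array is half the size and even numbers are never written at all.
========================
import Control.Monad (when, forM_)
import Control.Monad.ST (ST)
import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
import Data.Array.Unboxed (UArray, elems)

-- Odds-only sieve: index i represents the odd number 2*i+1.
oddSieve :: Int -> UArray Int Bool
oddSieve n = runSTUArray sieve
  where
    sieve :: ST s (STUArray s Int Bool)
    sieve = do
        let half = (n - 1) `div` 2
            n2   = floor (sqrt (fromIntegral n) :: Double)
        arr <- newArray (1, half) True
        forM_ [1 .. (n2 - 1) `div` 2] $ \i -> do
            isP <- readArray arr i
            when isP $ do
                let p     = 2 * i + 1
                    start = (p * p - 1) `div` 2   -- index of p*p, where crossing off starts
                forM_ [start, start + p .. half] $ \j -> writeArray arr j False
        return arr

-- Primes up to n: the single even prime 2 plus the surviving odd indices.
countPrimes :: Int -> Int
countPrimes n
  | n < 2     = 0
  | otherwise = 1 + length (filter id (elems (oddSieve n)))
========================
Note that the stride in index space is p rather than 2*p in value space, so each crossing-off loop also touches only half as many cells.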
That's a pretty good start, for some purposes it's already good enough.
If we need more, the next step is to eliminate multiples of 3 from the sieve. That reduces the memory requirements by a further factor of 2/3, and the number of crossings-off by a further (mumble, didn't do an exact calculation, educated guess) roughly 40% (that fraction decreases to 1/3 for higher limits, but again very slowly).
The code becomes a bit more complicated again - more than in the previous step, but still fairly straightforward.
The running time reduces by another factor of 0.65-0.7 or so. We'd be in the region of 4-5 seconds then.
One can eliminate the multiples of further small primes, for each prime the additional code complexity increases and the reduction in work/running time decreases. At some point, the slowdown from the more complex code annihilates the gain from the reduced work. I stopped after eliminating multiples of 5, 7 might be worth it, but I'm not convinced.
Using a segmented sieve provides better locality at the cost of more complex code. If the sieve is small enough to fit in the L2 cache and large enough that some significant work can be done per segment, it's a net gain.
tl;dr: The naive algorithm we started from is too slow for higher limits; to get goodish performance, it has to be tweaked heavily. But Haskell plays in the same league as C (at least the C I've written).
I haven't tried it, but an equivalent C implementation can easily compute the first 10^9 prime numbers within seconds.
You mean the primes up to 10^9 here?
Yes, sorry. And I was referring to the Sieve of Atkin there, but you are right. That one is only slightly faster.
Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted to my L1 cache, it counts the primes to 10^9 in 0.68 seconds. But it's only heavily optimised for primes to 2^32. Above that, my sieve catches up (and overtakes it around 5*10^11, yay).
Greets, Ertugrul
Cheers, Daniel
Best regards, Zhi-Qiang Lei zhiqiang.lei@gmail.com

Zhi-Qiang Lei
I have bungled the Prime Generator problem on SPOJ many times. Although the program keeps getting faster on my machine, it still gets "time limit exceeded" on SPOJ (6-second limit). My current approach: for every number in the range, check whether it is prime by trial-dividing it by all the primes smaller than or equal to its square root. For the numbers between 999900000 and 1000000000, it takes 0.64 seconds on my laptop. Could anyone give me some hints to enhance it? Thanks.
There are some pretty good examples of fast prime generators here: http://www.haskell.org/haskellwiki/Prime_numbers although using them for SPOJ is a bit like cheating :) I learned a lot from that page though, so I advise giving it a good readover. -- Burton Samograd http://kruhft.dyndns.org
participants (5)
- Burton Samograd
- Daniel Fischer
- Ertugrul Söylemez
- Yucheng Zhang
- Zhi-Qiang Lei