FASTER primes (was: Re: Code and Perf. Data for Prime Finders (was: Genuine Eratosthenes sieve))

apfelmus
~~ This is a repost, with apologies to anyone who sees this twice (I've replied to a two years old thread, and it doesn't show up in GMANE as I thought it would). ~~
Dave Bayer wrote:
What I'm calling a "venturi"
venturi :: Ord a => [[a]] -> [a]
merges an infinite list of infinite lists into one list, under the assumption that each list, and the heads of the lists, are in increasing order.
I wrote this as an experiment in coding data structures in Haskell's lazy evaluation model, rather than as explicit data. The majority of the work done by this code is done by "merge"; the multiples of each prime percolate up through a tournament consisting of a balanced tree of suspended "merge" function calls. In order to create an infinite lazy balanced tree of lists, the datatype
data List a = A a (List a) | B [a]
is used as scaffolding. One thinks of the root of the infinite tree as starting at the leftmost child, and crawling up the left spine as necessary.
After some pondering, the List a data structure for merging is really ingenious! :) Here's an attempt to explain how it works:
The task is to merge a number of sorted and infinite lists when it's known that their heads are in increasing order. In particular, we want to write
primes = (2:) $ diff [3..] $ venturi $ map multiple primes
Thus, we have a (maybe infinite) list
xss = [xs1, xs2, xs3, ...]
of infinite lists with the following properties
all sorted xss
sorted (map head xss)
where sorted is a function that returns True if the argument is a sorted list. A first try to implement the merging function is
venturi xss = foldr1 merge xss
            = xs1 `merge` (xs2 `merge` (xs3 `merge` ...
where merge is the standard function to merge two sorted lists.
However, there are two problems. The first problem is that this doesn't work for infinite lists since merge is strict in both arguments. But the property head xs1 < head xs2 < head xs3 < ..., which we have not exploited yet, can now be used in the following way
venturi xss = foldr1 merge' xss
merge' (x:xt) ys = x : merge xt ys
In other words, merge' is biased towards the left element
merge' (x:_|_) _|_ = x : _|_
which is correct since we know that (head xs < head ys).
The second problem is that we want the calls to merge to be arranged as a balanced binary tree since that gives an efficient heap. It's not so difficult to find a good shape for the infinite tree, the real problem is to adapt merge' to this situation since it's not associative:
......
The problem is that the second form doesn't realize that y is also smaller than the third argument. In other words, the second form has to treat more than one element as "privileged", namely x1,x2,... and y. This can be done with the aforementioned list data structure
data People a = VIP a (People a) | Crowd [a]
The people (VIPs and crowd) are assumed to be _sorted_. Now, we can start to implement
merge' :: Ord a => People a -> People a -> People a
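The post breaks off here; a plausible way to finish merge' - along the lines of the "Implicit Heap" section of the haskellwiki Prime_Numbers page referenced below, so a sketch rather than necessarily apfelmus's exact code - serves VIPs from the left argument without ever looking at the right one, and falls back to a plain merge once both sides have settled down into crowds:

merge' :: Ord a => People a -> People a -> People a
merge' (VIP x xt) ys                   = VIP x (merge' xt ys)   -- left VIPs pass through for free
merge' (Crowd xs) (Crowd ys)           = Crowd (merge xs ys)    -- two crowds: ordinary ordered merge
merge' xs@(Crowd (x:xt)) ys@(VIP y yt)
    | x <= y    = VIP x (merge' (Crowd xt) ys)
    | otherwise = VIP y (merge' xs yt)
-- (Crowd [] cases are omitted: the crowds here are infinite lists)

merge :: Ord a => [a] -> [a] -> [a]
merge xs@(x:xt) ys@(y:yt)
    | x <= y    = x : merge xt ys
    | otherwise = y : merge xs yt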
Hi, ... replying to a two-years-old post here, :) :) and after consulting the full "VIP" version in haskellwiki/Prime_Numbers#Implicit_Heap ...

It is indeed the major problem with the merged multiples removing code (similar to Richard Bird's code from Melissa O'Neill's JFP article) - the linear nature of foldr, requiring an (:: a->b->b) merge function. To make it freely composable, to rearrange the list into a tree of arbitrary form, it must indeed be type-uniform (:: a->a->a) first, and associative second.

The structure of the folded tree should be chosen to better suit the primes multiples production. I guesstimate the total cost as Sum (1/p)*d, where p is a generating prime at a leaf, and d the leaf's depth, i.e. the number of merge nodes its produced multiples must pass on their way to the top.

The structure used in your VIP code, 1+(2+(4+(8+...))), can actually be improved upon with another, (2+4)+( (4+8)+( (8+16)+...)), for which the estimated cost is about 10%-12% lower. This can be expressed concisely as the following:

primes :: () -> [Integer]
primes () = 2:primes'
  where
    primes'   = [3,5] ++ drop 2 [3,5..] `minus` comps
    mults     = map (\p-> fromList [p*p,p*p+2*p..]) $ primes'
    (comps,_) = tfold mergeSP (pairwise mergeSP mults)
    fromList (x:xs) = ([x],xs)

tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f (pairwise f xs)
pairwise f (x:y:ys)        = f x y : pairwise f ys

mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c
                       in (a ++ bc, merge b' d)
  where
    spMerge u@(x:xs) w@(y:ys) = case compare x y of
        LT -> (x:c,d) where (c,d) = spMerge xs w
        EQ -> (x:c,d) where (c,d) = spMerge xs ys
        GT -> (y:c,d) where (c,d) = spMerge u  ys
    spMerge u [] = ([], u)
    spMerge [] w = ([], w)

with ''merge'' and ''minus'' defined in the usual way. Its run times are indeed improved 10%-12% over the VIP code from the haskellwiki page. Testing was done by running the code, interpreted, inside GHCi. The ordered "split pairs", representing ordered lists here as pairs of a known, finite (so far) prefix and the rest of the list, form a _monoid_ under mergeSP.

Or with wheel,

primes :: () -> [Integer]
primes () = 2:3:5:7:primes'
  where
    primes'   = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    mults     = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'
    (comps,_) = tfold mergeSP (pairwise mergeSP mults)
    fromList (x:xs) = ([x],xs)
    rollFrom n = let x = (n-11) `mod` 210
                     (y,_) = span (< x) wheelSums
                 in roll n $ drop (length y) wheel

wheelSums = roll 0 wdiffs
roll      = scanl (+)
wheel     = wdiffs ++ wheel
wdiffs    = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
            4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wdiffs

Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.

It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
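For reference, the ''merge'' and ''minus'' "defined in the usual way" can be taken to be the ordered-list union and difference that Will gives verbatim later in this thread (and says he put on the primes haskellwiki page):

minus a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x: xs `minus` b
        GT ->    a `minus` ys
        EQ ->    xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x: merge xs b
        EQ -> x: merge xs ys
        GT -> y: merge a ys
merge a b = if null b then a else b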

Will Ness
wheelSums = roll 0 wdiffs
roll      = scanl (+)
wheel     = wdiffs ++ wheel
wdiffs    = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
            4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wdiffs
Apparently that works too, but I meant it to be:

wdiffs = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
         4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:[]

:)

On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2) fares. Do the relations continue to hold? How does it compare to a bitsieve?

Daniel Fischer
On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2) fares. Do the relations continue to hold? How does it compare to a bitsieve?

Haven't gotten to that part yet. :)

But why is it more important? Would that not tell us more about the compiler performance than the code itself?

This code is just an endpoint (so far) in a short procession of natural stepwise development of the famous classic Turner's sieve, through the "postponed filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted to see where the simple "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.

It's not about writing the fastest code in _advanced_ Haskell. It's about having clear and simple code that can be understood at a glance - i.e. contributes to our understanding of a problem - faithfully reflecting its essential elements, and because of _that_, fast. It's kind of like _not_ using mutable arrays in a quicksort.

Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel right to me somehow, especially the claim that going by primes squares is "a pleasing but minor optimization", what with the postponed filters (which serves as the framework for all the other variants) achieving the orders of magnitude speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing that squares optimization (with the final version hovering around 1.24..1.17 in the tested range). The Euler's sieve being a special case of Eratosthenes's, too, doesn't lend credence to claims that only the PQ version is somehow uniquely authentic and "faithful" to it.

Turner's sieve should always have been looked at as just a specification, not code, anyway, and actually running it is ridiculous. The postponed filters version is the one to be used as a reference point for the basic _code_, precisely because it _does_ use the primes squares optimization, which _is_ essential to any basic sieve.
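For reference, the "postponed filters" version this whole progression keeps coming back to is quoted in full further down the thread; it is just:

primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
      where (h,~(_:t)) = span (< p*p) xs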

2009/12/29 Will Ness
Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important? Would that not tell us more about the compiler performance than the code itself?
If you mean "algorithmic complexity", you shouldn't care about a difference of 2.5x. If you mean "actual performance for a particular task", you should measure the performance in realistic conditions. Namely, if you're implementing a program that needs efficient generation of primes, won't you compile it with -O2?
This code is just an endpoint (so far) in a short procession of natural stepwise development of the famous classic Turner's sieve, through the "postponed filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted to see where the simple "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.
It's not about writing the fastest code in _advanced_ Haskell. It's about having clear and simple code that can be understood at a glance - i.e. contributes to our understanding of a problem - faithfully reflecting its essential elements, and because of _that_, fast. It's kind of like _not_ using mutable arrays in a quicksort.
Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel right to me somehow, especially the claim that going by primes squares is "a pleasing but minor optimization", what with the postponed filters (which serves as the framework for all the other variants) achieving the orders of magnitude speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing that squares optimization (with the final version hovering around 1.24..1.17 in the tested range). The Euler's sieve being a special case of Eratosthenes's, too, doesn't lend credence to claims that only the PQ version is somehow uniquely authentic and "faithful" to it.
Turner's sieve should always have been looked at as just a specification, not code, anyway, and actually running it is ridiculous. The postponed filters version is the one to be used as a reference point for the basic _code_, precisely because it _does_ use the primes squares optimization, which _is_ essential to any basic sieve.
-- Eugene Kirpichov Web IR developer, market.yandex.ru

Eugene Kirpichov
2009/12/29 Will Ness
: Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important? Would that not tell us more about the compiler performance than the code itself?
If you mean "algorithmic complexity", you shouldn't care about a difference of 2.5x.
It's not just at one point; the asymptotics are _the_same_ across the range that I've tested (admittedly, somewhat narrow). I measure local behavior simply as the logBase, in the base of the ratio of problem sizes, of the ratio of run times.
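In other words, a small sketch of that measure (not from the thread; n1, n2 are the two problem sizes and t1, t2 the corresponding run times):

-- empirical local order of growth: the k such that t ~ c*n^k over [n1,n2]
orderOfGrowth :: Double -> Double -> Double -> Double -> Double
orderOfGrowth n1 t1 n2 t2 = logBase (n2/n1) (t2/t1)
-- e.g. orderOfGrowth 10000 1.0 20000 2.83  ~=  1.5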
If you mean "actual performance for a particular task", you should measure the performance in realistic conditions. Namely, if you're implementing a program that needs efficient generation of primes, won't you compile it with -O2?
If I realistically needed primes generated in a real life setting, I'd probably have to use some C for that. If OTOH we're talking about a tutorial code that is to be as efficient as possible without losing its clarity, being a reflection of essentials of the problem, then any overly complicated advanced Haskell wouldn't be my choice either. And seeing that this overly-complicated (IMO), steps-jumping PQ-based code was sold to us as the only "faithful" rendering of the sieve, I wanted to see for myself whether this really holds water.

Gee, seems my mail server reads your posts very thoroughly today :)

On Tuesday 29 December 2009 14:58:10, Will Ness wrote:
Eugene Kirpichov
writes: 2009/12/29 Will Ness
: Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important? Would that not tell us more about the compiler performance than the code itself?
If you mean "algorithmic complexity", you shouldn't care about a difference of 2.5x.
It's not just at one point; the asymptotics are _the_same_ across the range that I've tested (admittedly, somewhat narrow). I measure local behavior simply as the logBase, in the base of the ratio of problem sizes, of the ratio of run times.
If you mean "actual performance for a particular task", you should measure the performance in realistic conditions. Namely, if you're implementing a program that needs efficient generation of primes, won't you compile it with -O2?
If I realistically needed primes generated in a real life setting, I'd probably have to use some C for that.
If you need the utmost speed, then probably yes. If you can do with a little less, my STUArray bitsieves take about 35% longer than the equivalent C code and are roughly eight times faster than ONeillPrimes. I can usually live well with that.
If OTOH we're talking about a tutorial code that is to be as efficient as possible without losing its clarity, being a reflection of essentials of the problem, then any overly complicated advanced Haskell wouldn't be my choice either.
+1 Though perhaps we view mutable array code differently. In my view, it's neither advanced nor complicated.
And seeing that this overly-complicated (IMO), steps-jumping PQ-based code was sold to us as the only "faithful" rendering of the sieve, I wanted to see for myself whether this really holds water.
I can understand that very well.

Daniel Fischer
Gee, seems my mail server reads your posts very thoroughly today :)
I hope it's not a bad thing. :)
On Tuesday 29 December 2009 14:58:10, Will Ness wrote:
If I realistically needed primes generated in a real life setting, I'd probably have to use some C for that.
If you need the utmost speed, then probably yes. If you can do with a little less, my STUArray bitsieves take about 35% longer than the equivalent C code and are roughly eight times faster than ONeillPrimes. I can usually live well with that.

Wow! That's fast! (that's the code from haskellwiki's primes page, right?)
If OTOH we're talking about a tutorial code that is to be as efficient as possible without losing its clarity, being a reflection of essentials of the problem, then any overly complicated advanced Haskell wouldn't be my choice either.
+1 Though perhaps we view mutable array code differently. In my view, it's neither advanced nor complicated.
convoluted, then. Not using "higher level" concepts, like map and foldr, :) or

head.until isSingleton (pairwise merge).map wrap

, that kind of thing. :)

BTW I think a really smart compiler should just get a specification, like Turner's sieve, and just derive a good code from that by itself. Another example would be

qq n m = sum $ take n $ cycle [1..m]

which should really get compiled into just a math formula, IMO. Now _that_ I would call a good compiler. That way I really won't have to learn how to use STUArrays, you see.

I've seen this question asked a lot, what should be a good programming language? IMO, English (plus math where needed, and maybe some sketches by hand). :)
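The qq example does have a simple closed form - plain arithmetic, sketched here only to illustrate what such a compiler would have to discover:

qq, qqClosed :: Int -> Int -> Int
qq n m       = sum $ take n $ cycle [1..m]
qqClosed n m = q * (m*(m+1) `div` 2) + r*(r+1) `div` 2   -- q full cycles plus a partial one
  where (q,r) = n `divMod` m
-- for n >= 0, m >= 1:  qq n m == qqClosed n m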
And seeing that this overly-complicated (IMO), steps-jumping PQ-based code was sold to us as the only "faithful" rendering of the sieve, I wanted to see for myself whether this really holds water.
I can understand that very well.
:)

On Wednesday 30 December 2009 01:23:32, Will Ness wrote:
Daniel Fischer
writes: Gee, seems my mail server reads your posts very thoroughly today :)
I hope it's not a bad thing. :)
It means, twenty minutes after I replied to the previous, I got your hours old post which mentions points I could have included in the aforementioned reply. Not really bad, but somewhat inconvenient.
On Tuesday 29 December 2009 14:58:10, Will Ness wrote:
If I realistically needed primes generated in a real life setting, I'd probably have to use some C for that.
If you need the utmost speed, then probably yes. If you can do with a little less, my STUArray bitsieves take about 35% longer than the equivalent C code and are roughly eight times faster than ONeillPrimes. I can usually live well with that.
Wow! That's fast! (that's the code from haskellwiki's primes page, right?)
No, it's my own code. Nothing elaborate, just sieving numbers 6k±1, twice as fast as the haskellwiki code (here) and uses only 1/3 the memory. For the record:

----------------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module SievePrimes where

import Data.Array.Base (unsafeRead, unsafeWrite, unsafeAt)
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad (when)
import Data.Bits

primesUpTo :: Integer -> [Integer]
primesUpTo bound = 2:3:[fromIntegral (3*i + (if testBit i 0 then 4 else 5))
                           | i <- [0 .. bd], par `unsafeAt` i]
  where
    bd0 = 2*(bound `div` 6) - case bound `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 }
    bd  = fromInteger bd0
    sr  = floor (sqrt (fromIntegral bound))
    sbd = 2*(sr `div` 6) - case sr `mod` 6 of { 0 -> 2; 5 -> 0; _ -> 1 }
    par :: UArray Int Bool
    par = runSTUArray $ do
        ar <- newArray (0,bd) True
        let tick i s1 s2
              | bd < i    = return ()
              | otherwise = do
                  p <- unsafeRead ar i
                  when p (unsafeWrite ar i False)
                  tick (i+s1) s2 s1
            sift i
              | i > sbd   = return ar
              | otherwise = do
                  p <- unsafeRead ar i
                  when p (let (!start,!s1,!s2) = if even i
                                                     then (i*(3*i+10)+7,2*i+3,4*i+7)
                                                     else (i*(3*i+8)+4,4*i+5,2*i+3)
                          in tick start s1 s2)
                  sift (i+1)
        sift 0
----------------------------------------------------------------------

The index magic is admittedly not obvious, but if you take that on faith, the rest is pretty clear. To find the n-th prime,

----------------------------------------------------------------------
module Main (main) where

import SievePrimes (primesUpTo)
import System.Environment (getArgs)

printNthPrime :: Int -> IO ()
printNthPrime n = print (n, primesUpTo (estim n) !! (n-1))

main = do
    args <- getArgs
    printNthPrime $ read $ head args

estim :: Int -> Integer
estim k
    | n < 13    = 37
    | n < 70    = 349
    | otherwise = round x
      where
        n = fromIntegral k :: Double
        x = n * (log n + log (log n) - 1 + 1.05 * log (log n) / log n)
----------------------------------------------------------------------

If I don't construct the prime list but count directly on the array, it's ~6% faster.
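A small aid for the "index magic" (my reading of the code above, not part of Daniel's module): array index i stands for the i-th number of the form 6k±1 counting up from 5.

-- even i -> 3*i+5 (which is 5 mod 6),  odd i -> 3*i+4 (which is 1 mod 6)
toCandidate :: Int -> Integer
toCandidate i = fromIntegral (3*i + (if odd i then 4 else 5))
-- map toCandidate [0..7]  ==  [5,7,11,13,17,19,23,25]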
If OTOH we're talking about a tutorial code that is to be as efficient as possible without losing its clarity, being a reflection of essentials of the problem, then any overly complicated advanced Haskell wouldn't be my choice either.
+1 Though perhaps we view mutable array code differently. In my view, it's neither advanced nor complicated.
convoluted, then. Not using "higher level" concepts, like map and foldr, :) or head.until isSingleton (pairwise merge).map wrap , that kind of thing. :)
BTW I think a really smart compiler should just get a specification, like Turner's sieve, and just derive a good code from that by itself.
Go ahead and write one. I would love such a beast.
Another example would be
qq n m = sum $ take n $ cycle [1..m]
which should really get compiled into just a math formula, IMO. Now _that_ I would call a good compiler.
Dream on, dream on, with hope in your heart.
That way I really won't have to learn how to use STUArrays, you see.
I've seen this question asked a lot, what should be a good programming language?
IMO, English (plus math where needed, and maybe some sketches by hand). :)
Maybe you'd like http://en.wikipedia.org/wiki/Shakespeare_(programming_language) ?

Daniel Fischer
No, it's my own code. Nothing elaborate, just sieving numbers 6k±1, twice as fast as the haskellwiki code (here) and uses only 1/3 the memory. For the record: .....

thanks! will need to sift through it thoroughly... :) :)
BTW I think a really smart compiler should just get a specification, like Turner's sieve, and just derive a good code from that by itself.
Go ahead and write one. I would love such a beast.
Another example would be
qq n m = sum $ take n $ cycle [1..m]
which should really get compiled into just a math formula, IMO. Now _that_ I would call a good compiler.
Dream on, dream on, with hope in your heart.
Those who can't do, dream. And rant. :)
Maybe you'd like http://en.wikipedia.org/wiki/Shakespeare_(programming_language) ?
niice. :)

On Tuesday 29 December 2009 14:34:03, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important?
I thought the uppercase FASTER in the subject meant you were really interested in speed. If you're only interested in asymptotics, interpreted may be appropriate. However, it is possible that optimisation can change the perceived asymptotics of an algorithm (determined strictness eliminating thunks for example).

While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci.

And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods.
Would that not tell us more about the compiler performance than the code itself?
Unless you write machine code or assembly, don't all performance tests tell us more about the compiler/interpreter performance than the code itself? That is, of course, with respect to algorithms with the same scaling behaviour.
This code is just an endpoint (so far) in a short procession of natural stepwise development of the famous classic Turner's sieve,
That was

sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)

, was it?
through the "postponed filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted to see where the simple "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.
Good.
It's not about writing the fastest code in _advanced_ Haskell. It's about having clear and simple code that can be understood at a glance - i.e. contributes to our understanding of a problem - faithfully reflecting its essential elements, and because of _that_, fast. It's kind of like _not_ using mutable arrays in a quicksort.
What's wrong with mutable arrays? There are a lot of algorithms which can be easily and efficiently implemented using mutable unboxed arrays while a comparably efficient implementation without mutable arrays is hard. For those, I consider STUArrays the natural choice. Sieving primes falls into that category.
Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel right to me somehow,
I fully agree.
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
what with the postponed filters (which serves as the framework for all the other variants) achieving the orders of magnitude speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing that squares optimization (with the final version hovering around 1.24..1.17 in the tested range). The Euler's sieve being a special case of Eratosthenes's, too, doesn't lend credence to claims that only the PQ version is somehow uniquely authentic and "faithful" to it.
I never found that claim convincing either.
Turner's sieve should always have been looked at as just a specification, not code, anyway, and actually running it is ridiculous. The postponed filters version is the one to be used as a reference point for the basic _code_, precisely because it _does_ use the primes squares optimization, which _is_ essential to any basic sieve.

Daniel Fischer
On Tuesday 29 December 2009 14:34:03, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
Haven't gotten to that part yet. :)
But why is it more important?
I thought the uppercase FASTER in the subject meant you were really interested in speed. If you're only interested in asymptotics, interpreted may be appropriate.
However, it is possible that optimisation can change the perceived asymptotics of an algorithm (determined strictness eliminating thunks for example).
While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code
that was what I was getting at first too, before I've put into my code the _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it was only 1.3x slower, when compiled (with about same asymptotics and memory usage).
and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci.
see, this kind of inconsistency is exactly why I was concentrating only on one platform in measuring the speed - the interp'/GHCi combination. Especially when developing and trying out several approaches, to test with compiler just takes too long. :) And why should it give (sometimes) wildly different readings when running inside GHCi or standalone ??
And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods.
Here's what I'm using (BTW I've put it on the primes haskellwiki page too). The memory reported for interpreted is about half of PQ's (IIRC), and compiled - the same:

minus a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x: xs `minus` b
        GT ->    a `minus` ys
        EQ ->    xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x: merge xs b
        EQ -> x: merge xs ys
        GT -> y: merge a ys
merge a b = if null b then a else b
Would that not tell us more about the compiler performance than the code itself?
Unless you write machine code or assembly, don't all performance tests tell us
more about the compiler/interpreter performance than the code itself?
That is, of course, with respect to algorithms with the same scaling behaviour.
This code is just an endpoint (so far) in a short procession of natural stepwise development of the famous classic Turner's sieve,
That was
sieve (x:xs) = x:sieve (filter ((/= 0) . (`mod` x)) xs)
, was it?
right
through the "postponed filters", through to Euler's sieve, the merging sieve (i.e. Richard Bird's) and on to the tree-fold merging, with wheel. I just wanted to see where the simple "normal" (i.e. _beginner_-friendly) functional code can get, in a natural way.
Good.
It's not about writing the fastest code in _advanced_ Haskell. It's about having clear and simple code that can be understood at a glance - i.e. contributes to our understanding of a problem - faithfully reflecting its essential elements, and because of _that_, fast. It's kind of like _not_ using mutable arrays in a quicksort.
What's wrong with mutable arrays? There are a lot of algorithms which can be easily and efficiently implemented using mutable unboxed arrays while a comparably efficient implementation without mutable arrays is hard. For those, I consider STUArrays the natural choice. Sieving primes falls into that category.
It's just that the mutating code tends to be convoluted, like in the example I mentioned of quicksort. One has to read the C code with good attention to understand it. "Normal" Haskell is much more visually apparent, like

primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
      where
        (h,~(_:t)) = span (< q) xs
        q          = p*p

or

primes = 2: 3: sieve [] (tail primes) 5
  where
    sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i]
                        ++ sieve ((2*p,q):fs') ps (q+2)
      where
        q     = p*p
        mults = [ [y+s,y+2*s..q] | (s,y)<- fs]
        fs'   = [ (s,last ms) | ((s,_),ms)<- zip fs mults]
        a     = accumArray (\a b->False) True (x,q-2)
                           [(i,()) | ms<- mults, i<- ms]
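Either of these can be tried directly in GHCi - the first needs the `minus` given earlier in this thread, the second needs Data.Array imported for accumArray and (!). For instance:

take 10 primes   -- [2,3,5,7,11,13,17,19,23,29]
primes !! 999    -- 7919, the thousandth prime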
Seeing claims that it's _either_ Turner's _or_ the PQ-based code didn't feel right to me somehow,
I fully agree.
:) :) :) :) :)
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic
complexity *and* reduces the constant factors significantly.

Exactly! Seeing this claim was just incredible to me. I've spent a considerable time when I first learned Haskell, tweaking the SICP code (as I remembered it; probably very similar to Turner's) until coming up with an equivalent of the "postponed sieve" (some years ago, didn't know about "span" yet :) ). But I assumed that this result was well known. Turner's sieve should long be regarded as _specification_, not actual _code_, I thought.

I think what happened was that Melissa O'Neill thought about the mutable storage i.e. "imperative" implementation when she said that, where numbers do get "crossed off" from the same "canvas". But here in functional code we don't "cross off" any numbers; we deal with numbers supply and filtering and merging, and nested function calls with their overhead etc., which costs can't be just ignored. IOW there's no "crossing off" done by any of the extra filters, which nevertheless are all VERY busy, doing nothing. _Not_ "crossing" the multiples "off".
what with the postponed filters (which serves as the framework for all the other variants) achieving the orders of magnitude speedup and cutting the Turner's O(n^2) right down to O(n^1.5) just by doing that squares optimization (with the final version hovering around 1.24..1.17 in the tested range). The Euler's sieve being a special case of Eratosthenes's, too, doesn't lend credence to claims that only the PQ version is somehow uniquely authentic and "faithful" to it.
I never found that claim convincing either.
I think what got crossed probably was "faithful to the original algorithm", with "faithful" to its typical imperative mutable storage implementation, as in "having same _complexity_". In that sense of course, linear merging is worse; it has worse complexity than "C", but is nevertheless faithful to the original algorithm, only under the functional setting. It is worse because of linear nature of lists, and it is all too easy to overlook the possibility of tree folding and jump to the conclusion that one needs a specialized data structure for that...

But the article didn't even get to that part; instead it was all about proving rigorously that divisibility testing for primes is very costly (without actually formulating this conclusion). It was frustrating to read that "details of what gets crossed off and how, matter" without these details being actually spelled out - simply, that primes shouldn't be tested at all. That's the real insight of the article, IMO.
Turner's sieve should always have been looked at as just a specification, not code, anyway, and actually running it is ridiculous. The postponed filters version is the one to be used as a reference point for the basic _code_, precisely because it _does_ use the primes squares optimization, which _is_ essential to any basic sieve.

On Wednesday 30 December 2009 01:04:34, Will Ness wrote:
While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code
that was what I was getting at first too, before I've put into my code the _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it was only 1.3x slower, when compiled (with about same asymptotics and memory usage).
Specialising your code to Int makes it half as fast as ONeill here (as an executable). That is largely due to the fact that your code uses much more memory here (54MB vs. 2MB for the millionth prime), though, the MUT times have a ratio of about 1.5. Now an interesting question is, why does it use so much memory here? Can you send me your exact code so I can see how that behaves here?
and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci.
see, this kind of inconsistency is exactly why I was concentrating only on one platform in measuring the speed - the interp'/GHCi combination.
The problem with that is that one is primarily interested in speed for library functions, which are mostly used as compiled code.
Especially when developing and trying out several approaches, to test with compiler just takes too long. :) And why should it give (sometimes) wildly different readings when running inside GHCi or standalone ??
Good question.
And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods.
Here's what I'm using (BTW I've put it on the primes haskellwiki page too). The memory reported for interpreted is about half of PQ's (IIRC), and compiled - the same:
minus a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x: xs `minus` b
        GT ->    a `minus` ys
        EQ ->    xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
        LT -> x: merge xs b
        EQ -> x: merge xs ys
        GT -> y: merge a ys
merge a b = if null b then a else b
More or less the same that I wrote.
What's wrong with mutable arrays? There are a lot of algorithms which can be easily and efficiently implemented using mutable unboxed arrays while a comparably efficient implementation without mutable arrays is hard. For those, I consider STUArrays the natural choice. Sieving primes falls into that category.
It's just that the mutating code tends to be convoluted, like in the example I mentioned of quicksort. One has to read the C code with good attention to understand it.
Convoluted is (often) an exaggeration. But I agree that the specification of 'what' is usually easier to understand than that of 'how'.
"Normal" Haskell is much more visually apparent, like
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
      where
        (h,~(_:t)) = span (< q) xs
        q          = p*p
Yes.
or
primes = 2: 3: sieve [] (tail primes) 5
  where
    sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i]
                        ++ sieve ((2*p,q):fs') ps (q+2)
      where
        q     = p*p
        mults = [ [y+s,y+2*s..q] | (s,y)<- fs]
        fs'   = [ (s,last ms) | ((s,_),ms)<- zip fs mults]
        a     = accumArray (\a b->False) True (x,q-2)
                           [(i,()) | ms<- mults, i<- ms]
Umm, really? I'd think if you see what that does, you won't have difficulties with a mutable array sieve.

Daniel Fischer
On Wednesday 30 December 2009 01:04:34, Will Ness wrote:
While I haven't detected that with the primes code, I find that in my ghci your code is approximately 2.5 times faster than ONeill or Bayer when interpreted (no difference in scaling observed), while when compiled with -O2, ONeill is approximately three times as fast as your code
that was what I was getting at first too, before I've put into my code the _type_signatures_ and the "specialize" _pragmas_ as per her file. Then it was only 1.3x slower, when compiled (with about same asymptotics and memory usage).
Specialising your code to Int makes it half as fast as ONeill here (as an executable). That is largely due to the fact that your code uses much more memory here (54MB vs. 2MB for the millionth prime), though, the MUT times have a ratio of about 1.5.
I'm an unsophisticated tester. I just use

GHC -O2 -c filename.hs
GHCi filename

and then it says ( for primes()!!1000000 )

(8.24 secs, 1906813836 bytes)  for my code, and
(6.09 secs, 1800873864 bytes)  for O'Neill's

But now when I've looked at system resources I see this too. Well, it means we've found where the PQ code is better. OK.
Now an interesting question is, why does it use so much memory here? Can you send me your exact code so I can see how that behaves here?
will do. It's probably doing a lot more bookkeeping. Or it might be some impl issue with scanl or span etc., and it'll go away if we recode it directly, who knows. We can only guess if we don't know the compiler inside and out. That's exactly what kept me from using the compiler. Guessing. Sheesh. (And I did see a 2.0x speedup once when replacing one simple code snippet with its operationally equivalent twin.)
and twice as fast as Bayer as an executable, about twice as fast as your code and slightly slower than Bayer in ghci.
see, this kind of inconsistency is exactly why I was concentrating only on one platform in measuring the speed - the interp'/GHCi combination.
The problem with that is that one is primarily interested in speed for library functions, which are mostly used as compiled code.
right and I used it as a measure of code's "fitness" to the problem so it was only comparative to me.
Especially when developing and trying out several approaches, to test with compiler just takes too long. :) And why should it give (sometimes) wildly different readings when running inside GHCi or standalone ??
Good question.
And I have huge memory problems in ghci with your code. That may be due to my implementation of merge and minus, though. You wrote 'standard' and I coded the straightforward methods.
Here's what I'm using ...
More or less the same that I wrote.
The rest is exactly as I've posted it, I've only added type signatures and specialize pragmas, as per Melissa's code,

{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
  where
    primes'   = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_) = tfold mergeSP (pairwise mergeSP mults)
    mults     = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'

maybe it's about memoization of primes'. She writes something about it in her code.
It's just that the mutating code tends to be convoluted, like in the example I mentioned of quicksort. One has to read the C code with good attention to understand it.
Convoluted is (often) an exaggeration. But I agree that the specification of 'what' is usually easier to understand than that of 'how'.
well put.
"Normal" Haskell is much more visually apparent, like
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [q,q+2*p..])
      where
        (h,~(_:t)) = span (< q) xs
        q          = p*p
Yes.
or
primes = 2: 3: sieve [] (tail primes) 5
  where
    sieve fs (p:ps) x = [i | i<- [x,x+2..q-2], a!i]
                        ++ sieve ((2*p,q):fs') ps (q+2)
      where
        q     = p*p
        mults = [ [y+s,y+2*s..q] | (s,y)<- fs]
        fs'   = [ (s,last ms) | ((s,_),ms)<- zip fs mults]
        a     = accumArray (\a b->False) True (x,q-2)
                           [(i,()) | ms<- mults, i<- ms]
Umm, really? I'd think if you see what that does, you won't have difficulties with a mutable array sieve.
You're right, bad example. :)

Daniel Fischer
On Wednesday 30 December 2009 20:46:57, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought.
I do not understand. Turner's sieve is
primes = sieve [2..]
  where sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
and the Postponed Filters is
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
      where (h,~(_:t)) = span (< p*p) xs
Are you saying they both exhibit the same complexity?
No. They don't. But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse because you don't have to take the book-keeping costs of your priority queue, heap or whatever into account), if you start crossing out
The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly demonstrated by the case of Turner/Postponed filters, the work that is done (of crossing numbers off) is the same - _when_ it is actually done - but Turner's starts out so _prematurely_ that it is busy doing nothing most of the time. Thus its function call overhead costs pile up enormously, overstaging the actual calculation. So analyzing that calculation in the premature execution setting is missing the point, although helpful after we fix this, with the Postponed Filters. _Only then_ the finer points of this algorithm's analysis can be applied - namely, of avoiding testing primes divisibility altogether.

And _if_ a fast cheap primality test were to have existed, the filtering versions would win over, because they progressively cull the input sequence so there would be no double hits as we have when merging the multiples (whether from lists or inside the PQ).
the multiples of p with 2*p, you have
sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]
crossings-out, that is Theta(bound*log (log bound)). If you eliminate multiples of some small primes a priori (wheel), you can reduce the constant factor significantly, but the complexity remains the same (you drop a few terms from the front of the sum and multiply the remaining terms with phi(n)/n, where n is the product of the excluded primes).
If you start crossing out at p^2, the number is
sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
The difference is basically sum (takeWhile (<= sqrt bound) primes), which I stupidly - I don't remember how - believed to cancel out the main term. It doesn't, it's O(bound/log bound), so the complexity is the same.
Now if you take a stream of numbers from which you remove composites, having a priority queue of multiples of primes, things are a little different. If you start crossing out at 2*p, when you are looking at n, you have more multiples in your PQ than if you start crossing out at p^2 (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more expensive. But updating the PQ is O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think it shouldn't change the complexity here either. I think this would have complexity O(bound*log bound*log (log bound)).
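A quick numeric check of the two crossing-out counts above (a sketch, not from the thread; the primes list is just the Postponed Filters version quoted earlier, and p*p <= bound stands in for p <= sqrt bound):

primes :: [Integer]
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
      where (h,~(_:t)) = span (< p*p) xs

-- crossings-out when starting at 2*p versus at p*p, per the two sums above
crossFrom2p, crossFromP2 :: Integer -> Integer
crossFrom2p bound = sum [bound `div` p - 1     | p <- takeWhile (\p -> p*p <= bound) primes]
crossFromP2 bound = sum [bound `div` p - (p-1) | p <- takeWhile (\p -> p*p <= bound) primes]
-- their difference is basically sum (takeWhile (\p -> p*p <= bound) primes), as noted above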
There are two questions here - where to start crossing off numbers, and when. If you'd start at 2*p maybe the overall complexity would remain the same but it'll add enormous overhead with all those duplicate multiples. No, the question is not where to start, but when. PQ might hide the problem until the memory blows up. Anything that we add that won't have any chance of contributing to the final result, is added for nothing and only drives the total cost up needlessly.
I was under the impression that the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes produced).
In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each
what _is_ different is divisibility testing vs composites removal, which follows from her in-depth analysis although it is never quite formulated in such words in her article. But nothing matters until the premature starting up is eliminated, and that key observation is missing from the article too - worse, it is brushed off with the casual remark that it is "a pleasing but minor optimization". Which remark, as you show here, is true in the imperative, mutable-storage setting, but is made in an article about functional code, in regard to the functional code of Turner's sieve. So the key understanding is overlooked.

Her article even adds the primes into the PQ prematurely itself, as soon as the prime is discovered (she fixes that in her ZIP package). With the PQ keeping these prematurely added elements deep inside its guts, the problem might not even manifest itself immediately, but the memory blow-up would eventually hit the wall (having the PQ contain all the preceding primes, instead of just those below the square root of a limit - all the entries above the square root not contributing to the calculation at all).

And what we're after here is the insight anyway. Skipping steps of natural development does not contribute to garnering an insight. The most prominent problems with Turner's /code/ _specification_ are the premature start-ups and the divisibility testing of primes (as Melissa O'Neill's analysis shows). Fix one, and you get Postponed Filters (which should really be used as a basis reference point for all the rest). Fix the other - and you've got the Euler's sieve:

primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [p*p, p*p+2*p..])
      where (h,~(_:t)) = span (< p*p) xs

Clear, succinct, and plenty efficient for an introductory textbook functional lazy code, of the same order of magnitude performance as PQ code on odds only.

Now comparing the PQ performance against _that_ would only be fair, and IMO would only add to its value - it is faster, has better asymptotics, and is greatly amenable to the wheel optimization right away.
prime only with its multiples (plus the next composite in the PQ version). In Turner's algorithm, we match each prime with all smaller primes (and each composite with all primes <= its smallest prime factor, but that's less work than the primes). That gives indeed a complexity of O(pi(bound)^2).
In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.
Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged multiples removal around 1.25..1.17 .

On Saturday 02 January 2010 14:13:29, Will Ness wrote:
Daniel Fischer
writes: On Wednesday 30 December 2009 20:46:57, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought.
I do not understand. Turner's sieve is
primes = sieve [2..]
  where sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
and the Postponed Filters is
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
      where (h,~(_:t)) = span (< p*p) xs
Are you saying they both exhibit the same complexity?
No. They don't. But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse because you don't have to take the book-keeping costs of your priority queue, heap or whatever into account), if you start crossing out
The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly demonstrated by the case of Turner/Postponed filters, the work that is done (of crossing numbers off) is the same
Crossing off is only part of the work. Most of the work is checking whether to cross off the number in this round. And Turner does a lot more of that than the postponed filters.
- _when_ it is actually done - but Turner's starts out so _prematurely_ that it is busy doing nothing most of the time.
It's not doing nothing. It's just doing a lot of superfluous work. It is _achieving_ nothing most of the time, though. Take 7919, the thousandth prime. The postponed filters decide to keep it when filtering out the multiples of 89, the twenty-fourth prime. Turner also divides it by all 975 primes in between. That is a lot of real but futile work.
Thus its function call overhead costs pile up enormously, overstaging the actual calculation.
It's not only the function calls. Division is expensive, too.
So analyzing that calculation in the premature execution setting is missing the point, although helpful after we fix this, with the Postponed Filters. _Only then_ the finer points of this algorithm's analysis can be applied - namely, of avoiding testing primes divisibility altogether. And _if_ a fast cheap primality test were to have existed, the filtering versions would win
Sorry, I can't follow. What's the point of a primality test here? Every number whose multiples we want to remove is prime, what's left to test?
over, because they progressively cull the input sequence so there would be no double hits as we have when merging the multiples (whether from lists or inside the PQ).
But there's a lot of list constructuion and deconstruction necessary for the Euler sieve. That may be more work than the multiple hits cause.
the multiples of p with 2*p, you have
sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]
crossings-out, that is Theta(bound*log (log bound)). If you eliminate multiples of some small primes a priori (wheel), you can reduce the constant factor significantly, but the complexity remains the same (you drop a few terms from the front of the sum and multiply the remaining terms with phi(n)/n, where n is the product of the excluded primes).
If you start crossing out at p^2, the number is
sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
The difference is basically sum (takeWhile (<= sqrt bound) primes), which I stupidly - I don't remember how - believed to cancel out the main term. It doesn't, it's O(bound/log bound), so the complexity is the same.
Now if you take a stream of numbers from which you remove composites, having a priority queue of multiples of primes, things are a little different. If you start crossing out at 2*p, when you are looking at n, you have more multiples in your PQ than if you start crossing out at p^2 (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more expensive. But updating the PQ is O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think it shouldn't change the complexity here either. I think this would have complexity O(bound*log bound*log (log bound)).
There are two questions here - where to start crossing off numbers, and when. If you'd start at 2*p maybe the overall complexity would remain the same but it'll add enormous overhead with all those duplicate multiples.
The additional duplicate multiples aren't the problem. Sure, the numbers having a prime divisor larger than the square root would be crossed off one additional time, but that isn't so much per se. The additional crossings off are O(bound), so they don't harm the time complexity. But they can have effects which multiply the running time by a large constant.
No, the question is not where to start, but when. PQ might hide the problem until the memory blows up. Anything that we add that won't have any chance of contributing to the final result, is added for nothing and only drives the total cost up needlessly.
Well, if you start crossing off at p^2 and feed your PQ from a separate prime generator, the memory blow-up is far away. E.g., when the main sieve is at 10^12, the PQ contains less than 80000 multiples. No space problem in sight. At 10^14, the PQ's size is still less than 700000. That's noticeable, but there's still plenty of space. If, on the other hand, you start crossing off at 2*p, when the main sieve is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more than 5.5 million. That starts to become a memory problem rather soon.
I was under impression that the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes produced).
In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each
what _is_ different is divisibility testing vs composites removal, which follows from her in-depth analysis although it is never quite formulated in such words in her article. But nothing matters until the premature starting up is eliminated, and that key observation is missing from the article too - worse, it is brushed off with the casual remark that it is "a pleasing but minor optimization". Which remark, as you show here, is true in the imperative, mutable-storage setting, but is made in an article about functional code, in regard to the functional code of Turner's sieve.
I think that remark was meant to apply to composite removal, not Turner's sieve. But even then, the 'minor' isn't adequate considering her PQ approach where, although it doesn't change time complexity (in theory), it changes space complexity - and that may affect running time drastically. Probably she was thinking of the typical array sieve when she made that remark.
So the key understanding is overlooked.
Her article even adds the primes into the PQ prematurely itself, as soon as the prime is discovered (she fixes that in her ZIP package). With the PQ keeping these prematurely added elements deep inside its guts, the problem might not even manifest itself immediately, but the memory blow-up would eventually hit the wall (having the PQ contain all the preceding primes, instead of just those below the square root of a limit - all the entries above the square root not contributing to the calculation at all).
And what we're after here is the insight anyway. Skipping steps of natural development does not contribute to garnering an insight. The most prominent problems with Turner's /code/ _specification_ are the premature start-ups and the divisibility testing of primes (as Melissa O'Neill's analysis shows). Fix one, and you get the Postponed Filters (which should really be used as the base reference point for all the rest). Fix the other - and you've got Euler's sieve:
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps (t `minus` tail [p*p, p*p+2*p..])
      where (h, ~(_:t)) = span (< p*p) xs
Clear, succinct, and plenty efficient for introductory-textbook functional lazy code, with performance of the same order of magnitude as the PQ code on odds only.
Now comparing the PQ performance against _that_ would only be fair, and IMO would only add to its value - it is faster, has better asymptotics, and is greatly amenable to the wheel optimization right away.
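(The snippet relies on `minus`; the tree-merging code elsewhere in the thread also uses `merge`. For anyone wanting to run it as is, a minimal sketch of the usual definitions on increasing lists:)

  minus :: Ord a => [a] -> [a] -> [a]
  minus xs@(x:xt) ys@(y:yt) = case compare x y of
      LT -> x : minus xt ys     -- keep x, it is not in ys
      EQ ->     minus xt yt     -- drop the common element
      GT ->     minus xs yt     -- skip y, it cannot occur in xs any more
  minus xs _ = xs

  merge :: Ord a => [a] -> [a] -> [a]
  merge xs@(x:xt) ys@(y:yt) = case compare x y of
      LT -> x : merge xt ys
      EQ -> x : merge xt yt     -- a deduplicating merge
      GT -> y : merge xs yt
  merge xs ys = xs ++ ys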
prime only with its multiples (plus the next composite in the PQ version). In Turner's algorithm, we match each prime with all smaller primes (and each composite with all primes <= its smallest prime factor, but that's less work than the primes). That gives indeed a complexity of O(pi(bound)^2).
In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.
Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged multiples removal around 1.25..1.17 .
I should really stop doing those calculations in my head, it seems I'm getting too old for that. Doing it properly, on paper, the cost for identifying p_k is pi(sqrt p_k) [trial division by all primes less than sqrt p_k]. p_k is approximately k*log k, so pi(sqrt p_k) is approximately (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k). Summing that, the cost for identifying the first n primes is Theta(n^1.5/log n). Using a feeder, i.e. primes = 2:3:sieve feederprimes [5, 7 .. ], where feederprimes are generated as above, to reduce memory pressure and GC impact, that fits pretty well with my measurements of the time to find p_k for several k between 500,000 and 20,000,000.
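Spelled out, such a feeder arrangement might look like the following sketch (a transcription under my own names, not Daniel's exact code); note that with optimisations the compiler may decide to share the two textually identical sieves, which would partly defeat the purpose:

  primesWithFeeder :: [Integer]
  primesWithFeeder = 2 : 3 : sieve (tail feeder) [5,7..]
    where
      -- the feeder is the ordinary self-referential postponed-filters sieve;
      -- only its prefix up to the square root of the current candidate is demanded
      feeder = 2 : 3 : sieve (tail feeder) [5,7..]
      sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
        where (h, ~(_:t)) = span (< p*p) xs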

Daniel Fischer
On Saturday, 02 January 2010, 14:13:29, Will Ness wrote:
Daniel Fischer
writes: On Wednesday, 30 December 2009, 20:46:57, Will Ness wrote:
Daniel Fischer
writes: On Tuesday, 29 December 2009, 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought.
I do not understand. Turner's sieve is
primes = sieve [2..]
  where
    sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]
and the Postponed Filters is
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x <- t, x `rem` p /= 0]
      where (h, ~(_:t)) = span (< p*p) xs
Are you saying they both exhibit the same complexity?
No. They don't. But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse because you don't have to take the book-keeping costs of your priority queue, heap or whatever into account), if you start crossing out
The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly demonstrated by the case of Turner/Postponed filters, the work that is done (of crossing numbers off) is the same
Crossing off is only part of the work. Most of the work is checking whether to cross off the number in this round. And Turner does a lot more of that than the postponed filters.
Exactly the point I tried to make. :)
- _when_ it is actually done - but Turner's starts out so _prematurely_ that it is busy doing nothing most of the time.
It's not doing nothing. It's just doing a lot of superfluous work. It is _achieving_ nothing most of the time, though.
again, yes. :)
Take 7919, the thousandth prime. The postponed filters decide to keep it when filtering out the multiples of 89, the twenty-fourth prime. Turner also divides it by all 975 primes in between. That is a lot of real but futile work.
yes.
Thus its function call overhead costs pile up enormously, overshadowing the actual calculation.
It's not only the function calls. Division is expensive, too.
yes, that's what I meant - the cost of calling all the functions that - we know in advance - will have nothing to do eventually.
So analyzing that calculation in the premature-execution setting misses the point, although it becomes helpful after we fix this, with the Postponed Filters. _Only then_ can the finer points of this algorithm's analysis be applied - namely, of avoiding the divisibility testing of primes altogether. And _if_ a fast, cheap primality test existed, the filtering versions would win
Sorry, I can't follow. What's the point of a primality test here? Every number whose multiples we want to remove is prime, what's left to test?
sorry for being abstruse. I meant that in a filtering sieve (i.e. postponed filters) vs. the multiples-removing sieves, if only a cheap primality test existed (by some _magic_) which _could_ be run on _every_ number (unlike the costly divisibility test), then the filtering sieves would win. Hypothetically.
over, because they progressively cull the input sequence so there would be no double hits as we have when merging the multiples (whether from lists or inside the PQ).
But there's a lot of list construction and deconstruction necessary for the Euler sieve.
yes. Unless, of course, a smart compiler recognizes there's only one consumer for the values each multiples-list is producing, and keeps it stripped down to a generator function and its current value. I keep thinking a smart compiler could eliminate all these "span" calls and replace them with just some pointer manipulation...
That may be more work than the multiple hits cause.
so that too would make filters win; only _if_ the cheap primality test existed!..
the multiples of p with 2*p, you have ...
There are two questions here - where to start crossing off numbers, and when. If you'd start at 2*p maybe the overall complexity would remain the same but it'll add enormous overhead with all those duplicate multiples.
The additional duplicate multiples aren't the problem. Sure, the numbers having a prime divisor larger than the square root would be crossed off one additional time, but that isn't so much per se. The additional crossings off are O(bound), so they don't harm the time complexity. But they can have effects which multiply the running time by a large constant.
yes, exactly what I wanted to say. :)
No, the question is not where to start, but when. PQ might hide the problem until the memory blows up. Anything that we add that won't have any chance of contributing to the final result, is added for nothing and only drives the total cost up needlessly.
Well, if you start crossing off at p^2 and feed your PQ from a separate prime generator, the memory blow-up is far away. E.g., when the main sieve is at 10^12, the PQ contains less than 80000 multiples. No space problem in sight. At 10^14, the PQ's size is still less than 700000. That's noticeable, but there's still plenty of space.
again, what I mean is, not _where_ I start crossing them off in a PQ, but _when_. The article's code starts crossing them off _at_ p^2 - by adding p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface until p^2 is considered for a prime; it'll lie dormant deep inside the queue's guts. When reaching 7919, a thousand (i.e. pi(7919)) entries will hang out inside the PQ - instead of just 24. A memory blowup. (This is of course fixed in Melissa's ZIP package.) Of course due to the nature of the PQ it might actually not hurt the performance for a while, depending on the particular PQ implementation. Complexity _and_ constant factors.
If, on the other hand, you start crossing off at 2*p, when the main sieve is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more than 5.5 million. That starts to become a memory problem rather soon.
here you don't have a choice of when to add it - you have to add it at p itself - so the problem is clear. But even when you cross off at p^2, the question remains of when you add the p^2 entry into the PQ. That was my point. The Postponed Filters code makes this clear, and thus hard to err on. Unfortunately, it wasn't present in the article.
I was under impression that the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes produced).
In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each
what _is_ different is divisibility testing vs composites removal, which follows from her in-depth analysis although it is never quite formulated in such words in her article. But nothing matters until the premature starting up is eliminated, and that key observation is missing from the article too - worse, it is brushed off with the casual remark that it is "a pleasing but minor optimization". Which remark, as you show here, is true in the imperative, mutable-storage setting, but is made in an article about functional code, in regard to the functional code of Turner's sieve.
I think that remark was meant to apply to composite removal, not Turner's sieve.
It is right there on page 2, right when the Turner's sieve is presented and discussed. The only explanation that I see is that she thought of it in regards to the imperative code, just as her analysis concentrates only on calculation aspects of the imperative code itself.
But even then, the 'minor' isn't adequate considering her PQ approach where, although it doesn't change time complexity (in theory), it changes space complexity - and that may affect running time drastically. Probably she was thinking of the typical array sieve when she made that remark.
In short, that remark was not referring to what it seemingly refers to, in the article, and as such was very confusing. It was a big STOP sign on the way to Postponed Filters - Euler's - Bird's merged multiples - tree-merging (with wheel) road of little steps, and used as a justification for her to make a big leap across the chasm towards the PQ code. So to speak. :)
So the key understanding is overlooked.
And what we're after here is the insight anyway. Skipping steps of natural development does not contribute to garnering an insight.
.. and that is of course a matter of personal preference. A genius is perfectly capable of making big leaps across chasms. Heck they might even be able to fly :)
In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.
Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged multiples removal around 1.25..1.17 .
I should really stop doing those calculations in my head, it seems I'm getting too old for that. Doing it properly, on paper, the cost for identifying p_k is pi(sqrt p_k) [trial division by all primes less than sqrt p_k]. p_k is approximately k*log k, so pi(sqrt p_k) is approximately (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k). Summing that, the cost for identifying the first n primes is Theta(n^1.5/log n).
that would correspond to what I've seen much better.
Using a feeder, i.e. primes = 2:3:sieve feederprimes [5, 7 .. ], where feederprimes are generated as above, to reduce memory pressure and GC impact, that fits pretty well with my measurements of the time to find p_k for several k between 500,000 and 20,000,000.
The question of a memory blowup with the tree-folding merge still remains. For some reason using its second copy as a feeder doesn't reduce the memory (as reported by a standalone compiled program; GHCi-reported values are useless) - it causes it to double.....

Will Ness
... It was a big STOP sign on the way to Postponed Filters - Euler's - Bird's merged multiples - tree-merging (with wheel) road of little steps, and used as a justification for her to make a big leap across the chasm towards the PQ code.
correction: "across the /supposed/ chasm". There is no chasm. There is a nice straight freeway, with rest stops and gas stations, and exits to local roads going across the county in every which way. :)
.. and that is of course a matter of personal preference. A genius is perfectly capable of making big leaps across chasms. Heck they might even be able to fly :)

On Sunday, 03 January 2010, 09:54:37, Will Ness wrote:
Exactly the point I tried to make. :)
again, yes. :)
yes.
yes, that's what I meant - the cost of calling all the functions that - we know in advance - will have nothing to do eventually.
8-)
But there's a lot of list construction and deconstruction necessary for the Euler sieve.
yes. Unless, of course, a smart compiler recognizes there's only one consumer for the values each multiples-list is producing, and keeps it stripped down to a generator function and its current value. I keep thinking a smart compiler could eliminate all these "span" calls and replace them with just some pointer manipulation...
Of course I'm no smart compiler, but I don't see how it could be even possible to replace the span calls with pointer manipulation when dealing with lazily generated (infinite, if we're really mean) lists. Even when you're dealing only with strict finite lists, it's not trivial to do efficiently.
That may be more work than the multiple hits cause.
so that too would make filters win; only _if_ the cheap primality test existed!..
the multiples of p with 2*p, you have ...
There are two questions here - where to start crossing off numbers, and when. If you'd start at 2*p maybe the overall complexity would remain the same but it'll add enormous overhead with all those duplicate multiples.
The additional duplicate multiples aren't the problem. Sure, the numbers having a prime divisor larger than the square root would be crossed off one additional time, but that isn't so much per se. The additional crossings off are O(bound), so they don't harm the time complexity. But they can have effects which multiply the running time by a large constant.
yes, exactly what I wanted to say. :)
*g* I wondered what you meant by the distinction between where and when to cross off.
No, the question is not where to start, but when. PQ might hide the problem until the memory blows up. Anything that we add that won't have any chance of contributing to the final result, is added for nothing and only drives the total cost up needlessly.
Well, if you start crossing off at p^2 and feed your PQ from a separate prime generator, the memory blow-up is far away. E.g., when the main sieve is at 10^12, the PQ contains less than 80000 multiples. No space problem in sight. At 10^14, the PQ's size is still less than 700000. That's noticeable, but there's still plenty of space.
again, what I mean is, not _where_ I start crossing them off in a PQ, but _when_. The article's code starts crossing them off _at_ p^2 - by adding p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface until p^2 is considered for a prime; it'll lie dormant deep inside the queue's guts. When reaching 7919, a thousand (i.e. pi(7919)) entries will hang out inside the PQ - instead of just 24. A memory blowup. (This is of course fixed in Melissa's ZIP package.) Of course due to the nature of the PQ it might actually not hurt the performance for a while, depending on the particular PQ implementation. Complexity _and_ constant factors.
It will start to have an impact pretty soon. Assuming at least one of the relevant PQ operations to be Theta(log size), each composite between ~400 and ~400000 (rough estimates) will take something like twice as long to handle. It will start to hurt really badly only a while later, though, as a guesstimate, with more than a million primes in the PQ, memory will have a hard time.
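A throwaway sketch of that factor-of-two guess (made-up names): with the eager insertion the queue holds about pi(n) entries instead of about pi(sqrt n), so a Theta(log size) queue operation slows down by roughly the ratio of the logarithms, here estimated with the crude x / ln x prime-counting approximation:

  pqSlowdown :: Double -> Double
  pqSlowdown n = log (piApprox n) / log (piApprox (sqrt n))
    where piApprox x = x / log x
  -- e.g. pqSlowdown 1e5 and pqSlowdown 4e5 both come out around 2.2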
If, on the other hand, you start crossing off at 2*p, when the main sieve is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more than 5.5 million. That starts to become a memory problem rather soon.
here you don't have a choice of when to add it - you have to add it at p itself - so the problem is clear. But even when you cross off at p^2, the question remains of when you add the p^2 entry into the PQ. That was my point.
The Postponed Filters code makes this clear, and thus hard to err on. Unfortunately, it wasn't present in the article.
I was under impression that the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes produced).
In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each
what _is_ different is divisibility testing vs composites removal, which follows from her in-depth analysis although it is never quite formulated in such words in her article. But nothing matters until the premature starting up is eliminated, and that key observation is missing from the article too - worse, it is brushed off with the casual remark that it is "a pleasing but minor optimization". Which remark, as you show here, is true in the imperative, mutable-storage setting, but is made in an article about functional code, in regard to the functional code of Turner's sieve.
I think that remark was meant to apply to composite removal, not Turner's sieve.
It is right there on page 2, right when the Turner's sieve is presented and discussed. The only explanation that I see is that she thought of it in regards to the imperative code, just as her analysis concentrates only on calculation aspects of the imperative code itself.
To be fair, she writes: "Let us first describe the original “by hand” sieve algorithm as practiced by Eratosthenes. ... The starting point of p^2 is a pleasing but minor optimization, which can be made because lower multiples will have already been crossed off when we found the primes prior to p. ... (This optimization does not affect the time complexity of the sieve, however, so its absence from the code in Section 1 is not our cause for worry.)" So it's in context of the imperative code (although rather numbers on paper than bits in RAM). For imperative (be it array or paper), it is indeed minor; for her PQ sieve, it can be considered minor from a theoretical point of view (doesn't change time complexity), but practically, anything that changes performance by a factor of two or more is only 'minor' for hopelessly inadequate algorithms (whether a computation would take ten or twenty billion years is indeed a minor difference; one hour or two is a major difference). However, as you say, the important point is not whether p's multiples get crossed off starting from 2*p or from p^2, but whether p's multiples get inserted into the queue when you look at p or at p^2.
But even then, the 'minor' isn't adequate considering her PQ approach where, although it doesn't change time complexity (in theory), it changes space complexity - and that may affect running time drastically. Probably she was thinking of the typical array sieve when she made that remark.
In short, that remark was not referring to what it seemingly refers to, in the article, and as such was very confusing. It was a big STOP sign on the way to Postponed Filters - Euler's - Bird's merged multiples - tree-merging (with wheel) road of little steps, and used as a justification for her to make a big leap across the chasm towards the PQ code.
So to speak. :)
So the key understanding is overlooked.
And what we're after here is the insight anyway. Skipping steps of natural development does not contribute to garnering an insight.
.. and that is of course a matter of personal preference. A genius is perfectly capable of making big leaps across chasms. Heck they might even be able to fly :)
In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.
Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged multiples removal around 1.25..1.17 .
I should really stop doing those calculations in my head, it seems I'm getting too old for that. Doing it properly, on paper, the cost for identifying p_k is pi(sqrt p_k) [trial division by all primes less than sqrt p_k]. p_k is approximately k*log k, so pi(sqrt p_k) is approximately (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k). Summing that, the cost for identifying the first n primes is Theta(n^1.5/log n).
that would correspond to what I've seen much better.
It should. Though empirical measurements don't always adhere exactly to the theoretical results, the deviation is usually small (for large enough input).
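Out of curiosity, a little sketch (made-up names) that evaluates that cost sum numerically and looks at its local growth exponent:

  trialCost :: Int -> Double
  trialCost n = sum [ 2 * sqrt (k / log k) | k <- map fromIntegral [2..n] ]

  localExponent :: Int -> Double
  localExponent n = logBase 2 (trialCost (2*n) / trialCost n)
  -- comes out a bit below 1.5 for n around a million, in line with the
  -- sub-1.5 empirical exponents quoted above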
Using a feeder, i.e. primes = 2:3:sieve feederprimes [5, 7 .. ], where feederprimes are generated as above, to reduce memory pressure and GC impact, that fits pretty well with my measurements of the time to find p_k for several k between 500,000 and 20,000,000.
The question of a memory blowup with the tree-folding merge still remains. For some reason using its second copy as a feeder doesn't reduce the memory (as reported by a standalone compiled program; GHCi-reported values are useless) - it causes it to double.....
I have a partial solution. The big problem is that the feeder holds on to the beginning of comps while the main runner holds on to a later part. Thus the entire segment between these two points must be in memory. So have two lists of composites (yeah, you know that, but it didn't work so far). But you have to force the compiler not to share them: enter -fno-cse. The attached code does that (I've also expanded the wheel), it reduces the memory requirements much (a small part is due to the larger wheel, a factor of ~5 due to the non-sharing). It still uses much more memory than the PQ, and while the PQ's memory requirements grow very slowly, the tree-fold merge's still grow rather fast (don't go much beyond the 10,000,000th prime), I'm not sure why.

Daniel Fischer
On Sunday, 03 January 2010, 09:54:37, Will Ness wrote:
Daniel Fischer
writes: But there's a lot of list construction and deconstruction necessary for the Euler sieve.
yes. Unless, of course, a smart compiler recognizes there's only one consumer for the values each multiples-list is producing, and keeps it stripped down to a generator function and its current value. I keep thinking a smart compiler could eliminate all these "span" calls and replace them with just some pointer manipulation...
Of course I'm no smart compiler, but I don't see how it could be even possible to replace the span calls with pointer manipulation when dealing with lazily generated (infinite, if we're really mean) lists. Even when you're dealing only with strict finite lists, it's not trivial to do efficiently.
I keep thinking that storage duplication with span, filter etc. is not really necessary, and can be replaced with some pointer chasing - especially when there's only one consumer present for the generated values.

What I mean is thinking of lists in terms of the producer/consumer paradigm, as objects supporting the { pull, peek } interface, keeping the generator inside that would produce the next value on a 'pull' request and keep it ready for any 'peek's.

Euler's sieve is

  sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
    where (h,t) = span (< p*p) xs

Everything lives only through access, so (sieve (tail primes) [5,7..]) would create an object with a generator which has the 'span' logic inlined:

  sieve ps xs = make a producer such that
    p := pull ps      -- alter ps as well (actually pull a value from it)
    q := p*p
    peek   = x := peek xs
             if x < q then x else peek (remake self)
    pull   = x := peek xs
             if x < q then pull xs else pull (remake self)
    remake = ys := minus xs (intsFromBy q (2*p))
             self := sieve ps ys

Here the only things that get created are the 'minus' nodes, which essentially maintain pointers into the two streams that they consume. 'intsFromBy' only has to maintain two integers inside it (currentVal and step), as there's no need for it to maintain any storage for its results - they are immediately consumed. A persistent list would be represented by a different kind of producer, which would be given storage to operate on upon creation (as would a top-level variable like 'primes').

The real difference here is between those producers whose values will only be consumed once, by one specific consumer, and those whose values may be needed more than once, and so really need to be maintained in some storage. If not - span, filter, map, whatever - they all are just little modifiers on top of the real producers, which may or may not also have actual storage maintained by them.
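For what it's worth, that {pull, peek} interface can be rendered directly as a small Haskell sketch (made-up names; it is, of course, just an infinite stream in disguise):

  -- 'peek' shows the next value without consuming it; 'pull' steps to the rest
  data Producer a = Producer { peek :: a, pull :: Producer a }

  -- an arithmetic progression needs only its current value and step as state
  intsFromBy :: Integer -> Integer -> Producer Integer
  intsFromBy start step = Producer start (intsFromBy (start + step) step)

  -- 'minus' as a modifier over two producers of increasing values: it keeps
  -- no storage of its own, it only chases the two underlying producers
  minusP :: Ord a => Producer a -> Producer a -> Producer a
  minusP xs ys = case compare (peek xs) (peek ys) of
      LT -> Producer (peek xs) (minusP (pull xs) ys)
      EQ -> minusP (pull xs) (pull ys)
      GT -> minusP xs (pull ys)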
again, what I mean is, not _where_ I start crossing them off in a PQ, but _when_. The article's code starts crossing them off _at_ p^2 - by adding p^2+2p into the PQ - _as_ _soon_ as p itself is reached. It won't surface until p^2 is considered for a prime; it'll lie dormant deep inside the queue's guts. When reaching 7919, a thousand (i.e. pi(7919)) entries will hang out inside the PQ - instead of just 24. A memory blowup. (This is of course fixed in Melissa's ZIP package.) Of course due to the nature of the PQ it might actually not hurt the performance for a while, depending on the particular PQ implementation. Complexity _and_ constant factors.
It will start to have an impact pretty soon. Assuming at least one of the relevant PQ operations to be Theta(log size), each composite between ~400 and ~400000 (rough estimates) will take something like twice as long to handle. It will start to hurt really badly only a while later, though, as a guesstimate, with more than a million primes in the PQ, memory will have a hard time.
Exactly!
If, on the other hand, you start crossing off at 2*p, when the main sieve is at 10^7, the size of the PQ is > 650000, at 10^8, the size is more than 5.5 million. That starts to become a memory problem rather soon.
here you don't have a choice of when to add it - you have to add it at p itself - so the problem is clear. But even when you cross off at p^2, the question remains of when you add the p^2 entry into the PQ. That was my point.
Postponed Filters code makes this clear, and thus hard to err on. Unfortunately, it wasn't present _in_ the article.
I think that remark was meant to apply to composite removal, not Turner's sieve.
It is right there on page 2, right when the Turner's sieve is presented and discussed. The only explanation that I see is that she thought of it in regards to the imperative code, just as her analysis concentrates only on calculation aspects of the imperative code itself.
To be fair, she writes:
"Let us first describe the original “by hand” sieve algorithm as practiced by Eratosthenes. ... The starting point of p^2 is a pleasing but minor optimization, which can be made because lower multiples will have already been crossed off when we found the primes prior to p. ... (This optimization does not affect the time complexity of the sieve, however, so its absence from the code in Section 1 *is not our cause for worry*.)"
A-HA! But its absence from _*that*_ _*code*_ WAS the *major* cause for worry, as dealing with it worked wonders on its complexity and constant factors. This remark only makes sense in the imperative, mutable-storage setting, as we've already established.
So it's in context of the imperative code (although rather numbers on paper than bits in RAM). For imperative (be it array or paper), it is indeed minor; for her PQ sieve, it can be considered minor from a theoretical point of view (doesn't change time complexity), but practically, anything that changes performance by a factor of two or more is only 'minor' for hopelessly inadequate algorithms (whether a computation would take ten or twenty billion years is indeed a minor difference; one hour or two is a major difference).
++
However, as you say, the important point is not whether p's multiples get crossed off starting from 2*p or from p^2, but whether p's multiples get inserted into the queue *when* you look at p or at p^2.
Exactly! Not _where_, but _when_ (in a lifetime of a computational process). And this (I would say, important) observation was missing from the article precisely because of her focusing on the analysis of the _imperative_ algorithm. But if we focus on functional, it is a logical thing to realise, and when implemented shows us the road to the next improvement by a gradual stepwise development. A matter of personal preference. :)
... the cost for identifying p_k is pi(sqrt p_k) [trial division by all primes less than sqrt p_k]. p_k is approximately k*log k, so pi(sqrt p_k) is approximately (sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k). Summing that, the cost for identifying the first n primes is Theta(n^1.5/log n).
that would correspond to what I've seen much better.
It should.
:)
The question of a memory blowup with the tree-folding merge still remains. For some reason using its second copy as a feeder doesn't reduce the memory (as reported by a standalone compiled program; GHCi-reported values are useless) - it causes it to double.....
I have a partial solution. The big problem is that the feeder holds on to the beginning of comps while the main runner holds on to a later part. Thus the entire segment between these two points must be in memory. So have two lists of composites (yeah, you know that, but it didn't work so far).
I did. I duplicated everything. The reported memory was twice as big. :|
But you have to force the compiler not to share them: enter -fno-cse. The attached code does that (I've also expanded the wheel), it reduces the memory requirements much (a small part is due to the larger wheel, a factor of ~5 due to the non-sharing). It still uses much more memory than the PQ, and while the PQ's memory requirements grow very slowly, the tree-fold merge's still grow rather fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
Great! I will look into the code. Thanks! :) :)
Attachment (V13Primes.hs): text/x-haskell, 3621 bytes

Will Ness wrote:
I keep thinking that storage duplication with span, filter etc. is not really necessary, and can be replaced with some pointer chasing - especially when there's only one consumer present for the generated values.
What I mean is thinking of lists in terms of the producer/consumer paradigm, as objects supporting the { pull, peek } interface, keeping the generator inside that would produce the next value on a 'pull' request and keep it ready for any 'peek's.
Euler's sieve is
sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..]) where (h,t) = span (< p*p) xs
[...]
The real difference here is between those producers whose values will only be consumed once, by one specific consumer, and those whose values may be needed more than once, and so really need to be maintained in some storage. If not - span, filter, map, whatever - they all are just little modifiers on top of the real producers, which may or may not also have actual storage maintained by them.
(I haven't followed the whole thread, but hopefully I have enough grasp of it to make a useful remark. :))

Concerning lists as producer/consumer, I think that's exactly what lazy evaluation is doing. Neither filter, map nor span evaluates and stores more list elements than strictly necessary. Sure, creating a list head only to immediately consume it is somewhat inefficient -- and the target of stream fusion[1] -- but this is an overhead of how list elements are stored, not how many.

You can try to implement the Euler sieve with producers by using a type like

  data Producer a = forall s. Producer { state :: !s, next :: s -> s, value :: s -> a }

but I think this will be quite difficult; it's not clear what and thus how big the state will be. (See [1] for choosing a good type.)

Concerning the sieves, there is a fundamental difference between the imperative sieve and the functional sieves, regardless of whether the latter start at p or p^2 or use a priority queue. Namely, the imperative sieve makes essential use of *pointer arithmetic*. The key point is that in order to cross off the multiples

  p, 2*p, 3*p, ...

of a prime, the algorithm can directly jump from the (k*p)-th to the (k*p+p)-th array element by adding p to the index. The functional versions can never beat that because they can't just jump over p constructors of a data structure in O(1) time.

[1]: http://www.cse.unsw.edu.au/~dons/papers/CLS07.html

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com

Heinrich Apfelmus
Will Ness wrote:
I keep thinking that storage duplication with span, filter etc. is not really necessary, and can be replaced with some pointer chasing - especially when there's only one consumer present for the generated values.
What I mean is thinking of lists in terms of the producer/consumer paradigm, as objects supporting the { pull, peek } interface, keeping the generator inside that would produce the next value on a 'pull' request and keep it ready for any 'peek's.
Euler's sieve is
sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..]) where (h,t) = span (< p*p) xs
[...]
The real difference here is between those producers whose values will only be consumed once, by one specific consumer, and those whose values may be needed more than once, and so really need to be maintained in some storage. If not - span, filter, map, whatever - they all are just little modifiers on top of the real producers, which may or may not also have actual storage maintained by them.
(I haven't followed the whole thread, but hopefully I have enough grasp of it to make a useful remark. :))
Concerning lists as producer/consumer, I think that's exactly what lazy evaluation is doing. Neither filter, map nor span evaluates and stores more list elements than strictly necessary.
I always suspected as much, but was once told that Chris Okasaki has shown that any filter etc must allocate its own storage. With the peek/pull they don't have to, if they are nested, and the deepest one from the real storage gets pulled through some pointer chasing eventually. Span isn't so easily compiled out too or is it? But that might be a minor point. For me, a real smart compiler is one that would take in e.g. (sum $ take n $ cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs maybe (just an aside). Such a smart compiler might even be able to derive a well performing code right from the Turner's sieve. :)
Sure, creating a list head only to immediately consume it is somewhat inefficient -- and the target of stream fusion[1] -- but this is an overhead of how list elements are stored, not how many.
it might be equivalent to the (imagined) producer's storing its 'current' value inside its frame. How much can we rely on the run-time to actually destroy all the passed-over elements and not hang on to them for some time? Is this that compiler switch that Daniel mentioned? Is it reliable?
You can try to implement the Euler sieve with producers by using a type like
data Producer a = forall s. Producer { state :: !s, next :: s -> s, value :: s -> a }
but I think this will be quite difficult; it's not clear what and thus how big the state will be. (See [1] for choosing a good type.)
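A small usage sketch of a Producer type like the one quoted above (written without the record syntax and strictness annotation, and with made-up names) - the state is hidden, so a consumer can only step it and observe the current value:

  {-# LANGUAGE ExistentialQuantification #-}

  data Producer a = forall s. Producer s (s -> s) (s -> a)

  countFrom :: Integer -> Producer Integer
  countFrom n = Producer n (+1) id

  observe :: Int -> Producer a -> [a]
  observe k (Producer s next value) = map value (take k (iterate next s))
  -- e.g.  observe 5 (countFrom 10)  ==  [10,11,12,13,14]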
I did that once in Scheme, as per SICP, with 'next' hanging in a stream's tail. Put filters and maps on top of that (inside the new 'next' actually). But that used Scheme's lists as storage. Another one was actual producers/modifiers with a {pull,peek} interface. It even produced some primes, and some Hamming numbers. Then I saw Haskell, and thought I'd get all that for free with its equational reasoning. But I get the impression that GHC isn't working through equational reasoning?.. I see all this talk about thunks etc.
Concerning the sieves, there is a fundamental difference between the imperative sieve and the functional sieves, regardless of whether the latter start at p or p^2 or use a priority queue. Namely, the imperative sieve makes essential use of *pointer arithmetic*. The key point is that in order to cross off the multiples
p, 2*p, 3*p, ...
of a prime, the algorithm can directly jump from the (k*p)-th to the (k*p+p)-th array element by adding p to the index. The functional versions can never beat that because they can't just jump over p constructors of a data structure in O(1) time.
We can directly jump to the next multiple too, it is called (+). :) :)

But seriously, the real issue is that we have to merge the produced streams of multiples, while the mutable-storage code works on the same one, so its "merging cost" is zero. And even if we are smart enough to merge them in a tree-like fashion, we still have no (or little) control over the compiler's representation of lists and retention of their content and whether it performs stream fusion or not (if we use lists).

If you could take a look at the tree-merging primes and why it uses too much memory, it would be great. The code is in Daniel's post to which I replied, or on the haskellwiki Prime_numbers page (there in its rudimentary form). It's a tangent to your VIP code, where instead of the People structure an ordered list is just maintained as a split pair, of its known (so far, finite) prefix and the rest of it. Then under merging these split pairs form a monoid, so they can be rearranged into a tree. If you haven't seen it yet, it uses a different folding structure to your code, with a lower total cost of multiples production (estimated as Sum (1/p)*depth):

  tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` pairwise f xs
  comps = tfold $ pairwise mergeSP multips

But aside from the memory problem (about 50M vs Melissa's 2M), for the first few million primes produced it has almost exactly the same asymptotics as her PQ code and runs on par with it, compiled (and 2.5x faster when interpreted, in GHCi). It is also clear and concise. :) :)

I think I'll have to try out your code (amended with a new folding structure) and see if it has less memory problems maybe. I assume it is your code on the haskellwiki page. (?)

Cheers,
[1]: http://www.cse.unsw.edu.au/~dons/papers/CLS07.html
Regards, Heinrich Apfelmus

For me, a real smart compiler is one that would take in e.g. (sum $ take n $ cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs maybe (just an aside).
(Also an aside, I couldn't resist...) Then I'm sure you'd say that Feldspar [1] has a smart compiler :) The above expression written in Feldspar and the resulting C code can be found here: http://hpaste.org/fastcgi/hpaste.fcgi/view?id=15592#a15593 The C code is somewhat complicated by the fact that Feldspar doesn't have infinite vectors. Feldspar usually works well on small examples like this one, but we're very much lacking bigger examples, so I can't advise you to use it for prime numbers just yet. / Emil [1] http://feldspar.sourceforge.net/

Emil Axelsson
For me, a real smart compiler is one that would take in e.g. (sum $ take n $ cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs maybe (just an aside).
(Also an aside, I couldn't resist...)
Then I'm sure you'd say that Feldspar [1] has a smart compiler :)
but it didn't produce

  f n m = if n < m then n*(n+1) `div` 2
          else let (q,r) = quotRem n m
               in  q*(m*(m+1) `div` 2) + r*(r+1) `div` 2

:)
The above expression written in Feldspar and the resulting C code can be found here:
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=15592#a15593
/ Emil

Will Ness skrev:
Emil Axelsson
writes: For me, a real smart compiler is one that would take in e.g. (sum $ take n $ cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs maybe (just an aside). (Also an aside, I couldn't resist...)
Then I'm sure you'd say that Feldspar [1] has a smart compiler :)
but it didn't produce
f n m = if n < m then n*(n+1) `div` 2
        else let (q,r) = quotRem n m
             in  q*(m*(m+1) `div` 2) + r*(r+1) `div` 2
:)
Ah, I see... Yes, that would be a very smart compiler :) / Emil
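(For anyone who wants to convince themselves of that formula, a throwaway check - made-up names - against the original expression; integer division is safe since n*(n+1) is always even:)

  import Data.List (genericTake)

  sumTakeCycle, closedForm :: Integer -> Integer -> Integer
  sumTakeCycle n m = sum (genericTake n (cycle [1..m]))
  closedForm n m
    | n < m     = n*(n+1) `div` 2
    | otherwise = let (q,r) = n `quotRem` m
                  in  q * (m*(m+1) `div` 2) + r*(r+1) `div` 2
  -- e.g.  and [ sumTakeCycle n m == closedForm n m | n <- [0..50], m <- [1..20] ]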

On Monday, 04 January 2010, 16:30:18, Will Ness wrote:
Heinrich Apfelmus
writes: (I haven't followed the whole thread, but hopefully I have enough grasp of it to make a useful remark. :))
Concerning lists as producer/consumer, I think that's exactly what lazy evaluation is doing. Neither filter, map nor span evaluates and stores more list elements than strictly necessary.
I always suspected as much, but was once told that Chris Okasaki has shown that any filter etc must allocate its own storage. With the peek/pull they don't have to, if they are nested, and the deepest one from the real storage gets pulled through some pointer chasing eventually. Span isn't so easily compiled out too or is it? But that might be a minor point.
For me, a real smart compiler is one that would take in e.g. (sum $ take n $ cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs maybe (just an aside).
Such things just aren't common enough. If you start letting the compiler look for patterns such as this which can be transformed into a simple formula, compile times would explode.
Such a smart compiler might even be able to derive a well performing code right from the Turner's sieve. :)
Sure, creating a list head only to immediately consume it is somewhat inefficient -- and the target of stream fusion[1] -- but this is an overhead of how list elements are stored, not how many.
it might be equivalent to the (imagined) producer's storing its 'current' value inside its frame.
How much can we rely on the run-time to actually destroy all the passed-over elements and not hang on to them for some time?
If they're really passed over, they very probably won't survive the next GC.
Is this that compiler switch that Daniel mentioned? Is it reliable?
No, it's something different. Not entirely unrelated. The -fno-cse turns off Common Subexpression Elimination (rather sharing than elimination). That is, if you have

  f x = g (expensive x) y z (h (expensive x))

the compiler can see the common subexpression (expensive x) on the RHS and decide to share it, i.e. calculate it only once:

  f x = let e = expensive x in g e y z (h e)

That may or may not be a good idea. If e is small (say an Int) and expensive isn't a blatant lie, it is a *very* good idea. But if e is a loong list that expensive x lazily produces and that is consumed by g and h in the right manner, it is often a bad idea because h might not start to consume before g has demanded a long prefix of the list and kaboom, Heap exhausted.

If you turn on optimisations, the compiler does some looking for common subexpressions to share. There are some heuristics to decide when to share, but they're far from infallible. So, if you have a situation like above, and the compiler heuristics indicate that sharing would be good although it isn't, you're hosed. So you have the option to tell the compiler "try hard to optimise, but don't do CSE (because I know it would be bad here)" {- that can be done on a per-file basis, it would be cool if it could be done on a per-function basis -}.

Now if you have a list producer and a consumer, without fusion, it goes like

  Consumer: Gimme
  Producer: creates cons cell, puts value, next cons cell (how to produce more)
  Consumer: takes value, does something with it, gimme more.

Stream fusion is about eliminating the cons cell creating and value passing, to fuse production and consumption into a nice loop. That is of course impossible if the produced list is shared between two (or more) consumers.
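To make the failure mode concrete, a tiny made-up example of the shape described above (not code from the thread): in 'shared' the single list xs is still needed by the second sum while the first sum traverses it, so the whole list is retained; in 'unshared' each walk has its own list and can be collected behind itself - unless CSE commons the two identical definitions up again, which is what -fno-cse prevents.

  shared :: Integer
  shared = let xs = [1..10000000] :: [Integer]
           in  sum xs + sum (map (*2) xs)

  unshared :: Integer
  unshared = sum xs1 + sum (map (*2) xs2)
    where
      xs1 = [1..10000000] :: [Integer]
      xs2 = [1..10000000] :: [Integer]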

Daniel Fischer
On Monday, 04 January 2010, 16:30:18, Will Ness wrote:
For me, a real smart compiler is one that would take in e.g. (sum $ take n $ cycle $ [1..m]) and spill out a straight up math formula, inside a few ifs maybe (just an aside).
Such things just aren't common enough. If you start letting the compiler look for patterns such as this which can be transformed into a simple formula, compile times would explode.
I was thinking more along the lines of an inferencing compiler, proving new theorems about the types and applying that knowledge in simplifying the expressions. This would take time, so it should be a part of some interactive system, maybe kind of like Lisp has. In such a setting, the underlying compiler could first produce a quick-n-dirty version, and would continue working in the background whenever the system is not busy, trying to improve the executable. Such a system would probably have to distinguish, at the type level, between [1..m] ; cycle [1..m] ; take n [1..m] ; etc. These would all be not just functions, but parts of a type's (here list) behaviour with automatically deduced semantics. What would such a type system be called?
The -fno-cse turns off Common Subexpression Elimination (rather sharing than elimination).
That is, if you have
f x = g (expensive x) y z (h (expensive x))
the compiler can see the common subexpression (expensive x) on the RHS and decide to share it, i.e. calculate it only once:
f x = let e = expensive x in g e y z (h e)
........
thanks for the in-depth explanation! :)
Now if you have a list producer and a consumer, without fusion, it goes like

  Consumer: Gimme
  Producer: creates cons cell, puts value, next cons cell (how to produce more)
  Consumer: takes value, does something with it, gimme more.
Stream fusion is about eliminating the cons cell creating and value passing, to fuse production and consumption into a nice loop. That is of course impossible if the produced list is shared between two (or more) consumers.
I would imagine so. Do I get this fusion on lists for free from the compiler, or do I have to recode for that? (haven't yet time to look into the article mentioned).

On Tuesday, 05 January 2010, 10:33:19, Will Ness wrote:
Such a system would probably have to distinguish, at the type level, between [1..m] ; cycle [1..m] ; take n [1..m] ; etc. These would all be not just functions, but parts of a type's (here list) behaviour with automatically deduced semantics.
What would such a type system be called?
Seriously complicated, I think. Don't get me wrong, a system with such capabilities would be A.W.E.S.O.M.E. I just can't see it happening anytime soon.
I would imagine so. Do I get this fusion on lists for free from the compiler, or do I have to recode for that? (haven't yet time to look into the article mentioned).
Without optimisations, hardly ever if at all (needs a compiler expert to know). With optimisations, sometimes. But it doesn't always see the possibility where it would with recoding in the right style (dons is expert in that). With stream fusion, more often. But I think it would again be not very hard to hide opportunities for fusion by your coding style.

Will Ness wrote:
Heinrich Apfelmus writes:
Concerning lists as producer/consumer, I think that's exactly what lazy evaluation is doing. Neither filter, map nor span evaluates and stores more list elements than strictly necessary.
I always suspected as much, but was once told that Chris Okasaki has shown that any filter etc must allocate its own storage. With the peek/pull they don't have to, if they are nested, and the deepest one from the real storage gets pulled through some pointer chasing eventually. Span isn't so easily compiled out too or is it? But that might be a minor point.
Hm? In my world view, there is only reduction to normal form and I don't see how "allocate its own storage" fits in there. Okasaki having shown something to that end would be new to me?
I did that once in Scheme, as per SICP, with 'next' hanging in a stream's tail. Put filters and maps on top of that (inside the new 'next' actually). But that used Scheme's lists as storage. Another one was actual producers/modifiers with a {pull,peek} interface. It even produced some primes, and some Hamming numbers. Then I saw Haskell, and thought I'd get all that for free with its equational reasoning.
But I get the impression that GHC isn't working through equational reasoning?.. I see all this talk about thunks etc.
Sure it does. Concerning the thunks, they're part of the implementation of the reduction model (call-by-need aka lazy evaluation).
Concerning the sieves, there is a fundamental difference between the imperative sieve and the functional sieves, regardless of whether the latter start at p or p^2 or use a priority queue. [...]
We can directly jump to the next multiple too, it is called (+). :) :)
But seriously, the real issue is that we have to merge the produced streams of multiples, while the mutable-storage code works on the same one, so its "merging cost" is zero. And even if we are smart enough to merge them in a tree-like fashion, we still have no (or little) control over the compiler's representation of lists and retention of their content and whether it performs stream fusion or not (if we use lists).
Not quite, I think there are two things going on:

1. In contrast to the standard imperative version, we are implementing an on-line algorithm that produces arbitrarily many primes on demand. Any imperative on-line version will need to think about storage for pending prime filters as well.

2. Even the imperative algorithm can be interpreted as "merging" arrays, just that the composites are magically merged at the right place (at distance p from each other) because arrays offer O(1) "jumps". In contrast, the functional version needs O(log something) time to calculate where to put the next composite.
If you could take a look at the tree-merging primes and why it uses too much memory, it would be great.
Fortunately, Daniel already took a detailed look. :) But I also have a few remarks.
The code is in Daniel's post to which I replied, or on haskellwiki Prime_numbers page (there in its rudimentary form). It's a tangent to your VIP code, where instead of People structure an ordered list is just maintained as a split pair, of its known (so far, finite) prefix and the rest of it.
Aww, why remove the cutesy name? The VIPs will be angry for being ignored!

Jest aside, note that putting the crowd at the end of the list where it belongs has a reason: you have a guarantee that crowds won't take any memory until they actually appear after all the VIPs. If you put both VIPs and the crowd into a pair, it's easier to get space leaks.

In particular, I think that your mergeSP has a laziness bug, it should be

  mergeSP ~(a,b) ~(c,d) = ...

This is a cure for the (potential) problem that spMerge first performs a pattern match / comparison and then returns a pair constructor instead of the other way round. For instance, your second case

  spMerge u [] = ([], u)

should actually behave like

  spMerge u v = (a,b)
    where
      a = if null v then [] else ...
      b = if null v then u  else ...

(I don't recommend rewriting spMerge in this fashion, the lazy pattern in mergeSP will do the trick.)

Now, I'm not sure whether this is actually a problem or not. But the version shown here, with two lazy patterns instead of just one, is much closer to how the original People structure behaves.
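The difference a lazy pattern makes can be seen in isolation with a two-line toy (made-up names, nothing to do with the sieve itself): the strict version must see a pair constructor before returning anything, while the lazy-pattern version returns its own pair constructor first and only forces the argument when a component is actually demanded.

  strictSwap, lazySwap :: (a,b) -> (b,a)
  strictSwap (x,y)  = (y,x)
  lazySwap   ~(x,y) = (y,x)
  -- e.g.  case lazySwap undefined of (_,_) -> "ok"    yields "ok",
  --       case strictSwap undefined of (_,_) -> "ok"  hits the undefined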
Then under merging these split pairs form a monoid, so they can be rearranged into a tree. If you haven't seen it yet, it uses a different folding structure to your code, with a lower total cost of multiples production (estimated as Sum (1/p)*depth):
tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` pairwise f xs
comps = tfold $ pairwise mergeSP multips
The idea being that each prime produces 1/p composites each "turn" and it takes time depth to trickle it to the top? Sounds reasonable, but remember that a prime p does not start to generate composites until the "turn count" has reached p^2, so it might occupy a place of low depth in the tree that is better spent on the previous primes. But your experiments seem to tell that it works anyway.

By the way, I think that both tfold and pairwise need to be lazier.

  tfold f ~(x: ~(y: ~(z:zs))) = (x `f` (y `f` z)) `f` pairwise f zs

  pairwise f ~(x: ~(y:ys)) = f x y : pairwise f ys

Otherwise, large parts of the tree might be generated too early and hog memory.

Also, a small remark on style: I don't like the repetition of mergeSP in

  compos = fst $ tfold mergeSP (pairwise mergeSP (multip ps))

After all, the right hand side is just another tree fold and I would clearly mark it as such:

  compos = fst . superSmartTreeFold mergeSP . multip $ ps

  superSmartTreeFold f = tfold f . pairwise f

;)
I think I'll have to try out your code (amended with a new folding structure) and see if it has less memory problems maybe. > I assume it is your code on the haskellwiki page. (?)
Yes, I put the code snippet on the wiki. And never tested it on large numbers. ;) Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

On Thursday, 07 January 2010, 11:43:44, Heinrich Apfelmus wrote:
Will Ness wrote:
Heinrich Apfelmus writes:
Concerning lists as producer/consumer, I think that's exactly what lazy evaluation is doing. Neither filter, map nor span evaluates and stores more list elements than strictly necessary.
I always suspected as much, but was once told that Chris Okasaki has shown that any filter etc must allocate its own storage. With the peek/pull they don't have to, if they are nested, and the deepest one from the real storage gets pulled through some pointer chasing eventually. Span isn't so easily compiled out too or is it? But that might be a minor point.
Hm? In my world view, there is only reduction to normal form and I don't see how "allocate its own storage" fits in there. Okasaki having shown something to that end would be new to me?
Perhaps what was meant was "storage must be allocated for each filter" (which, however, seems trivial).
We can directly jump to the next multiple too, it is called (+). :) :)
But seriously, the real issue is that we have to merge the produced streams of multiples, while the mutable-storage code works on the same one, so its "merging cost" is zero. And even if we are smart enough to merge them in a tree-like fashion, we still have no (or little) control over the compiler's representation of lists and retention of their content and whether it performs stream fusion or not (if we use lists).
Not quite, I think there are two things going on:
1. In contrast to the standard imperative version, we are implementing an on-line algorithm that produces arbitrarily many primes on demand. Any imperative on-line version will need to think about storage for pending prime filters as well.
If you consider a segmented array sieve an on-line algorithm, you can omit the storage for prime filters, at the expense of speed (a rough sketch of what such a sieve could look like follows after point 2). It will not be a real Eratosthenes sieve anymore, but it won't be too horribly slow, I think. If you're using an Atkin sieve, the performance penalty should be even lower.
2. Even the imperative algorithm can be interpreted as "merging" arrays, just that the composites are magically merged at the right place (at
And they're merged asynchronously. Yet more magic :)
distance p from each other) because arrays offer O(1) "jumps". In contrast, the functional version needs O(log something) time to calculate where to put the next composite.
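For concreteness, here is a rough sketch (my own code, not something posted in this thread) of the kind of segmented array sieve meant in the remark above; the segment width and the trial-division feed for the base primes are arbitrary illustrative choices:

import Data.Array.Unboxed

primesSeg :: [Int]
primesSeg = 2 : go 3
  where
    -- small base-prime feed by trial division; it replaces the per-prime
    -- "filter" state of the list-based versions
    feed  = 3 : filter isP [5,7..]
    isP n = all (\p -> n `mod` p /= 0) (takeWhile (\p -> p*p <= n) feed)

    -- sieve one segment [lo..hi] with the base primes up to sqrt hi,
    -- emit the survivors, then move on to the next segment
    go lo = [n | (n, True) <- assocs seg, odd n] ++ go (hi + 1)
      where
        hi = lo + 1999
        ps = takeWhile (\p -> p*p <= hi) feed
        seg :: UArray Int Bool
        seg = accumArray (\_ _ -> False) True (lo, hi)
                [ (m, ())
                | p <- ps
                , let s = max (p*p) (((lo + p - 1) `div` p) * p)
                , let s' = if even s then s + p else s
                , m <- [s', s' + 2*p .. hi] ]

Each segment is a fresh unboxed array, so the only state carried from one step to the next is the feed of base primes - the pending per-prime filters disappear, which is the trade-off being described.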
If you could take a look at the tree-merging primes and why it uses too much memory, it would be great.
Fortunately, Daniel already took a detailed look. :) But I also have a few remarks.
The code is in Daniel's post to which I replied, or on the haskellwiki Prime_numbers page (there in its rudimentary form). It's a tangent to your VIP code, where instead of the People structure an ordered list is just maintained as a split pair of its known (so far finite) prefix and the rest of it.
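To restate that representation in types (just my shorthand, not code from the thread): a split pair stands for the ordered list you get back by appending its two halves:

type SplitPair a = ([a], [a])   -- (known finite prefix, rest of the list)

-- the list a split pair denotes
toList :: SplitPair a -> [a]
toList (pre, rest) = pre ++ rest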
Aww, why remove the cutesy name? The VIPs will be angry for being ignored!
You'll be happy to learn that they return, albeit in a different form, then :)
Jest aside, note that putting the crowd at the end of the list where it belongs has a reason: you have a guarantee that crowds won't take any memory until they actually appear after all the VIPs. If you put both VIPs and the crowd into a pair, it's easier to get space leaks.
Indeed. It's the combination of pairs and the bushy merge trees that lets the memory explode. I still don't completely understand the behaviour, but I managed to squash the space leak :D (see below). Thanks for the idea.
In particular, I think that your mergeSP has a laziness bug, it should be
mergeSP ~(a,b) ~(c,d) = ...
That doesn't help. The pattern match on the first argument doesn't hurt.
This is a cure for the (potential) problem that spMerge first performs a pattern match / comparison and then returns a pair constructor instead of the other way round. For instance, your second case
spMerge u [] = ([], u)
should actually behave like
spMerge u v = (a,b)
  where
    a = if null v then [] else ...
    b = if null v then u  else ...
(I don't recommend rewriting spMerge in this fashion, the lazy pattern in mergeSP will do the trick.)
It didn't. Probably rewriting spMerge thus would do, but I think I'm too lazy to find out.
Now, I'm not sure whether this is actually a problem or not. But the version shown here, with two lazy patterns instead of just one, is much closer to how the original People structure behaves.
Then under merging these split pairs form a monoid, so can be rearranged in a tree. If you haven't seen it yet, it uses a different folding structure to your code, with a lower total cost of multiples production (estimated as Sum (1/p)*depth):
tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` pairwise f xs
comps = tfold $ pairwise mergeSP multips
The idea being that each prime produces 1/p composites each "turn" and it takes time depth to trickle it to the top? Sounds reasonable, but remember that a prime p does not start to generate composites until the "turn count" has reached p^2, so it might occupy a place of low depth in the tree that is better spent on the previous primes. But your experiments seem to tell that it works anyway.
By the way, I think that both tfold and pairwise need to be lazier.
tfold f ~(x: ~(y: ~(z:zs))) = (x `f` (y `f` z)) `f` pairwise f zs
pairwise f ~(x: ~(y:ys)) = f x y : pairwise f ys
Otherwise, large parts of the tree might be generated too early and hog memory.
As it turns out, I don't know why, it's the other way round. Each lazy pattern increases memory usage. Not very intuitive.
Also, a small remark on style: I don't like the repetition of mergeSP in
compos = fst $ tfold mergeSP (pairwise mergeSP (multip ps))
Noted and acted upon.
After all, the right hand side is just another tree fold and I would clearly mark it as such:
compos = fst . superSmartTreeFold mergeSP . multip $ ps
superSmartTreeFold f = tfold f . pairwise f
;)
I think I'll have to try out your code (amended with a new folding structure) and see if it has less memory problems maybe. > I assume it is your code on the haskellwiki page. (?)
Yes, I put the code snippet on the wiki. And never tested it on large numbers. ;)
Regards, Heinrich Apfelmus
The below code is now a well-behaved memory citizen (3MB for the 100 millionth prime, about the same as the PQ code). It still is considerably slower than the PQ code. In terms of MUT times as reported by +RTS -sstderr - as well as (MUT+GC) times - (measured once for prime No. 5*10^5, 10^6, 2*10^6, 4*10^6, 10^7 to get a rough tendency), it seems to scale a wee bit better than any of the other tfold versions I created, but a little worse than the PQ versions. The relation of MUT times isn't too bad, but the GC times are rather abysmal (30-40%).

----------------------------------------------------------------------
{-# LANGUAGE ScopedTypeVariables #-}
{-# OPTIONS_GHC -O2 -fno-cse #-}
module VippyPrimes (primes) where

data People a = P { vips :: [a], dorks :: [a] }

{-# SPECIALISE celebrate :: Int -> People Int -> People Int #-}
{-# SPECIALISE celebrate :: Integer -> People Integer -> People Integer #-}
celebrate :: a -> People a -> People a
celebrate x p = P (x:vips p) (dorks p)

{-# SPECIALISE primes :: () -> [Int] #-}
{-# SPECIALISE primes :: () -> [Integer] #-}
primes :: forall a. Integral a => () -> [a]
primes () = 2:3:5:7:11:13:primes'
  where
    primes'   = roll 17 wheel13 `minus` compos primes'''
    primes''  = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''
    primes''' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''

    pmults :: a -> People a
    pmults p = case map (*p) (rollFrom p) of
                 (x:xs) -> P [x] xs

    multip :: [a] -> [People a]
    multip ps = map pmults ps

    compos :: [a] -> [a]
    compos = vips . itfold mergeSP . multip

    rollFrom n = go ((n-17) `rem` 30030) wheel13
      where
        go r xxs@(x:xs)
          | r < x     = roll n xxs
          | otherwise = go (r-x) xs

smartfold f = tfold f . pairwise f

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs

pairwise f (x:y:ys) = f x y : pairwise f ys

{-# SPECIALISE mergeSP :: People Int -> People Int -> People Int #-}
{-# SPECIALISE mergeSP :: People Integer -> People Integer -> People Integer #-}
mergeSP :: Integral a => People a -> People a -> People a
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
  where
    mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
    spMerge l1 [] l3 = P [] (merge l1 l3)
    spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
        LT -> celebrate x (spMerge xs l2 l3)
        EQ -> celebrate x (spMerge xs ys l3)
        GT -> celebrate y (spMerge l1 ys l3)

-- do the specialisations below make a difference at all?
{-# SPECIALISE roll :: Int -> [Int] -> [Int] #-}
{-# SPECIALISE roll :: Integer -> [Integer] -> [Integer] #-}
roll :: Integral a => a -> [a] -> [a]
roll = scanl (+)

{-# SPECIALISE wheel :: [Int] #-}
{-# SPECIALISE wheel :: [Integer] #-}
wheel :: Integral a => [a]
wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
        4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel

{-# SPECIALISE wheel11 :: [Int] #-}
{-# SPECIALISE wheel11 :: [Integer] #-}
wheel11 :: Integral a => [a]
wheel11 = res
  where
    snms  = scanl (+) 11 (take 47 wheel)
    nums  = tail $ scanl (+) 11 (take 529 wheel)
    cops  = nums `minus` map (*11) snms
    diffs = zipWith (-) (tail cops) cops
    res   = foldr (:) res diffs

{-# SPECIALISE wheel13 :: [Int] #-}
{-# SPECIALISE wheel13 :: [Integer] #-}
wheel13 :: Integral a => [a]
wheel13 = res
  where
    snms  = take 480 $ scanl (+) 13 wheel11
    nums  = take (480*13+1) . tail $ scanl (+) 13 wheel11
    cops  = nums `minus` map (*13) snms
    diffs = zipWith (-) (tail cops) cops
    res   = foldr (:) res diffs

{-# SPECIALISE minus :: [Int] -> [Int] -> [Int] #-}
{-# SPECIALISE minus :: [Integer] -> [Integer] -> [Integer] #-}
minus :: Integral a => [a] -> [a] -> [a]
minus a@(x:xs) b@(y:ys) = case compare x y of
    LT -> x: xs `minus` b
    GT -> a `minus` ys
    EQ -> xs `minus` ys
minus a b = a

merge a@(x:xs) b@(y:ys) = case compare x y of
    LT -> x: merge xs b
    EQ -> x: merge xs ys
    GT -> y: merge a ys
merge a b = if null b then a else b
----------------------------------------------------------------------

On Thursday 07 January 2010 17:13:38, Daniel Fischer wrote:
compos :: [a] -> [a]
compos = vips . itfold mergeSP . multip
Sigh! That's what I get for editing the code in the mail editor. I decided to change the really stupid 'itfold' to 'smartfold' and forgot this occurrence.

Daniel Fischer
On Thursday 07 January 2010 11:43:44, Heinrich Apfelmus wrote:
Will Ness wrote:
Hm? In my world view, there is only reduction to normal form and I don't see how "allocate its own storage" fits in there. Okasaki having shown something to that end would be new to me?
Perhaps what was meant was "storage must be allocated for each filter" (which, however, seems trivial).
I still contend that in case of nested filters, where each filter has only one consumer, even that isn't ultimately necessary (a chain of pointers could be formed). That's because there's no "peek"s, only "pull"s there. If the filters are used in merge situation, then there will be some "peek"s so current value must be somehow stored, though it's better to be made explicit and thus static (the place for it, I mean, instead of the "rolling" cons cell). Such implementation technique would prevent some extra gc.
Then under merging these split pairs form a monoid, so can be rearranged in a tree. If you haven't seen it yet, it uses a different folding structure to your code, with a lower total cost of multiples production (estimated as Sum (1/p)*depth):
correction:
tfold f (x:y:z:xs) = (x `f` (y `f` z)) `f` tfold f (pairwise f xs)
comps = tfold mergeSP $ pairwise mergeSP multips
The idea being that each prime produces 1/p composites each "turn" and it takes time depth to trickle it to the top? Sounds reasonable, but remember that a prime p does not start to generate composites until the "turn count" has reached p^2, so it might occupy a place of low depth in the tree that is better spent on the previous primes.
That might be why Daniel's structure is better: it plunges down faster than mine.

"treefold" structure was:
(2+4) + ( (4+8) + ( (8+16) + ( (16+32) + ( (32+64) + ....... ))))
dpths: 3 4 4 5 5 6 6 7 7 8

daniel's:
(2+(4+6)) + ( (8+(10+12)) + ( (14+(16+18)) + ( (20+(22+24)) + .... ))
3 5 5.4 6 7.8 7.9 8 9 9.5 9.6 10.7 10.8
primes () = 2:3:5:7:11:13:primes'
  where
    primes'   = roll 17 wheel13 `minus` compos primes'''
    primes''  = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''
    primes''' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''
Haven't read through the whole thing yet. :) :) I thought there was a typo there. There isn't. BTW using the no-share switch isn't necessary if we just write it down twice with some variations, like

primes''' = let (h,t) = span (< 17^2) $ roll 17 wheel13
            in h ++ t `minus` compos primes''

As we've found out, compilers aren't _that_ smart yet.

Will Ness
That might be why Daniel's structure is better: it plunges down faster than mine.

"treefold" structure was:
(2+4) + ( (4+8) + ( (8+16) + ( (16+32) + ( (32+64) + ....... ))))
dpths: 3 4 4 5 5 6 6 7 7 8

this should of course have been

dpths: 3 4 5 6 7 8 9 10 11 12

daniel's:
(2+(4+6)) + ( (8+(10+12)) + ( (14+(16+18)) + ( (20+(22+24)) + .... ))
3 5 5.4 6 7.8 7.9 8 9 9.5 9.6 10.7 10.8
hmm. :|

On Friday 08 January 2010 18:37:21, Will Ness wrote:
Daniel Fischer
writes: On Thursday 07 January 2010 11:43:44, Heinrich Apfelmus wrote:
Will Ness wrote:
Hm? In my world view, there is only reduction to normal form and I don't see how "allocate its own storage" fits in there. Okasaki having shown something to that end would be new to me?
Perhaps what was meant was "storage must be allocated for each filter" (which, however, seems trivial).
I still contend that in case of nested filters, where each filter has only one consumer, even that isn't ultimately necessary (a chain of pointers could be formed). That's because there's no "peek"s, only "pull"s there. If the filters are used in merge situation, then there will be some "peek"s so current value must be somehow stored, though it's better to be made explicit and thus static (the place for it, I mean, instead of the "rolling" cons cell). Such implementation technique would prevent some extra gc.
Well, for each filter we need to store
1. a pointer to where in the wheel we are
2. the prime by which to multiply the steps of the wheel
3. the current value of the filter (which number to reject next)
I don't see how to use less. It's not much, but it's not zero storage.
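Spelled out as a record (the type and field names are mine, only to make the three items concrete):

-- per-filter state of an imperative on-line sieve, as listed above
data PrimeFilter a = PrimeFilter
  { wheelPos :: [a]  -- 1. where in the (cyclic) wheel of gaps we currently are
  , byPrime  :: a    -- 2. the prime by which the wheel's steps are multiplied
  , nextComp :: a    -- 3. the next composite this filter will reject
  }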

Daniel Fischer
The below code is now a well-behaved memory citizen (3MB for the 100
millionth prime, about the same as the PQ code). It still is considerably slower than the PQ code.
In terms of MUT times as reported by +RTS -sstderr - as well as (MUT+GC) times - (measured once for prime No. 5*10^5, 10^6, 2*10^6, 4*10^6, 10^7 to get a rough tendency), it seems to scale a wee bit better than any of the other tfold versions I created, but a little worse than the PQ versions. The relation of MUT times isn't too bad, but the GC times are rather abysmal (30-40%).
----------------------------------------------------------------------
data People a = P { vips :: [a], dorks :: [a] }

celebrate :: a -> People a -> People a
celebrate x p = P (x:vips p) (dorks p)

primes :: forall a. Integral a => () -> [a]
primes () = 2:3:5:7:11:13:primes'
  where
    primes'   = roll 17 wheel13 `minus` compos primes'''
    primes''  = 17:19:23:29:31:rollFrom 37 `minus` compos primes''
    primes''' = 17:19:23:29:31:37:rollFrom 41 `minus` compos primes''

    pmults :: a -> People a
    pmults p = case map (*p) (rollFrom p) of
                 (x:xs) -> P [x] xs

    multip :: [a] -> [People a]
    multip ps = map pmults ps

    compos :: [a] -> [a]
    compos = vips . smartfold mergeSP . multip

smartfold f = tfold f . pairwise f

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs

pairwise f (x:y:ys) = f x y : pairwise f ys

mergeSP :: Integral a => People a -> People a -> People a
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
  where
    mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
    spMerge l1 [] l3 = P [] (merge l1 l3)
    spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
        LT -> celebrate x (spMerge xs l2 l3)
        EQ -> celebrate x (spMerge xs ys l3)
        GT -> celebrate y (spMerge l1 ys l3)
----------------------------------------------------------------------
Hi Daniel,

Is it so that you went back to my fold structure? Was it better for really big numbers of primes?

I had the following for ages (well, at least two weeks) but I thought it was slower and took more memory (that was _before_ the 'no-share' and 'feeder' stuff). I can see the only difference in that you've re-written spMerge in a tail-recursive style with explicitly deconstructed parts; mine was relying on the compiler (8-L) to de-couple the two pipes and recognize that the second just passes along the final result, unchanged. The two versions seem to me to be _exactly_ operationally equivalent.

All this searching for the code better understood by the compiler is _*very*_ frustrating, as it doesn't reflect on the semantics of the code, or even the operational semantics of the code. :-[

Weren't the P's supposed to disappear completely in the compiled code? Aren't types just _behavioral_ definitions??? Aren't we supposed to be able to substitute equals for equals dammit?? Is this the state of our _best_ Haskell compiler????

module Primes8 where

import Data.Monoid

data (Ord a) => SplitList a = P [a] [a]

instance (Ord a) => Monoid (SplitList a) where
    mempty = P [] []
    -- {x | x::SplitList a} form a monoid under mappend
    mappend (P a b) ~(P c d) = let P bc b' = spMerge b c
                               in P (a ++ bc) (merge b' d)
      where
        spMerge :: (Ord a) => [a] -> [a] -> SplitList a
        spMerge u@(x:xs) w@(y:ys) = case compare x y of
            LT -> P (x:c) d  where (P c d) = spMerge xs w
            EQ -> P (x:c) d  where (P c d) = spMerge xs ys
            GT -> P (y:c) d  where (P c d) = spMerge u  ys
        spMerge u [] = P [] u
        spMerge [] w = P [] w
    mconcat ms = fold mappend (pairwise mappend ms)
      where
        fold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` fold f (pairwise f xs)
        pairwise f (x:y:ys) = f x y : pairwise f ys

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
  where
    primes'   = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    mults     = map (\p-> P [p*p] [p*n | n <- tail $ rollFrom p]) $ primes'
    P comps _ = mconcat mults

On Friday 08 January 2010 19:45:47, Will Ness wrote:
Daniel Fischer
writes:
mergeSP :: Integral a => People a -> People a -> People a
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
  where
    mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
    spMerge l1 [] l3 = P [] (merge l1 l3)
    spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
        LT -> celebrate x (spMerge xs l2 l3)
        EQ -> celebrate x (spMerge xs ys l3)
        GT -> celebrate y (spMerge l1 ys l3)
----------------------------------------------------------------------
Hi Daniel,
Is it so that you went back to my fold structure?
Yes.
Was it better for really big numbers of primes?
Yes, it is slower for <= 20 million primes (and some beyond), but faster for >= 50 million (and some before), according to the few tests I made.
I had the following for ages (well, at least two weeks) but I thought it was slower and took more memory (that was _before_ the 'no-share' and 'feeder' stuff). I can see the only difference in that you've re-written spMerge in a tail-recursive style with explicitly deconstructed parts;
It's not tail-recursive, the recursive call is inside a celebrate.

As it turns out, the important things are

1. a feeder and separate lists of multiples for the feeder and the runner, for the reasons detailed earlier (two-step feeding and larger wheel are pleasing but minor optimisations).

2. a strict pattern in tfold

3. moving the merge inside spMerge

(you can have

mergeSP (a,b) ~(c,d) = (a ++ bc, m)
  where
    (bc,m) = spMerge b c
    spMerge u [] = ([], merge u d)
    ...

without exploding memory, but it's *much* slower than letting spMerge take three arguments)

Now, I have to admit, the only point I _really_ understand is 1. (and why the three-argument spMerge is faster than the two-argument one taking the merge-partner from mergeSP's binding :).

Why has

mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c in (a ++ bc, merge b' d)

a memory leak, but

mergeSP (a,b) ~(c,d) = let (bc,m) = spMerge' b c d in (a ++ bc, m)

not?

Well, looking at the core for mergeSP, the fog clears somewhat. The former is translated roughly to

mergeSP (a,b) pair = let sm = spMerge b (fst pair)
                     in (a ++ fst sm, merge (snd sm) (snd pair))

It holds on to the pair until the result of the merge is demanded, that is until the entire (a ++ fst sm) is consumed. Only then is the pair released and can be collected. On top of that, as soon as a is consumed and (fst sm) [or bc] is demanded, spMerge forces the evaluation of (fst pair) [c]. After a few levels, the evaluated list will take more space than the thunk. It cannot yet be collected, because pair is still alive. The elements have to be duplicated for (fst sm), because they're intertwined with those of b. On the next level of merging, they have to be duplicated again.

The latter is translated roughly to

mergeSP (a,b) pair = let sm = spMerge' b (fst pair) (snd pair)
                     in (a ++ fst sm, snd sm)

The pair is released as soon as the result of the spMerge' is demanded, that is, when a is consumed. Then the elements of (fst pair) need not be duplicated and they can be discarded almost immediately after they have been produced [for small primes, multiples of larger primes live a little longer, but those are fewer].

So, no duplication, shorter lifespan => no leak. Having seen that, the question is, why can't the compiler see that deconstructing the pair when the first component is needed is better? The first component of the pair is used in no other place, so keeping the pair can't have any advantage, can it?

And why does

tfold f (a: ~(b: ~(c:xs))) = ...

leak, but not

tfold f (a:b:c:xs) = ...

?

I guess it's similar.

tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs)

is

tfold f (a:zs) = (a `f` (head zs `f` (head $ tail zs))) `f` tfold f (pairwise f $ drop 2 zs)

the latter part holds on to the beginning of zs, leading again to data duplication and too long lifespans.
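The same retention pattern can be seen in a much smaller setting (my example, not from the thread): the difference between reaching into a pair lazily from each result component and taking it apart once up front.

-- 'keepsPair' builds two thunks that both reference the whole pair p, so p
-- (and with it both of its components) stays alive until *both* thunks are
-- forced; 'dropsPair' deconstructs p once, after which each thunk only
-- references the one component it actually needs
keepsPair, dropsPair :: ([Int],[Int]) -> ([Int],[Int])
keepsPair p     = (map (*2) (fst p), map (*3) (snd p))
dropsPair (a,b) = (map (*2) a      , map (*3) b      )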
mine was relying on the compiler (8-L) to de-couple the two pipes and recognize that the second just passes along the final result, unchanged.
The two versions seem to me to be _exactly_ operationally equivalent.
Well, they're not. The main difference is what we told the compiler _when_ to deconstruct the pairs. You said "at the latest possible moment", I said "as soon as we need the first component". It's not entirely obvious, but it is frequently said that understanding the space (and time) behaviour of lazy evaluation isn't always easy.
All this searching for the code better understood by the compiler is _*very*_ frustrating, as it doesn't reflect on the semantics of the code, or even the operational semantics of the code. :-[
Weren't the P's supposed to disappear completely in the compiled code?
No. Constructors for data types can't disappear completely, only newtype constructors can (and do).
Aren't types just _behavioral_ definitions??? Aren't we supposed to be able to substitute equals for equals dammit??
We can. But substituting equals for equals can alter space and time complexity.

fromInteger 0 == fromInteger (let xs = [1 .. 10^8] in (product xs + sum xs) `mod` 10)
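Another standard illustration of the same point (my example, not from the thread): the two definitions below are equal as functions, but differ wildly in space behaviour when the argument list is produced lazily.

{-# LANGUAGE BangPatterns #-}

mean1, mean2 :: [Double] -> Double
-- two traversals: the whole of xs stays alive until length is done
mean1 xs = sum xs / fromIntegral (length xs)
-- one traversal with strict accumulators: constant space
mean2 xs = go 0 0 xs
  where
    go :: Double -> Int -> [Double] -> Double
    go !s !n []     = s / fromIntegral n
    go !s !n (y:ys) = go (s+y) (n+1) ys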
Is this the state of our _best_ Haskell compiler????
Yes. It's still a "do what I tell you to" compiler, even if a pretty slick one, not a "do what I mean" compiler. Sometimes, what you tell the compiler isn't what you wanted. It's easier to predict when you give detailed step by step instructions.

Daniel Fischer
On Friday 08 January 2010 19:45:47, Will Ness wrote:
Daniel Fischer
writes: It's not tail-recursive, the recursive call is inside a celebrate.
It is (spMerge that is). It calls tail-recursive celebrate in a tail position. What you've done, is to eliminate the outstanding context, by moving it inward. Your detailed explanation is more clear than that. :)

BTW when I run VIP code it is consistently slower than using just pairs, modified with wheel and feeder and all. So what's needed is to re-implement your approach for pairs:

mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
                       in (a ++ bc, bd)
  where
    spMerge u [] d = ([], merge u d)
    spMerge u@(x:xs) w@(y:ys) d = case compare x y of
        LT -> consSP x $ spMerge xs w  d
        EQ -> consSP x $ spMerge xs ys d
        GT -> consSP y $ spMerge u  ys d

consSP x ~(a,b) = (x:a,b)    -- don't forget that magic `~` !!!

BTW I'm able to eliminate sharing without a compiler switch by using

mtwprimes () = 2:3:5:7:primes
  where primes = doPrimes 121 primes

doPrimes n prs = let (h,t) = span (< n) $ rollFrom 11
                 in h ++ t `diff` comps prs
doPrimes2 n prs = let (h,t) = span (< n) $ rollFrom (12-1)
                  in h ++ t `diff` comps prs

mtw2primes () = 2:3:5:7:primes
  where primes  = doPrimes 26 primes2
        primes2 = doPrimes2 121 primes2

Using 'splitAt 26' in place of 'span (< 121)' didn't work though.

How about them wheels? :)
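A tiny check (mine, not from the thread) of what that magic `~` buys: with the lazy pattern, the cons can be handed out before the recursive spMerge result is ever looked at, which is what lets the merged output be consumed incrementally.

consSPlazy, consSPstrict :: a -> ([a],[a]) -> ([a],[a])
consSPlazy   x ~(a,b) = (x:a, b)   -- yields the pair (and x) immediately
consSPstrict x  (a,b) = (x:a, b)   -- must evaluate the argument pair first

-- take 1 (fst (consSPlazy   1 undefined))  ==  [1]
-- take 1 (fst (consSPstrict 1 undefined))  blows up on the undefined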
Yes. It's still a "do what I tell you to" compiler, even if a pretty slick one, not a "do what I mean" compiler. Sometimes, what you tell the compiler isn't what you wanted. It's easier to predict when you give detailed step by step instructions.

On Saturday 09 January 2010 08:04:20, Will Ness wrote:
Daniel Fischer
writes: On Friday 08 January 2010 19:45:47, Will Ness wrote:
Daniel Fischer
writes: It's not tail-recursive, the recursive call is inside a celebrate.
It is (spMerge that is).
No. "In computer science, tail recursion (or tail-end recursion) is a special case of recursion in which the last operation of the function, the tail call, is a recursive call." The last operation of spMerge is a call to celebrate or the pair constructor (be that P or (,)). Doesn't matter, though, as for lazy languages, tail recursion isn't very important.
It calls tail-recursive celebrate in a tail position. What you've done, is to eliminate the outstanding context, by moving it inward. Your detailed explanation is more clear than that. :)
BTW when I run VIP code it is consistently slower than using just pairs,
I can't reproduce that. Ceteris paribus, I get the exact same allocation and GC figures whether I use People or (,), running times identical enough (difference between People and (,) is smaller than the difference between runs of the same; the difference between the fastest and the slowest run of the two is less than 0.5%). I think it must be the other changes you made.
modified with wheel and feeder and all. So what's needed is to re-implement your approach for pairs:
mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
                       in (a ++ bc, bd)
  where
    spMerge u [] d = ([], merge u d)
    spMerge u@(x:xs) w@(y:ys) d = case compare x y of
        LT -> consSP x $ spMerge xs w  d
        EQ -> consSP x $ spMerge xs ys d
        GT -> consSP y $ spMerge u  ys d
consSP x ~(a,b) = (x:a,b) -- don't forget that magic `~` !!!
I called that (<:).
BTW I'm able to eliminate sharing without a compiler switch by using
Yes, I can too. But it's easy to make a false step and trigger sharing.

I can get a nice speedup (~15%, mostly due to much less garbage collecting) by doing the final merge in a function without unnecessarily wrapping the result in a pair (whose second component is ignored):

-- Doesn't need -fno-cse anymore,
-- but it needs -XScopedTypeVariables for the local type signatures
primes :: forall a. Integral a => () -> [a]
primes () = 2:3:5:7:11:13:calcPrimes 17 primes''
  where
    calcPrimes s cs = rollFrom s `minus` compos cs
    bootstrap = 17:19:23:29:31:37:calcPrimes 41 bootstrap
    primes'   = calcPrimes 17 bootstrap
    primes''  = calcPrimes 17 primes'

    pmults :: a -> ([a],[a])
    pmults p = case map (*p) (rollFrom p) of
                 (x:xs) -> ([x],xs)

    multip :: [a] -> [([a],[a])]
    multip ps = map pmults ps

    compos :: [a] -> [a]
    compos ps = case pairwise mergeSP (multip ps) of
                  ((a,b):cs) -> a ++ funMerge b (pairwise mergeSP cs)

    funMerge b (x:y:zs) = let (c,d) = mergeSP x y
                          in mfun b c d (pairwise mergeSP zs)

    mfun u@(x:xs) w@(y:ys) d l = case compare x y of
        LT -> x:mfun xs w  d l
        EQ -> x:mfun xs ys d l
        GT -> y:mfun u  ys d l
    mfun u [] d l = funMerge (merge u d) l

This uses a different folding structure again, which seems to give slightly better performance than the original tree-fold structure. In contrast to the VippyPrimes, it profits much from a larger allocation area, running with +RTS -A2M gives a >10% speedup for prime # 10M/20M, +RTS -A8M nearly 20%. -A16M and -A32M buy a little more, but in that range at least, it's not much (may be significant for larger targets). Still way slower than PQ, but the gap is narrowing.
mtwprimes () = 2:3:5:7:primes
  where primes = doPrimes 121 primes

doPrimes n prs = let (h,t) = span (< n) $ rollFrom 11
                 in h ++ t `diff` comps prs
doPrimes2 n prs = let (h,t) = span (< n) $ rollFrom (12-1)
                  in h ++ t `diff` comps prs

mtw2primes () = 2:3:5:7:primes
  where primes  = doPrimes 26 primes2
        primes2 = doPrimes2 121 primes2
Using 'splitAt 26' in place of 'span (< 121)' didn't work though.
How about them wheels? :)
Well, what about them?

Daniel Fischer
On Saturday 09 January 2010 08:04:20, Will Ness wrote:
Daniel Fischer
writes: On Friday 08 January 2010 19:45:47, Will Ness wrote:
Daniel Fischer
writes: It's not tail-recursive, the recursive call is inside a celebrate.
It is (spMerge that is).
No. "In computer science, tail recursion (or tail-end recursion) is a special case of recursion in which the last operation of the function, the tail call, is a recursive call."
As far as I understand it, when a function makes a tail call to a tail recursive function (be it itself or some other function) it is itself tail recursive. I.e. that call may be replaced with a direct jump, with no new context to be created. That is what your version accomplishes, too. Mine really held on to that pair ~(c,d) when it wanted to call (merge _ d) _after_ the call to spMerge. By passing the pre-decomposed part of a pair, 'd', into the process environment of spMerge, you've made it tail recursive - it carried along all the context it needed. That's what eliminated the space leak, so I'd say tail recursion does play a role under lazy evaluation - when a compiler isn't smart enough to do _that_ for us by itself. _Were_ it reliably smart, even non-recursive functions like my initial variant would work just as well.
The last operation of spMerge is a call to celebrate or the pair constructor (be that P or (,)). Doesn't matter, though, as for lazy languages, tail recursion isn't very important.
It calls tail-recursive celebrate in a tail position. What you've done, is to eliminate the outstanding context, by moving it inward. Your detailed explanation is more clear than that. :)
BTW when I run VIP code it is consistently slower than using just pairs,
I can't reproduce that. Ceteris paribus, I get the exact same allocation and GC figures whether I use People or (,), running times identical enough (difference between People and (,) is smaller than the difference between runs of the same; the difference between the fastest and the slowest run of the two is less than 0.5%). I think it must be the other changes you made.
I just take the VIP code as it is on a web page, and my initial variant without the wheel, and compare. Then I add the wheel in the same fashion, and then the feeder, and compare again. When I tested that Monoid instance code I too compared it to the straight pairs, and it was slower. Don't know why.
modified with wheel and feeder and all. So what's needed is to re-implement your approach for pairs:
mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
                       in (a ++ bc, bd)
  where
    spMerge b [] d = ([], merge b d)
    spMerge b@(x:xs) c@(y:ys) d = case compare x y of
        LT -> consSP x $ spMerge xs c  d
        EQ -> consSP x $ spMerge xs ys d
        GT -> consSP y $ spMerge b  ys d
consSP x ~(bc,bd) = (x:bc,bd) -- don't forget that magic `~` !!!
I called that (<:).
BTW I'm able to eliminate sharing without a compiler switch by using
Yes, I can too. But it's easy to make a false step and trigger sharing.
yes indeed. It seems unpredictable. Fortunately, GHC couldn't tell that (12-1) was 11, by the looks of it. :) Your idea certainly seems right, that there ought to be some control over sharing on a per-function basis somehow, without these ridiculous code tricks.
I can get a nice speedup (~15%, mostly due to much less garbage collecting) by doing the final merge in a function without unnecessarily wrapping the result in a pair
Will have to wrap my head around _that_. But that would be fighting with the compiler again. I don't like that, I much rather fight with the problem at hand. There shouldn't be any pairs in the compiled code in the first place. They just guide the staging of (++) and (merge) intertwined between the producer streams. At each finite step, when the second part of a pair comes into play, it is only after its first part was completely consumed. I guess the next thing to try would be to actually create a data type MergeNode and arrange _those_ in a tree and see if that helps. That would be the next half-step towards the PQ itself.
This uses a different folding structure again,
which I am yet to decipher. :)
How about them wheels? :)
Well, what about them?
I dunno, it makes for a real easy wheel derivation, and coming out of our discussion of euler's sieve it's a nice cross-pollination. :) Having yet another list representation suddenly cleared up the whole issue (two of them). I'll repost it one last time as I've made some corrections to it:
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
primes = euler [2..]
primes = euler $ listFrom [2] 1
       = 2:euler (listFrom [3] 1 `minus` map (2*) (listFrom [2] 1))
                 -- listFrom [3,4] 2 `minus` listFrom [4] 2   ==  listFrom [3] 2
       = 2:3:euler (listFrom [5] 2 `minus` map (3*) (listFrom [3] 2))
                 -- listFrom [5,7,9] 6 `minus` listFrom [9] 6  ==  listFrom [5,7] 6
listFrom xs by = concat $ iterate (map (+ by)) xs

rolls = unfoldr (Just . nextRoll) ([2],1)

nextRoll r@(xs@(p:xt),b) = ( (p,r') , r')
  where
    r'  = (xs', p*b)
    xs' = (concat $ take p $ iterate (map (+ b)) $ xt ++ [p+b])
            `minus` map (p*) xs

nthWheel n = let (ps,rs)  = unzip $ take n rolls
                 (x:xs,b) = last rs
             in ((ps, x), zipWith (-) (xs++[x+b]) (x:xs))

eulerPrimes n = let (ps,rs)      = unzip $ take n rolls
                    (qs@(q:_),b) = last rs
                in ps ++ takeWhile (< q^2) qs

(BTW I've noticed that when I reply in Gmane, all the text below double hyphen, if present in the post to which I'm replying, just disappears. This may be an artefact of some specific browser.)

Daniel Fischer
On Friday 08 January 2010 19:45:47, Will Ness wrote:
Daniel Fischer
writes:

mergeSP :: Integral a => People a -> People a -> People a
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
  where
    mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
    spMerge l1 [] l3 = P [] (merge l1 l3)
    spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
        LT -> celebrate x (spMerge xs l2 l3)
        EQ -> celebrate x (spMerge xs ys l3)
        GT -> celebrate y (spMerge l1 ys l3)
----------------------------------------------------------------------
Actually, the minimal edit that does the trick (of eliminating the space leak that you've identified) for my original code is just

mergeSP (a,b) ~(c,d) = let (bc,bd) = spMerge b c d
                       in (a ++ bc, bd)
  where
    spMerge b [] d = ([], merge b d)
    spMerge b@(x:xs) c@(y:ys) d = case compare x y of
        LT -> (x:u,v)  where (u,v) = spMerge xs c  d
        EQ -> (x:u,v)  where (u,v) = spMerge xs ys d
        GT -> (y:u,v)  where (u,v) = spMerge b  ys d
    spMerge [] c d = ([], merge c d)

which hardly looks at all different at the first glance. Just for reference, it was

{-
mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c
                       in (a ++ bc, merge b' d)
  where
    spMerge b@(x:xs) c@(y:ys) = case compare x y of
        LT -> (x:u,v)  where (u,v) = spMerge xs c
        EQ -> (x:u,v)  where (u,v) = spMerge xs ys
        GT -> (y:u,v)  where (u,v) = spMerge b  ys
    spMerge b [] = ([], b)
    spMerge [] c = ([], c)
-}

spMerge of course is not tail recursive here in both versions, if seen through imperative eyes. But lazy evaluation makes it effectively so. The important thing is, when the final point is reached, there's no outstanding context - everything is present. There should be a name for such a concept. This is very similar to late instantiation in Prolog (programming with "holes"), and I think this *would* pass as a tail-recursive function /there/.

Even in the new code the compiler could've internally held on to the original pair and only deconstructed the 'd' out of it at the final call to merge, recreating the space leak. It could just as well have recognized that 'd' isn't changed inside spMerge (we're pure in Haskell after all) and deconstructed the pair in the original code. Something is missing here.
As it turns out, the important things are
1. a feeder and separate lists of multiples for the feeder and the runner, for the reasons detailed earlier (two-step feeding and larger wheel are pleasing but minor optimisations).
2. a strict pattern in tfold
3. moving the merge inside spMerge
Is this the state of our _best_ Haskell compiler????
Yes. It's still a "do what I tell you to" compiler, even if a pretty slick one, not a "do what I mean" compiler. Sometimes, what you tell the compiler isn't what you wanted. It's easier to predict when you give detailed step by step instructions.

Daniel Fischer wrote:
Why has
mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c in (a ++ bc, merge b' d)
a memory leak, but
mergeSP (a,b) ~(c,d) = let (bc,m) = spMerge' b c d in (a ++ bc, m)
not?
Well, looking at the core for mergeSP, the fog clears somewhat. The former is translated roughly to
mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair))
It holds on to the pair until the result of the merge is demanded, that is until the entire (a ++ fst sm) is consumed. Only then is the pair released and can be collected. On top of that, as soon as a is consumed and (fst sm) [or bc] is demanded, spMerge forces the evaluation of (fst pair) [c]. After a few levels, the evaluated list will take more space than the thunk. It cannot yet be collected, because pair is still alive. The elements have to be duplicated for (fst sm), because they're intertwined with those of b. On the next level of merging, they have to be duplicated again.
The latter is translated roughly to
mergeSP (a,b) pair = let sm = spMerge' b (fst pair) (snd pair) in (a ++ fst sm, snd sm)
The pair is released as soon as the result of the spMerge' is demanded, that is, when a is consumed. Then the elements of (fst pair) need not be duplicated and they can be discarded almost immediately after they have been produced [for small primes, multiples of larger primes live a little longer, but those are fewer].
So, no duplication, shorter lifespan => no leak. Having seen that, the question is, why can't the compiler see that deconstructing the pair when the first component is needed is better? The first component of the pair is used in no other place, so keeping the pair can't have any advantage, can it?
Tricky stuff. It is known that pairs/records are prone to unwanted retention, see for example the recent thread

http://thread.gmane.org/gmane.comp.lang.haskell.cafe/66903/focus=67871

or

Jan Sparud. Fixing some space leaks without a garbage collector.
http://citeseer.ist.psu.edu/240848.html

It is exactly because of these troubles that I'm advocating the original VIP data structure that buries the dorks (that name is awesome :D) deep inside the structure. :)

In any case, it appears to me that the lazy pattern match in mergeSP is actually unnecessary! This is because mergeSP returns a pair constructor immediately, so infinite nesting works even when the second argument is demanded. In other words,

mergeSP :: Integral a => People a -> People a -> People a
mergeSP (P a b) (P c d) = P (a ++ bc) (merge b' d)
  where
    P bc b' = spMerge b c
    spMerge [] ys = P [] ys
    spMerge xs [] = P [] xs
    spMerge xs@(x:xt) ys@(y:yt) = case compare x y of
        LT -> celebrate x (spMerge xt ys)
        EQ -> celebrate x (spMerge xt yt)
        GT -> celebrate y (spMerge xs yt)

should work fine and do away with the memory issue because the pair is now deconstructed immediately.

Hm, a strict second argument might however mean that the tree produced by tfold is demanded too early.
And why does
tfold f (a: ~(b: ~(c:xs))) = ...
leak, but not
tfold f (a:b:c:xs) = ...
?
I guess it's similar.
tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs)
is
tfold f (a:zs) = (a `f` (head zs `f` (head $ tail zs))) `f` tfold f (pairwise f $ drop 2 zs)
the latter part holds on to the beginning of zs, leading again to data duplication and too long lifespans.
Same issue with retaining pairs while only the components are needed. Ideally, we ought to be able to "couple" the two components, i.e. the idea is that tail zs is reduced to a pointer to the tail whenever head zs is demanded and vice versa. I think that's what Sparud does in his paper; not sure how well it's implemented in GHC, I vaguely remember a bug report. Alternatively, retaining zs in any way might already be too much. Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

On Tuesday 12 January 2010 11:30:07, Heinrich Apfelmus wrote:
Tricky stuff. It is known that pairs/records are prone to unwanted retention, see for example the recent thread
http://thread.gmane.org/gmane.comp.lang.haskell.cafe/66903/focus=67871
or
Jan Sparud. Fixing some space leaks without a garbage collector. http://citeseer.ist.psu.edu/240848.html
"CiteSeerx is momentarily busy, Please try us again, in a few moments. We apologize for any inconvenience." :( tried and re-tried. Took
http://bert.md.chalmers.se/pub/cs-reports/papers/sparud/spaceleak-fix.ps.gz
finally
It is exactly because of these troubles that I'm advocating the original VIP data structure that buries the dorks (that name is awesome :D) deep inside the structure. :)
In any case, it appears to me that the lazy pattern match in mergeSP is actually unnecessary! This is because mergeSP returns a pair constructor immediately, so infinite nesting works even when the second argument is demanded. In other words,
mergeSP :: Integral a => People a -> People a -> People a
mergeSP (P a b) (P c d) = P (a ++ bc) (merge b' d)
  where
    P bc b' = spMerge b c
    spMerge [] ys = P [] ys
    spMerge xs [] = P [] xs
    spMerge xs@(x:xt) ys@(y:yt) = case compare x y of
        LT -> celebrate x (spMerge xt ys)
        EQ -> celebrate x (spMerge xt yt)
        GT -> celebrate y (spMerge xs yt)
should work fine and do away with the memory issue because the pair is now deconstructed immediately.
No, <<loop>>. For the reason you stated below. In

tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs

, it must compute smartfold f xs too early to match it against P c d. The compiler can't see that smartfold mergeSP always returns a P [well, it might return _|_, mightn't it?].
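The same effect in miniature (my example, not from the thread): once a producer is defined in terms of its own output, a function that pattern matches its argument before emitting a constructor diverges, while a lazy pattern (or returning the constructor first) ties the knot just fine.

data P' a = P' a [a]

strictCons, lazyCons :: a -> P' a -> P' a
strictCons x (P' y ys)  = P' x (y:ys)   -- must see the argument's constructor first
lazyCons   x ~(P' y ys) = P' x (y:ys)   -- yields a constructor immediately

bad, good :: P' Int
bad  = strictCons 1 bad    -- <<loop>>
good = lazyCons   1 good   -- P' 1 [1,1,1,...]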
Hm, a strict second argument might however mean that the tree produced by tfold is demanded too early.
Regards, Heinrich Apfelmus

Daniel Fischer wrote:
Heinrich Apfelmus wrote:
It is exactly because of these troubles that I'm advocating the original VIP data structure that buries the dorks (that name is awesome :D) deep inside the structure. :)
In fact, your transformation that fixes the space leaks pretty much emulates the behavior of the old

data People a = VIP a (People a) | Dorks [a]

structure. This becomes apparent if we put the last two arguments of spMerge back into a pair:

mergeSP (P a b) ~(P c d) = let P bc m = spMerge b (P c d)
                           in P (a ++ bc) m
  where
    spMerge b (P [] d) = P [] (merge b d)
    spMerge xs@(x:xt) cd@(P (y:yt) d) = case compare x y of
        LT -> celebrate x (spMerge xt cd      )
        EQ -> celebrate x (spMerge xt (P yt d))
        GT -> celebrate y (spMerge xs (P yt d))

In particular, forcing dorks (mergeSP ...) also forces the complete VIP list.

I wonder whether it's really the liveness of pair in

mergeSP (a,b) pair = let sm = spMerge b (fst pair)
                     in (a ++ fst sm, merge (snd sm) (snd pair))

that is responsible for the space leak, for chances are that Sparud's technique applies and pair is properly disposed of. Rather, it could be that we need the stronger property that forcing the second component will evaluate the first to NF.
In any case, it appears to me that the lazy pattern match in mergeSP is actually unnecessary! This is because mergeSP returns a pair constructor immediately, so infinite nesting works even when the second argument is demanded. [...]
No, <<loop>>. For the reason you stated below. In
tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` smartfold f xs
, it must compute smartfold f xs too early to match it against P c d. The compiler can't see that smartfold mergeSP always returns a P [well, it might return _|_, mightn't it?].
Oops, silly me! Mea culpa, of course mergeSP a (mergeSP b (mergeSP c ..))) or any infinite nesting is not going to work with a strict pattern. >_> <_< Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
I wonder whether it's really the liveness of pair in
mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair))
that is responsible for the space leak, for chances are that Sparud's technique applies and pair is properly disposed of. Rather, it could be that we need the stronger property that forcing the second component will evaluate the first to NF.
I think that is responsible. At least that's how I understand the core:

mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d)
  where
    (bc, b') = spMerge b c
    spMerge ...

----------------------------------------------------------------------
OldMerge.$wmergeSP
  :: [GHC.Types.Int] -> [GHC.Types.Int]
     -> ([GHC.Types.Int], [GHC.Types.Int])
     -> (# [GHC.Types.Int], [GHC.Types.Int] #)
GblId
[Arity 3
 Str: DmdType LLL]
OldMerge.$wmergeSP =
  \ (ww_sny :: [GHC.Types.Int])
    (ww1_snz :: [GHC.Types.Int])
    (w_snB :: ([GHC.Types.Int], [GHC.Types.Int])) ->
    let {
      ds_so7 [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int])
      LclId
      [Str: DmdType]
      ds_so7 =
        case w_snB of _ { (c_adj, _) ->
        case OldMerge.$wspMerge ww1_snz c_adj of _ { (# ww3_snH, ww4_snI #) ->
        (ww3_snH, ww4_snI)
        } } } in
    (# GHC.Base.++ @ GHC.Types.Int
         ww_sny (case ds_so7 of _ { (bc_ajQ, _) -> bc_ajQ }),
       case ds_so7 of _ { (_, b'_ajS) ->
       case w_snB of _ { (_, d_adk) ->
       OldMerge.merge b'_ajS d_adk } }
       -- Here, in the second component of the result,
       -- we reference the entire pair to get the dorks
    #)

OldMerge.mergeSP
  :: ([GHC.Types.Int], [GHC.Types.Int])
     -> ([GHC.Types.Int], [GHC.Types.Int])
     -> ([GHC.Types.Int], [GHC.Types.Int])
GblId
[Arity 2
 Worker OldMerge.$wmergeSP
 Str: DmdType U(LL)Lm]
OldMerge.mergeSP =
  __inline_me (\ (w_snw :: ([GHC.Types.Int], [GHC.Types.Int]))
                 (w1_snB :: ([GHC.Types.Int], [GHC.Types.Int])) ->
                 case w_snw of _ { (ww_sny, ww1_snz) ->
                 case OldMerge.$wmergeSP ww_sny ww1_snz w1_snB
                 of _ { (# ww3_snN, ww4_snO #) ->
                 (ww3_snN, ww4_snO)
                 } })
----------------------------------------------------------------------

vs.

mergeSP (a,b) ~(c,d) = (a ++ bc, m)
  where
    (bc,m) = spMerge b c d
    spMerge ...

----------------------------------------------------------------------
NewMerge.$wmergeSP
  :: [GHC.Types.Int] -> [GHC.Types.Int]
     -> ([GHC.Types.Int], [GHC.Types.Int])
     -> (# [GHC.Types.Int], [GHC.Types.Int] #)
GblId
[Arity 3
 Str: DmdType LLL]
NewMerge.$wmergeSP =
  \ (ww_snB :: [GHC.Types.Int])
    (ww1_snC :: [GHC.Types.Int])
    (w_snE :: ([GHC.Types.Int], [GHC.Types.Int])) ->
    let {
      ds_soa [ALWAYS Just D(SS)] :: ([GHC.Types.Int], [GHC.Types.Int])
      LclId
      [Str: DmdType]
      ds_soa =
        case w_snE of _ { (c_adj, d_adk) ->
        -- There's no reference to the pair after this
        case NewMerge.$wspMerge ww1_snC c_adj d_adk
        of _ { (# ww3_snK, ww4_snL #) ->
        (ww3_snK, ww4_snL)
        } } } in
    (# GHC.Base.++ @ GHC.Types.Int
         ww_snB (case ds_soa of _ { (bc_ajT, _) -> bc_ajT }),
       case ds_soa of _ { (_, b'_ajV) -> b'_ajV } #)

NewMerge.mergeSP
  :: ([GHC.Types.Int], [GHC.Types.Int])
     -> ([GHC.Types.Int], [GHC.Types.Int])
     -> ([GHC.Types.Int], [GHC.Types.Int])
GblId
[Arity 2
 Worker NewMerge.$wmergeSP
 Str: DmdType U(LL)Lm]
NewMerge.mergeSP =
  __inline_me (\ (w_snz :: ([GHC.Types.Int], [GHC.Types.Int]))
                 (w1_snE :: ([GHC.Types.Int], [GHC.Types.Int])) ->
                 case w_snz of _ { (ww_snB, ww1_snC) ->
                 case NewMerge.$wmergeSP ww_snB ww1_snC w1_snE
                 of _ { (# ww3_snQ, ww4_snR #) ->
                 (ww3_snQ, ww4_snR)
                 } })
----------------------------------------------------------------------

Daniel Fischer
On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
I wonder whether it's really the liveness of pair in
mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair))
that is responsible for the space leak, for chances are that Sparud's technique applies and pair is properly disposed of. Rather, it could be that we need the stronger property that forcing the second component will evaluate the first to NF.
I think that is responsible. At least that's how I understand the core:
mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) where (bc, b') = spMerge b c spMerge ...
That is equivalent to

first (a++) . second (`merge`d) $ spMerge b c

and Daniel's fix is equivalent to

first (a++) $ spMerge b c d

Now, when the compiler sees the first variant, it probably treats spMerge as opaque. I.e. although in reality spMerge only contributes to the first "channel" while it is progressively instantiated, and (`merge`d) will only be called upon when spMerge's final clause is reached, that is (most likely) not known to the compiler at this stage. Looking at just the first expression itself, it has to assume that spMerge may contribute to both channels (parts of a pair) while working, and so can't know _when_ /d/ will get called upon to contribute to the data, as it is finally consumed at access. So /d/ is gotten hold of prematurely, _before_ going into spMerge.

The second variant passes the responsibility for actually accessing its inputs to spMerge itself, and _it_ is clear about needing /d/ only at the very end.

Just a theory. :) Does that make sense?

On Thursday 14 January 2010 08:25:48, Will Ness wrote:
Daniel Fischer
writes: On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
I wonder whether it's really the liveness of pair in
mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair))
that is responsible for the space leak, for chances are that Sparud's technique applies and pair is properly disposed of. Rather, it could be that we need the stronger property that forcing the second component will evaluate the first to NF.
I think that is responsible. At least that's how I understand the core:
mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) where (bc, b') = spMerge b c spMerge ...
That is equivalent to
first (a++) . second (`merge`d) $ spMerge b c
and Daniel's fix is equivalent to
first (a++) $ spMerge b c d
Now, when compiler sees the first variant, it probably treats spMerge as opaque. I.e. although in reality spMerge only contributes to the first "channel" while it is progressively instantiated, and (`merge`d) will only be called upon when spMerge's final clause is reached, that is (most likely) not known to the compiler at this stage. When looking at just the first expression itself, it has to assume that spMerge may contribute to both channels (parts of a pair) while working, and so can't know _when_ /d/ will get called upon to contribute to the data, as it is consumed finally at access.
So /d/ is gotten hold of prematurely, _before_ going into spMerge.
No, the problem is that d itself is gotten hold of too late, it is accessed only indirectly via the pair, so we keep an unnecessary reference to c via d.
The second variant passes the responsibility for actually accessing its inputs to spMerge itself, and _it_ is clear about needing /d/ only in the very end.
The second variant is clear about not needing the _pair_ anymore once spMerge is entered. Thus d doesn't reference c anymore.
Just a theory. :)
Does that make sense?

Daniel Fischer
On Thursday 14 January 2010 08:25:48, Will Ness wrote:
Daniel Fischer
writes: On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
I wonder whether it's really the liveness of pair in
mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair))
that is responsible for the space leak, ...
I think that is responsible. At least that's how I understand the core:
mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) where (bc, b') = spMerge b c spMerge ...
That is equivalent to
first (a++) . second (`merge`d) $ spMerge b c
and Daniel's fix is equivalent to
first (a++) $ spMerge b c d
That should've been

mergeSP (a,b) p = first(a++) . second(`merge`snd p) $ spMerge b (fst p)

and

mergeSP (a,b) p = first(a++) $ spMerge b (fst p) (snd p)

The code fragments you've posted are essentially

mergeSP (a,b) p = let res = case p of (c,_) ->
                              case spMerge b c of (# x,y #) -> (x,y)
                  in (# (++) a (case res of (bc,_) -> bc)
                      , case res of (_,b') ->
                          case p of (_,d) -> merge b' d #)

and

mergeSP (a,b) p = let res = case p of (c,d) ->
                              case spMerge b c d of (# x,y #) -> (x,y)
                  in (# (++) a (case res of (bc,_) -> bc)
                      , case res of (_,b') -> b' #)

This looks like Haskell to me, with many details explicitly written out, probably serving as immediate input to the compiler - not its output. So it can't tell us much about how this is actually implemented on the lower level. (?)

Your theory would certainly hold if the translation was done one-to-one without any further code rearrangements. But it could've been further re-arranged by the compiler at some later stage (is there one?) into an equivalent of, e.g.

mergeSP (a,b) p = let (bc,b') = case p of (c,_) ->
                                  case spMerge b c of (x,y) -> (x,y)
                  in (# (++) a bc
                      , case p of (_,d) -> merge b' d #)

and further,

mergeSP (a,b) p = let (c,d)   = case p of (x,y) -> (x,y)
                      (bc,b') = case spMerge b c of (x,y) -> (x,y)
                  in (# (++) a bc
                      , merge b' d #)

could it? This would take hold on /d/ and /c/ at the same time, right?

What is that code that you've shown exactly? At which stage is it produced and is it subject to any further manipulation? I apologise if these are obvious questions, I don't know anything about GHC. I also don't know what (# x,y #) means?

One thing seems certain - we should not hold explicit references to the same entities in different parts of our code, to avoid space leaks with more confidence. To make code look as much tail-recursive as possible, so to speak. Does that make sense?

Anyway that was a very educational (for me) and fruitful discussion, and I greatly appreciate your help, and fixing and improving of the code. Thanks!

On Saturday 16 January 2010 18:53:33, Will Ness wrote:
Daniel Fischer
writes: On Thursday 14 January 2010 08:25:48, Will Ness wrote:
Daniel Fischer
writes: On Wednesday 13 January 2010 10:43:42, Heinrich Apfelmus wrote:
I wonder whether it's really the liveness of pair in
mergeSP (a,b) pair = let sm = spMerge b (fst pair) in (a ++ fst sm, merge (snd sm) (snd pair))
that is responsible for the space leak, ...
I think that is responsible. At least that's how I understand the core:
mergeSP (a,b) ~(c,d) = (a ++ bc, merge b' d) where (bc, b') = spMerge b c spMerge ...
That is equivalent to
first (a++) . second (`merge`d) $ spMerge b c
and Daniel's fix is equivalent to
first (a++) $ spMerge b c d
That should've been
mergeSP (a,b) p = first(a++) . second(`merge`snd p) $ spMerge b (fst p)
and
mergeSP (a,b) p = first(a++) $ spMerge b (fst p) (snd p)
The code fragments you've posted are essentially
mergeSP (a,b) p = let res = case p of (c,_) ->
                              case spMerge b c of (# x,y #) -> (x,y)
                  in (# (++) a (case res of (bc,_) -> bc)
                      , case res of (_,b') ->
                          case p of (_,d) -> merge b' d #)
and
mergeSP (a,b) p = let res = case p of (c,d) ->
                              case spMerge b c d of (# x,y #) -> (x,y)
                  in (# (++) a (case res of (bc,_) -> bc)
                      , case res of (_,b') -> b' #)
This looks like Haskell to me, with many details explicitly written out, probably serving as immediate input to the compiler - not its output. So it can't tell us much about how this is actually implemented on the lower level. (?)
It's 'core', the intermediate language GHC uses and does its transformations upon. It's more or less a slimmed-down version of Haskell. That came from -ddump-simpl, thus it's the output of the simplifier, after numerous transformation/optimisation passes. I think that is then fairly directly translated to assembly (via Cmm); the STG-to-STG and Cmm passes do not much optimisation anymore (I may be wrong, what do I know about the compiler. However, I've found that the -ddump-simpl output corresponds well to the actual behaviour whenever I look.). I find that much more readable than the -fext-core output (http://www.haskell.org/ghc/docs/6.12.1/html/users_guide/ext-core.html "GHC can dump its optimized intermediate code (said to be in “Core” format) to a file as a side-effect of compilation."), which says the same, only more verbose and less readable. Of course, if I could read assembly, that would reveal exactly what happens.
Your theory would certainly hold if the translation was done one-to-one without any further code rearrangements. But it could've been further re-arranged by the compiler at some later stage (is there one?) into an equivalent of, e.g.
mergeSP (a,b) p = let (bc,b') = case p of (c,_) ->
                                  case spMerge b c of (x,y) -> (x,y)
                  in (# (++) a bc
                      , case p of (_,d) -> merge b' d #)
and further,
mergeSP (a,b) p = let (c,d)   = case p of (x,y) -> (x,y)
                      (bc,b') = case spMerge b c of (x,y) -> (x,y)
                  in (# (++) a bc
                      , merge b' d #)
could it? This would take hold on /d/ and /c/ at the same time, right?
It would. But AFAICT, after the simplifier is done, no such rearrangements occur anymore.
What is that code that you've shown exactly? At which stage is it produced and is it subject to any further manipulation?
I'm no GHC expert either, so I don't know what it does when exactly. But after parsing and desugaring, there come a few iterations of the simplifier, intermingled with specialising, demand-analysis, CSE, let-floating, worker-wrapper-split, ... . At the end of all that, the Tidy Core is generated (part of which I posted). What happens from then on, well ...
I apologise if these are obvious questions, I don't know anything about GHC. I also don't know what (# x,y #) means?
Unboxed tuple (pair in this case). That means: you get the components by themselves, without wrapping them in a (,) constructor.
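At the source level that syntax looks like this (a minimal sketch of mine, not code from the thread; it needs the UnboxedTuples extension):

{-# LANGUAGE UnboxedTuples #-}

-- returns quotient and remainder without allocating a (,) box for them
divMod' :: Int -> Int -> (# Int, Int #)
divMod' n d = (# n `div` d, n `mod` d #)

useIt :: Int
useIt = case divMod' 7 2 of (# q, r #) -> q + r   -- q = 3, r = 1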
One thing seems certain - we should not hold explicit references to the same entities in different parts of our code, to avoid space leaks with more confidence.
Except when it's good to share ;)
To make code look as much tail-recursive as possible, so to speak.
Tail recursion isn't really important (for GHC at least, I think for lazy languages in general), due to different evaluation models (cf. http://www.haskell.org/pipermail/haskell-cafe/2009-March/058607.html and the thread starting with http://www.haskell.org/pipermail/haskell-cafe/2007-May/025570.html).
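A tiny sketch of what that means in practice (my own example, not from the thread): the producer below is not tail recursive, yet a consumer such as take runs in constant space, because each (:) cell is handed over before the recursive call is forced.

-- guarded recursion: the constructor sits outside the recursive call
countFrom :: Integer -> [Integer]
countFrom n = n : countFrom (n + 1)

firstFive :: [Integer]
firstFive = take 5 (countFrom 1)   -- [1,2,3,4,5], no deep recursion forced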
Does that make sense?
In general, yes.
Anyway that was a very educational (for me) and fruitful discussion, and I greatly appreciate your help, and fixing and improving of the code.
Thanks!
You're welcome.

Daniel Fischer
roll  = scanl (+)
wheel = 2:4:2:4:6:2:6:4:2:4:6:6:2:6:4:2:6:4:6:8:4:2:4:2:
        4:8:6:4:6:2:4:6:2:6:6:4:2:4:6:2:6:4:2:4:2:10:2:10:wheel

wheel11 = res
  where
    snms  = scanl (+) 11 (take 47 wheel)
    nums  = tail $ scanl (+) 11 (take 529 wheel)
    cops  = nums `minus` map (*11) snms
    diffs = zipWith (-) (tail cops) cops
    res   = foldr (:) res diffs

wheel13 = res
  where
    snms  = take 480 $ scanl (+) 13 wheel11
    nums  = take (480*13+1) . tail $ scanl (+) 13 wheel11
    cops  = nums `minus` map (*13) snms
    diffs = zipWith (-) (tail cops) cops
    res   = foldr (:) res diffs
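The wheel construction above and the sieves that follow all lean on ordered-list merge and minus; for reference, a sketch of the standard definitions (my rendering, not quoted from any particular post):

minus :: Ord a => [a] -> [a] -> [a]
minus xs@(x:xt) ys@(y:yt) = case compare x y of
    LT -> x : minus xt ys       -- keep x, it is not in ys
    EQ ->     minus xt yt       -- drop the common element
    GT ->     minus xs yt
minus xs _ = xs

merge :: Ord a => [a] -> [a] -> [a]
merge xs@(x:xt) ys@(y:yt) = case compare x y of
    LT -> x : merge xt ys
    EQ -> x : merge xt yt       -- duplicates collapse into one
    GT -> y : merge xs yt
merge xs [] = xs
merge [] ys = ys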
BTW have you seen my take on the "faithful Euler's sieve"? It shows another way to look at the wheels, which was really the first time I understood what's going on there. It also makes for easier wheel extension (IMO):
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
primes = euler [2..]
The essence of Euler's sieve is the wheel: after each step we're left with lists of non-multiples of the preceding primes:

primes = euler $ listFrom [2] 1
 = 2:euler (listFrom [3] 1 `minus` map(2*) (listFrom [2] 1))
            listFrom [3,4] 2 `minus` listFrom [4] 2
              -- listFrom [3] 2 --
 = 2:3:euler (listFrom [5] 2 `minus` map(3*) (listFrom [3] 2))
              listFrom [5,7,9] 6 `minus` listFrom [9] 6
              -- listFrom [5,7] 6 --
 = 2:3:5:euler (listFrom [7,11] 6 `minus` listFrom [25,35] 30)
              [7,11, 13,17, 19,23, 25,29, 31,35] 30
              -- listFrom [7,11,13,17,19,23,29,31] 30 --
 = .....
  where
    listFrom xs by = concat $ iterate (map (+ by)) xs

so

-- startRoll = ([2],1)

nextRoll r@(xs@(x:xt),b) = ((x,r'), r')
  where
    ys@(y:_) = xt ++ [x + b]
    r'  = (xs',b')
    b'  = x*b
    xs' = takeWhile (< y + b') (listFrom ys b) `minus` map (x*) xs

rolls = unfoldr (Just . nextRoll) ([2],1)

nthWheel n = let (ps,rs)  = unzip $ take n rolls
                 (x:xs,b) = last rs
             in ((ps, x), zipWith (-) (xs++[x+b]) (x:xs))

{-
*Main> mapM print $ take 4 rolls
(2,([3],2))
(3,([5,7],6))
(5,([7,11,13,17,19,23,29,31],30))
(7,([11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,
 101,103,107,109,113,121,127,131,137,139,143,149,151,157,163,167,169,
 173,179,181,187,191,193,197,199,209,211],210))

*Main> nthWheel 3
(([2,3,5],7),[4,2,4,2,4,6,2,6])

*Main> nthWheel 4
(([2,3,5,7],11),[2,4,2,4,6,2,6,4,2,4,6,6,2,6,4,2,6,4,6,8,4,2,4,2,
 4,8,6,4,6,2,4,6,2,6,6,4,2,4,6,2,6,4,2,4,2,10,2,10])
-}

Coincidentally, the function specified by

eulerPrimes n = let (ps,rs)      = unzip $ take n rolls
                    (qs@(q:_),b) = last rs
                in ps ++ takeWhile (< q^2) qs

can be used to write the specialized nthEulerPrime etc., whose complexity seems to be about O(n^1.5). Maybe this reinvents some spiraling wheels or somethn'. :)

Daniel Fischer
On Thursday 07 January 2010 11:43:44, Heinrich Apfelmus wrote:
Will Ness wrote:
Heinrich Apfelmus writes:
The below code is now a well-behaved memory citizen (3MB for the 100
millionth prime, about the same as the PQ code). It still is considerably slower than the PQ code.
In terms of MUT times as reported by +RTS -sstderr - as well as (MUT+GC) times - (measured once for prime No. 5*10^5, 10^6, 2*10^6, 4*10^6, 10^7 to get a rough tendency), it seems to scale a wee bit better than any of the other tfold versions I created, but a little worse than the PQ versions. The relation of MUT times isn't too bad, but the GC times are rather abysmal (30-40%).
----------------------------------------------------------------------
data People a = P { vips :: [a], dorks :: [a] }
celebrate x p = P (x:vips p) (dorks p)
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
  where
    mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
    spMerge l1 [] l3 = P [] (merge l1 l3)
    spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
        LT -> celebrate x (spMerge xs l2 l3)
        EQ -> celebrate x (spMerge xs ys l3)
        GT -> celebrate y (spMerge l1 ys l3)
I forgot to say something *very* important. :) Here it is. Yippee-hurray! :)

Heinrich Apfelmus
Will Ness wrote:
But I get the impression that GHC isn't working through equational reasoning?.. I see all this talk about thunks etc.
Sure it does. Concerning the thunks, they're part of the implementation of the reduction model (call-by-need aka lazy evaluation).
At run-time? I meant to eliminate as much calculation as possible, pre-run-time. I would expect the best efforts of the best minds to go into searching for ways how to eliminate computations altogether, instead of how to perform them better.
Concerning the sieves, there is a fundamental difference between the imperative sieve and the functional sieves, regardless of whether the latter start at p or p^2 or use a priority queue. [...]
We can directly jump to the next multiple too, it is called (+). :) :)
But seriously, the real issue is that we have to merge the produced streams of multiples, while the mutable-storage code works on the same one, so its "merging cost" is zero.
Not quite, I think there are two things going on:
1. In contrast to the standard imperative version, we are implementing an on-line algorithm that produces arbitrarily many primes on demand. Any imperative on-line version will need to think about storage for pending prime filters as well.
True.
2. Even the imperative algorithm can be interpreted as "merging" arrays, just that the composites are magically merged at the right place (at distance p from each other) because arrays offer O(1) "jumps".
i.e. with a merging cost of zero. :)
In contrast, the functional version needs O(log something) time to calculate where to put the next composite.
when thinking in terms of the finally produced sequence, yes. We have to produce numbers one by one and take care of their ordering ourselves; _they_ just /throw/ the numbers at the shared canvas and let _it_ take care of ordering them automagically, _later_, on the final sweep through. ISWYM.
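For contrast, a sketch of the mutable-array version being alluded to (my own code with hypothetical names, not part of the thread): every composite is written straight onto the shared "canvas" by an O(1) indexed write, so no merging of streams ever takes place.

import Control.Monad (forM_, when)
import Data.Array.ST (newArray, readArray, runSTUArray, writeArray)
import Data.Array.Unboxed (UArray, assocs)

-- ordinary array-based sieve of Eratosthenes up to n
sieveUpTo :: Int -> [Int]
sieveUpTo n = [i | (i, True) <- assocs flags]
  where
    flags :: UArray Int Bool
    flags = runSTUArray (do
      a <- newArray (2, n) True
      forM_ (takeWhile (\p -> p * p <= n) [2 ..]) $ \p -> do
        isPrime <- readArray a p
        when isPrime $
          forM_ [p * p, p * p + p .. n] $ \m ->
            writeArray a m False        -- the O(1) "jump": no stream merging
      return a)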
If you could take a look at the tree-merging primes and why it uses too much memory, it would be great.
Fortunately, Daniel already took a detailed look. :)
Yes he really came through! He finally found, and fixed, the space leak. It was hiding in

mergeSP_BAD (a,b) ~(c,d) = let (bc,b') = spMerge b c
                           in (a ++ bc, merge b' d)
   where
    spMerge :: (Ord a) => [a] -> [a] -> ([a],[a])
    spMerge a@(x:xs) b@(y:ys) = case compare x y of
            LT -> (x:c,d) where (c,d) = spMerge xs b
            EQ -> (x:c,d) where (c,d) = spMerge xs ys
            GT -> (y:c,d) where (c,d) = spMerge a  ys
    spMerge a [] = ([] ,a)
    spMerge [] b = ([] ,b)

which really ought to have been

mergeSP (a,b) ~(c,d) = let ~(bc,bd) = spMerge b c d
                       in (a ++ bc, bd)
   where
    spMerge u [] d = ([], merge u d)
    spMerge u@(x:xs) w@(y:ys) d = case compare x y of
            LT -> spCons x $ spMerge xs w  d
            EQ -> spCons x $ spMerge xs ys d
            GT -> spCons y $ spMerge u  ys d
    spCons x ~(a,b) = (x:a,b)

Can you spot the difference? :) :)
Aww, why remove the cutesy name? The VIPs will be angry for being ignored!
It runs faster on plain pairs, and on less memory, equals for equals. For some reason. :)

On Monday 04 January 2010 13:25:47, Will Ness wrote:
Daniel Fischer
writes: On Sunday 03 January 2010 09:54:37, Will Ness wrote:
Daniel Fischer
writes: But there's a lot of list construction and deconstruction necessary for the Euler sieve.
yes. Unless, of course, a smart compiler recognizes there's only one consumer for the values each multiples-list is producing, and keeps it stripped down to a generator function and its current value. I keep thinking a smart compiler could eliminate all these "span" calls and replace them with just some pointer manipulation...
Of course I'm no smart compiler, but I don't see how it could be even possible to replace the span calls with pointer manipulation when dealing with lazily generated (infinite, if we're really mean) lists. Even when you're dealing only with strict finite lists, it's not trivial to do efficiently.
I keep thinking that storage duplication with span, filter etc. is not really necessary, and can be replaced with some pointer chasing - especially when there's only one consumer present for the generated values.
What I mean is thinking of lists in terms of the producer/consumer paradigm, as objects supporting the { pull, peek } interface, keeping the generator inside that would produce the next value on a 'pull' request and keep it ready for any 'peek's.
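A sketch of that imagined interface (my own hypothetical names, purely illustrative - nothing in the code under discussion is structured this way):

-- a value that can be peeked at repeatedly and pulled to advance
data Gen a = Gen { peek :: a, pull :: Gen a }

-- build a generator from a step function and a starting value
fromStep :: (a -> a) -> a -> Gen a
fromStep f x = Gen x (fromStep f (f x))

-- e.g. the multiples of a prime p, starting at p*p, without a list
multiplesOf :: Integer -> Gen Integer
multiplesOf p = fromStep (+ p) (p * p)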
Euler's sieve is
sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
  where (h,t) = span (< p*p) xs
Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45 appears in two lists of numbers to be removed if present). Euler's sieve is never attempting to remove a number more than once, that's the outstanding feature of it. Unfortunately, that also makes it hard to implement efficiently. The problem is that you can't forget the primes between p and p^2, you must remember them for the removal of multiples of p later on. An (inefficient but faithful) implementation would be

euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)

Its performance is really horrible though. A much better implementation is

primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
  where
    hprs = 5:drop 3 primes
    euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps)
      where
        (h,t) = span (< p*p) cs
        comps = map (*p) (acc ++ cs)

still really bad. I don't yet see how to eat the cake and have it here.
I think that remark was meant to apply to composite removal, not Turner's sieve.
It is right there on page 2, right when Turner's sieve is presented and discussed. The only explanation that I see is that she thought of it in regard to the imperative code, just as her analysis concentrates only on the calculational aspects of the imperative code itself.
To be fair, she writes:
"Let us first describe the original “by hand” sieve algorithm as practiced by Eratosthenes. ... The starting point of p^2 is a pleasing but minor optimization, which can be made because lower multiples will have already been crossed off when we found the primes prior to p. ... (This optimization does not affect the time complexity of the sieve, however, so its absence from the code in Section 1 *is not our cause for worry*.)"
A-HA!
But its absence from _*that*_ _*code*_ WAS the *major* cause for worry, as dealing with it worked wonders on its complexity and constant factors.
I think you're overinterpreting it. Sure, it's not perfectly worded, but her concern was primarily that it's not an Eratosthenes sieve but trial division (and woefully inefficient at that), I think. Now, because it's trial division and not Eratosthenes, it has a major impact, thus the absence of the optimisation is a reinforcing consequence of the original cause for worry. I may be overdefending her, though, we can't really know what she meant.

Daniel Fischer
On Monday 04 January 2010 13:25:47, Will Ness wrote:
Euler's sieve is
sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
  where (h,t) = span (< p*p) xs
Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45 appears in two lists of numbers to be removed if present).
there won't be any such. whatever is removed first, is just skipped second (and third, etc). 45 does appear twice on the two multiples lists (3- and 5-). But it is "removed" by first, and skipped by second. And there's no removal here in the first place. There is no pre-filled storage. All there is, is some lists comparison, and lists are just generators (I so keep hoping). I don't see any problem here. As Melissa (and yourself, I think) have shown, double hits on multiples are few and far between. Also, it uses no filters, i.e. no individual number divisibility testing. The "filters" refers first of all to testing an individual number to decide whether to keep it in or not. Euler's sieve removes multiples in advance, so there's no testing and no filtering, only comparison. It follows the _Postponed_ Filter's framework in postponing the action until the right moment; the action itself is two lists comparison and skipping of the equals (i.e. the "minus" action).
Euler's sieve is never attempting to remove a number more than once, that's
How's that possible? On wikipedia it says, it removes multiples of 3; then multiples of 5; etc. So it just looks for each number it is removing, if it is there, and if so, removes it. I would argue that looking whether to remove or not, is part of attempting to remove. It can't have foresight, right?
the outstanding feature of it. Unfortunately, that also makes it hard to implement efficiently. The problem is that you can't forget the primes between p and p^2, you must remember them for the removal of multiples of p later on.
you're right, every formulation of these algorithms is always done in the imperative, mutable-storage setting. They always speak of "removing" numbers etc. The more pressing problem with that code is its linear structure of course which gets addressed by the tree-folding merge etc.
An (inefficient but faithful) implementation would be
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
I think it does exactly the same thing, computationally, except it does so, again, _prematurely_ exactly in Turner's sieve's fashion - for each prime found, _as_ _soon_ as it is found. If it is faithful, so is my code. :)
Its performance is really horrible though.
makes sense, as for Turner's. See, the _when_ thing really helps to see what's going on here.
A much better implementation is
primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
  where
    hprs = 5:drop 3 primes
    euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps)
      where
        (h,t) = span (< p*p) cs
        comps = map (*p) (acc ++ cs)
this looks like a {2,3} wheel reimplemented and inlined. No point in improving anything until the linear structure is turned into a tree.
To be fair, she writes:
... (This optimization does not affect the time complexity of the sieve, however, so its absence from _the_ _code_ in Section 1 is _not_ our cause for worry.)"
A-HA!
But its absence from _*that*_ _*code*_ WAS the *major* cause for worry, as dealing with it worked wonders on its complexity and constant factors.
I think you're overinterpreting it.
I might have. I don't mean to nitpick, honest. It's just how it looked to me: "Turner's? - bad. Squares? - won't help, it's a _minor_ thing; don't even go there!" etc. As I've said, geniuses leap, and fly. I just wanted to _walk_ down that road, not skipping any steps. And it turned out, the very first step _really_ pays up _big_. So big in fact, it might be argued that any subsequent improvement is just an advanced optimization, in terms of presenting an introductory code which isn't horribly inefficient and yet is short and clear.

On Monday 04 January 2010 19:16:32, Will Ness wrote:
Daniel Fischer
writes: On Monday 04 January 2010 13:25:47, Will Ness wrote:
Euler's sieve is
sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
  where (h,t) = span (< p*p) xs
Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45 appears in two lists of numbers to be removed if present).
there won't be any such. whatever is removed first, is just skipped second (and third, etc). 45 does appear twice on the two multiples lists (3- and 5-). But it is "removed" by first, and skipped by second. And there's no removal here in the first place. There is no pre-filled storage. All there is, is some lists comparison, and lists are just generators (I so keep hoping).
((45:(offer 47 when demanded)) `minus` (45:(next will be 51 when demanded)))
                               `minus` (45:(next will be 55 when demanded))

So there are two attempts to tell the generator to not output 45. To the second, it answers "I already knew that", but the request is made nevertheless. Lists sometimes are data structures occupying physical storage (cons cell, pointer to value, pointer to next...), sometimes generators (when asked for the next value, I'll answer this). I say a value is removed from a list regardless of whether it takes relinking pointers in a list of cons cells or it involves telling the generator not to give out that value. I could also say 'eliminated'. There are two attempts to eliminate 45.
I don't see any problem here. As Melissa (and yourself, I think) have shown, double hits on multiples are few and far between.
It's not a problem, it just means it's not Euler's sieve, because that attempts to eliminate each composite exactly once.
Also, it uses no filters, i.e. no individual number divisibility testing. The "filters" refers first of all to testing an individual number to decide whether to keep it in or not.
Umm, the postponed filters says keep everything until p^2, then eliminate (filter out, remove) multiples of p in the remainder, after that, pick next prime. That's precisely what the above does. It doesn't do the filtering out by divisibility testing but by minus (hence more efficiently). I would say that's a variant of the postponed filters.
Euler's sieve removes multiples in advance, so there's no testing and no filtering, only comparison. It follows the _Postponed_ Filter's framework in postponing the action until the right moment; the action itself is two lists comparison and skipping of the equals (i.e. the "minus" action).
Euler's sieve is never attempting to remove a number more than once, that's
How's that possible?
http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Euler.27s_Sieve "Euler in his Proof of the Euler product formula for the Riemann zeta function came up with a version of the sieve of Eratosthenes, *better in the sense that each number was eliminated exactly once*." C) The number after the previous prime is also a prime. *Multiply each number in the list starting from this prime by this prime and discard the products*.
On wikipedia it says, it removes multiples of 3; then multiples of 5; etc. So it just looks for each number it is removing, if it is there, and if so, removes it. I would argue that looking whether to remove or not, is part of attempting to remove. It can't have foresight, right?
But it has :) By only trying to eliminate products of the prime p currently under consideration *with numbers (>= p) which have not yet been eliminated from the list*, it is known in advance that all these products are still in the list. When p is under consideration, the list contains (apart from the primes < p) precisely the numbers whose smallest prime factor is >= p.
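A small sanity check of that claim (my own sketch, hypothetical names): after the steps for 2 and 3, every product of 5 with a surviving number is itself still a survivor, so the step for 5 never asks to discard something already gone.

-- numbers >= 5 left after Euler's steps for 2 and 3
survivors :: [Integer]
survivors = [n | n <- [5 ..], n `mod` 2 /= 0, n `mod` 3 /= 0]

-- every product of 5 with one of the first k survivors is itself a survivor
claimHoldsFor :: Int -> Bool
claimHoldsFor k = all (`elem` takeWhile (<= 5 * m) survivors) products
  where
    products = map (5 *) (take k survivors)
    m        = survivors !! (k - 1)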
the outstanding feature of it. Unfortunately, that also makes it hard to implement efficiently. The problem is that you can't forget the primes between p and p^2, you must remember them for the removal of multiples of p later on.
you're right, every formulation of these algorithms is always done in the imperative, mutable-storage setting. They always speak of "removing" numbers etc.
sed s/remove/eliminate/ ?
The more pressing problem with that code is its linear structure of course which gets addressed by the tree-folding merge etc.
Which unfortunately introduces its own space problem :(
An (inefficient but faithful) implementation would be
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
I think it does exactly the same thing, computationally,
No, in map (*p) ks, there are no numbers with a smaller prime factor than p, while in map (*p) [p, p+2 .. ] there are (for p > 3)
except it does so, again, _prematurely_ exactly in Turner's sieve's fashion - for each prime found, _as_ _soon_ as it is found. If it is faithful, so is my code. :)
It is faithful in that it tries to eliminate each composite exactly once. Try a different minus:

xxs@(x:xs) `minus` yys@(y:ys) = case compare x y of
    LT -> x : xs `minus` yys
    EQ -> xs `minus` ys
    GT -> error ("trying to remove " ++ show y ++ " a second time")

Your code is not. It is, however, much faster.
Its performance is really horrible though.
makes sense, as for Turner's.
It's far worse than Turner's. The "map (*p) ks" holds on to things really really long. Measure it, just for teh lulz.
See, the _when_ thing really helps to see what's going on here.
A much better implementation is
primes = 2:3:euler hprs [] (scanl (+) 5 (cycle [2,4]))
  where
    hprs = 5:drop 3 primes
    euler (p:ps) acc cs = h ++ euler ps (tail (acc ++ h)) (t `minus` comps)
      where
        (h,t) = span (< p*p) cs
        comps = map (*p) (acc ++ cs)
this looks like a {2,3} wheel reimplemented and inlined.
The {2,3} wheel thing isn't important. It could equally well (well, a bit slower) be

primes = euler hprs [] [2 .. ]
  where
    hprs = 2:tail primes

It just occurred to me that the accumulation list is just (takeWhile (< head cs) (p:ps)), so we could improve it as

euler (p:ps) cs = h ++ euler ps (t `minus` comps)
  where
    (h,t) = span (< p*p) cs
    comps = map (*p) (takeWhile (< head cs) (p:ps) ++ cs)

Still Bad.
No point in improving anything until the linear structure is turned into a tree.
You're a tree hugger? :D No, the point isn't linear vs. tree, it's that Euler's sieve is a) great for theoretical reasoning (look at a proof of Euler's product formula for the (Riemann) Zeta-function to appreciate it) b) reasonably good for sieving with a list of numbers written on paper, but c) not good to implement on a computer. For the computer, keeping track of which numbers are left is so much more complicated than just ticking off numbers a couple of times that it isn't even worth attempting it.

Daniel Fischer
On Monday 04 January 2010 19:16:32, Will Ness wrote:
Daniel Fischer
writes: On Monday 04 January 2010 13:25:47, Will Ness wrote:
Euler's sieve is
sieve (p:ps) xs = h ++ sieve ps (t `minus` map (p*) [p,p+2..])
  where (h,t) = span (< p*p) xs
Not quite. That's a variant of the postponed filters, it crosses off e.g. 45 twice, once as 3*15 and once as 5*9 (okay, it has already been removed by the first, so let's say 45 appears in two lists of numbers to be removed if present).
there won't be any such. whatever is removed first, is just skipped second (and third, etc).
((45:(offer 47 when demanded)) `minus` (45:(next will be 51 when demanded))) `minus` (45:(next will be 55 when demanded))
So there are two attempts to tell the generator to not output 45. To the second, it answers "I already knew that", but the request is made nevertheless.
yes, of course.
... There are two attempts to eliminate 45.
I would say there are two requests to not have 45 in the output.
I don't see any problem here. As Melissa (and yourself, I think) have shown, double hits on multiples are few and far between.
It's not a problem, it just means it's not Euler's sieve, because that attempts to eliminate each composite exactly once.
yes I see now. My bad. Wasn't reading that wikipedia page with enough attention to detail. It uses the modified (culled, so far) numbers to find out the next multiples to be eliminated, not just the prime itself like my code does. Your solution is indeed exponential:

euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
primes = 2:euler [3,5..]

primes = 2:euler (as@[3,5..])
             3:euler (bs@(tail as `minus` map (3*) as))
               5:euler (cs@(tail bs `minus` map (5*) bs))

There are two separate look-back pointers to /as/ in /bs/, and there are two such in /cs/, to /bs/. The book-keeping machinery explodes.
Also, it uses no filters, i.e. no individual number divisibility testing. The "filters" refers first of all to testing an individual number to decide whether to keep it in or not.
Umm, the postponed filters says keep everything until p^2, then eliminate (filter out, remove) multiples of p in the remainder, after that, pick next prime. That's precisely what the above does. It doesn't do the filtering out by divisibility testing but by minus (hence more efficiently). I would say that's a variant of the postponed filters.
Filter usually (as in Haskell's 'filter') is about testing individual elements by a predicate function. There is of course a filtering effect in the two lists' elements comparison that 'minus' performs, so that's debatable. Even the PQ code performs "filtering" in this wider sense.
Euler's sieve is never attempting to remove a number more than once, that's
How's that possible?
http://en.wikipedia.org/wiki/Sieve_of_Eratosthenes#Euler.27s_Sieve
C) The number after the previous prime is also a prime. *Multiply each number /that's left/ in the list starting from this prime by this prime and discard the products*.
yes. Wasn't paying attention to that, more to the "intent" of it. There's of course enormous vagueness in how exactly it is to be performed, in the unbounded case, which you uncovered here.
It can't have foresight, right?
But it has :) By only trying to eliminate products of the prime p currently under consideration *with numbers (>= p) /which have not/ /yet been eliminated/ from the list*, it is known in advance that all these products are still in the list.
When p is under consideration, the list contains (apart from the primes < p)
missed that. precisely the numbers whose smallest prime factor is >= p.
the outstanding feature of it. Unfortunately, that also makes it hard to implement efficiently. The problem is that you can't forget the primes between p and p^2, you must remember them for the removal of multiples of p later on.
not just primes - all the composites not yet removed, too. So it can't even be implemented on shared mutable storage if we want to delay the removal (as we must, in unbounded case) - the composites will get removed so their multiples must actually all be produced first!
The more pressing problem with that code is its linear structure of course which gets addressed by the tree-folding merge etc.
Which unfortunately introduces its own space problem :(
Try a different minus:
xxs@(x:xs) `minus` yys@(y:ys) = case compare x y of
    LT -> x : xs `minus` yys
    EQ -> xs `minus` ys
    GT -> error ("trying to remove " ++ show y ++ " a second time")
Your code is not. It is, however, much faster.
I understand now. Thanks!
Its performance is really horrible though.
exponential, empirically as well.
It just occurred to me that the accumulation list is just (takeWhile (< head cs) (p:ps)), so we could improve it as
euler (p:ps) cs = h ++ euler ps (t `minus` comps)
  where
    (h,t) = span (< p*p) cs
    comps = map (*p) (takeWhile (< head cs) (p:ps) ++ cs)
Still Bad.
yes, _both_ /t/ and /comps/ have their back-pointers into cs.
No point in improving anything until the linear structure is turned into a tree.
You're a tree hugger? :D No, the point isn't linear vs. tree, it's that Euler's sieve is a) great for theoretical reasoning (look at a proof of Euler's product formula for the (Riemann) Zeta-function to appreciate it) b) reasonably good for sieving with a list of numbers written on paper, but c) not good to implement on a computer. For the computer, keeping track of which numbers are left is so much more complicated than just ticking off numbers a couple of times that it isn't even worth attempting it.

On Tuesday 05 January 2010 14:49:58, Will Ness wrote:
... There are two attempts to eliminate 45.
I would say there are two requests to not have 45 in the output.
There are many possible ways to phrase it.
I don't see any problem here. As Melissa (and yourself, I think) have shown, double hits on multiples are few and far between.
It's not a problem, it just means it's not Euler's sieve, because that attempts to eliminate each composite exactly once.
yes I see now. My bad. Wasn't reading that wikipedia page with enough attention to detail. It uses the modified (culled, so far) numbers to find out the next multiples to be eliminated,
Minor factual error, no big deal.
not just the prime itself like my code does.
Which is operationally far better because it's much simpler :)
Your solution is indeed exponential:
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
primes = 2:euler [3,5..]
primes = 2:euler (as@[3,5..])
             3:euler (bs@(tail as `minus` map (3*) as))
               5:euler (cs@(tail bs `minus` map (5*) bs))
There are two separate look-back pointers to /as/ in /bs/, and there are two such in /cs/, to /bs/. The book-keeping machinery explodes.
Also, it uses no filters, i.e. no individual number divisibility testing. The "filters" refers first of all to testing an individual number to decide whether to keep it in or not.
Umm, the postponed filters says keep everything until p^2, then eliminate (filter out, remove) multiples of p in the remainder, after that, pick next prime. That's precisely what the above does. It doesn't do the filtering out by divisibility testing but by minus (hence more efficiently). I would say that's a variant of the postponed filters.
Filter usually (as in Haskell's 'filter') is about testing individual elements by a predicate function. There is of course a filtering effect in the two lists' elements comparison that 'minus' performs, so that's debatable. Even the PQ code performs "filtering" in this wider sense.
I understood the 'filters' in postponed filters in that wider sense. And I would also find it perfectly acceptable to say that the PQ code filters out the composites from the stream. Of course if you're used to using 'filter' only in the stricter sense, you wouldn't call the `minus` thing a variant of the postponed filters.
not just primes - all the composites not yet removed, too.
Between p and p^2, there are only primes left, fortunately.
So it can't even be implemented on shared mutable storage if we want to delay the removal (as we must, in unbounded case) -
Yes. And it's not nice in the bounded case either.

Daniel Fischer
On Tuesday 05 January 2010 14:49:58, Will Ness wrote:
... There are two attempts to eliminate 45.
I would say there are two requests to not have 45 in the output.
There are many possible ways to phrase it.
Your solution is indeed exponential:
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
primes = 2:euler [3,5..]
primes = 2:euler (as@[3,5..])
             3:euler (bs@(tail as `minus` map (3*) as))
               5:euler (cs@(tail bs `minus` map (5*) bs))
Re-write:

euler s = head s : euler (tail s `minus` map (head s*) s)
primes = euler [2..]

primes = euler $ rollFrom [2] 1
 = 2:euler (rollFrom [3] 1 `minus` map(2*) (rollFrom [2] 1))
            rollFrom [3,4] 2 `minus` rollFrom [4] 2
              -- rollFrom [3] 2 --
 = 2:3:euler (rollFrom [5] 2 `minus` map(3*) (rollFrom [3] 2))
              rollFrom [5,7,9] 6 `minus` rollFrom [9] 6
              -- rollFrom [5,7] 6 --
 = 2:3:5:euler (rollFrom [7,11] 6 `minus` rollFrom [25,35] 30)
              [7,11, 13,17, 19,23, 25,29, 31,35] 30
              -- rollFrom [7,11,13,17,19,23,29,31] 30 --
 = .....
  where
    rollOnce (x:xs) by = (x, tail xs ++ [x+by])
    rollFrom xs by     = concat $ iterate (map (+ by)) xs
    multRoll xs@(x:_) by p = takeWhile (< (x+p*by)) $ rollFrom xs by

so, reifying, we get

data Roll a = Roll [a] a

rollOnce (Roll (x:xs) by)    = (x, Roll (xs ++ [x+by]) by)
rollFrom (Roll xs by)        = concat $ iterate (map (+ by)) xs
multRoll r@(Roll (x:_) by) p = Roll (takeWhile (< (x+p*by)) $ rollFrom r) (by*p)

primes = euler $ Roll [2] 1
euler r@(Roll xs _) = x : euler (Roll (mxs `minus` map (x*) xs) mby)
  where
    (x,r')         = rollOnce r
    (Roll mxs mby) = multRoll r' x

There are many extra primes pre-calculated inside the Roll, of course (upto p^2 in fact, for p = primes!!n ), so this needs to be somehow tapped into, by writing a specialized nthPrime n = .... to be called instead of (!! n), and also primesUpTo n = .... This calculated primes !! 1000 == 7927 in 3 seconds for me, interpreted, on my old slowish laptop. It's supposed to have all the primes upto 7927^2 = 62837329 inside the Roll (if I'm not mistaken - or is it?). That's about the 3.72 millionth prime, according to WolframAlpha. (nah, it can't be that much). But it is surely not horribly slow. Is this, in fact, the wheels' "spiral"?
not just primes - all the composites not yet removed, too.
Between p and p^2, there are only primes left, fortunately.
but (map (*p) ks) needs access to the original, non-culled numbers - primes, composites and all. (?)
So it can't even be implemented on shared mutable storage if we want to delay the removal (as we must, in unbounded case) -
Yes. And it's not nice in the bounded case either.

Will Ness
Daniel Fischer
writes: On Tuesday 05 January 2010 14:49:58, Will Ness wrote:
euler ks@(p:rs) = p : euler (rs `minus` map (*p) ks)
primes = 2:euler [3,5..]
Re-write:
primes = euler $ rollFrom [2] 1
 = 2:euler (rollFrom [3] 1 `minus` map(2*) (rollFrom [2] 1))
            rollFrom [3,4] 2 `minus` rollFrom [4] 2
              -- rollFrom [3] 2 --
 = 2:3:euler (rollFrom [5] 2 `minus` map(3*) (rollFrom [3] 2))
              rollFrom [5,7,9] 6 `minus` rollFrom [9] 6
              -- rollFrom [5,7] 6 --
 = 2:3:5:euler (rollFrom [7,11] 6 `minus` rollFrom [25,35] 30)
              [7,11, 13,17, 19,23, 25,29, 31,35] 30
              -- rollFrom [7,11,13,17,19,23,29,31] 30 --
 = .....
correction:

  where
    rollOnce (x:xs) by = (x, xs ++ [x+by])
    rollFrom xs by     = concat $ iterate (map (+ by)) xs
    multRoll xs@(x:_) by p = takeWhile (< (x+p*by)) $ rollFrom xs by
so, reifying, we get
data Roll a = Roll [a] a
rollOnce (Roll (x:xs) by)    = (x, Roll (xs ++ [x+by]) by)
rollFrom (Roll xs by)        = concat $ iterate (map (+ by)) xs
multRoll r@(Roll (x:_) by) p = Roll (takeWhile (< (x+p*by)) $ rollFrom r) (by*p)
primes = euler $ Roll [2] 1
euler r@(Roll xs _) = x : euler (Roll (mxs `minus` map (x*) xs) mby)
  where
    (x,r')         = rollOnce r
    (Roll mxs mby) = multRoll r' x
There are many extra primes pre-calculated inside the Roll, of course. For any (Roll xs@(x:_) _), (takeWhile (< x*x) xs) are all primes too. When these are used, the code's complexity is around O(n^1.5), and it runs about 1.8x slower than the Postponed Filters. The "faithful sieve"'s empirical complexity is above 2.10..2.25 and rising. So it might not be exponential, but it seems to be worse than any fixed power anyway.

Daniel Fischer
On Sunday 03 January 2010 09:54:37, Will Ness wrote:
The question of a memory blowup with the tree-folding merge still remains. For some reason using its second copy for a feeder doesn't reduce the memory (as reported by a standalone compiled program; GHCi-reported values are useless) - it causes it to increase twice.....
I have a partial solution. The big problem is that the feeder holds on to the beginning of comps while the main runner holds on to a later part. Thus the entire segment between these two points must be in memory.
So have two lists of composites (yeah, you know that, but it didn't work so far).
But you have to force the compiler not to share them: enter -fno-cse. The attached code does that (I've also expanded the wheel), it reduces the memory requirements much (a small part is due to the larger wheel, a factor of ~5 due to the non-sharing).
I don't understand. What is there to be shared? Each multiples list is consumed only at one point; there's nothing to be shared. Do you mean the compiler still hangs on to them? If so, why?? I used the switch; it didn't help at all. The only thing I can see is different is that all my interim data which I named with inner vars you moved out to the top level as functions. Is that what did the trick? What would be the reason to hang on to the already consumed data that is inaccessible to any active consumer? Why not make the forgetful behaviour the norm - especially where remembering is pointless??
It still uses much more memory than the PQ, and while the PQ's memory requirements grow very slowly, the tree-fold merge's still grow rather fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
You did it! It's now 7M for 1,000,000th prime, instead of 52M before. Making the pattern lazy in mergeSP was probably an important fix too. :) Unfortunately it grows, as you've said - 23MB for 2 mln. :| PQ stays at just 2MB.
Attachment (V13Primes.hs): text/x-haskell, 3621 bytes

On Monday 04 January 2010 18:08:42, Will Ness wrote:
Daniel Fischer
writes: On Sunday 03 January 2010 09:54:37, Will Ness wrote:
The question of a memory blowup with the tree-folding merge still remains. For some reason using its second copy for a feeder doesn't reduce the memory (as reported by a standalone compiled program; GHCi-reported values are useless) - it causes it to increase twice.....
I have a partial solution. The big problem is that the feeder holds on to the beginning of comps while the main runner holds on to a later part. Thus the entire segment between these two points must be in memory.
So have two lists of composites (yeah, you know that, but it didn't work so far).
But you have to force the compiler not to share them: enter -fno-cse. The attached code does that (I've also expanded the wheel), it reduces the memory requirements much (a small part is due to the larger wheel, a factor of ~5 due to the non-sharing).
I don't understand. What is there to be shared? Each multiples list is consumed only at one point; there's nothing to be shared. Do you mean the compiler still hangs on to them? If so, why??
Okay, let's look at the code. Your Primes.hs says

{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_) = tfold mergeSP (pairwise mergeSP mults)
    mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'
    fromList (x:xs) = ([x],xs)

What's going on evaluating primes'? First, we know primes' is 11 : 13 : more. Then we can start looking at 'more'. For that, we need rollFrom 11 `minus` comps. rollFrom 11 is easy. What about comps? We need the start of primes' for that. Cool, we already know the first two elements, start tfolding, we know comps is

121:143:169:notYetKnown

So the rollFrom 11 produces happily, minus doesn't veto anything for a while, then says no to 121, 143, 169. Now rollFrom 11 says "look at 173". Fine, what's next in comps? To find that out, we need the multiples of 17 and 19 (and 23 and 29, but those can still be deferred for a while). That gives ([289,323,361],whatever), then the mergeSP with the merged multiples of 11 and 13 produces (187:209:221:...) for notYetKnown.

Everything's going on fine, but the problem is, when primes' produces primes in the region of n, it needs the comps in that region. Those contain the multiples of primes in the region of sqrt n, thus the small primes cannot yet be released. Bottom line, to produce p, all the primes from about sqrt p to p must be in memory. Oops.

Enter the feeder. Say

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes'  = (rollFrom 11) `minus` comps
    primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_) = tfold mergeSP (pairwise mergeSP mults)
    mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    fromList (x:xs) = ([x],xs)

Now primes'', the feeder of composites advances much more slowly and for primes' to produce p, primes'' must produce the next prime after sqrt p, thus holds on to all primes between sqrt (sqrt p) and sqrt p - not much of a problem. primes' doesn't hold on to smaller primes, great. But! primes' needs the part of comps around p. primes'' needs the part of comps around sqrt p. That means we must keep the part of the list of composites from sqrt p to p in memory. Even after removing the multiples of 2, 3, 5 and 7, it doesn't take long until there are far more composites between sqrt p and p than primes. Double oops.

So we must make sure that the list of composites that primes' consumes is not the same as that which primes'' consumes. You can try

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes'  = (rollFrom 11) `minus` comps
    primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps2
    (comps,_) = tfold mergeSP (pairwise mergeSP mults)
    mults = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    (comps2,_) = tfold mergeSP (pairwise mergeSP mults2)
    mults2 = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    fromList (x:xs) = ([x],xs)

If you compile without optimisations, it works. It just is rather slow. Unfortunately, the optimiser sees that comps2 and comps are the same and thinks, "why should I produce it twice if they're the same?" - whoops, back to square two. Tell the compiler that you do *not* want it shared via -fno-cse and it isn't shared, you really have two distinct lists of composites, happiness ensues - almost.
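If the command-line flag feels too global, it can also be pinned to the module itself; a stand-alone toy sketch (my own example - the two lists below merely stand in for comps and comps2, this is not the real sieve):

{-# OPTIONS_GHC -fno-cse #-}
module NoCseSketch (compsA, compsB) where

-- Two textually identical lists; with common-subexpression elimination
-- disabled for this module, GHC keeps them as separate computations, so
-- two consumers at different positions don't retain one shared spine.
compsA, compsB :: [Integer]
compsA = [ n | n <- [4 ..], even n ]
compsB = [ n | n <- [4 ..], even n ]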
I used the switch; it didn't help at all. The only thing I can see is different is that all my interim data which I named with inner vars you moved out to the top level as functions. Is that what did the trick?
Not really, the above works too, with inner local bindings. If you introduce separate names for the lists of multiples and for the lists of composites, -fno-cse ought to do the trick. Maybe it would even work with one list of multiples. Can I see what you did that -fno-cse didn't manage to separate the lists?
What would be the reason to hang on to the already consumed data that is inaccessible to any active consumer?
No reason. And that's not what happened. We always had two consumers gnawing on the same list, one proceeding fast, the other slow. The part the slow consumer was done with could be dropped (and probably was) - the part between the slow and the fast: Not.
Why not make the forgetful behaviour the norm - especially where remembering is pointless??
It is.
It still uses much more memory than the PQ, and while the PQ's memory requirements grow very slowly, the tree-fold merge's still grow rather fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
You did it! It's now 7M for 1,000,000th prime, instead of 52M before. Making the pattern lazy in mergeSP was probably an important fix too. :)
Did I forget to remove them? (checks) Yes. No, I put them in without thinking, then realised "Hey, stupid, where-bindings *are* already lazy!" But somehow I have forgotten to remove them. Since let- and where-bindings are lazy, all those could do is make the compiler say "Sheesh".
Unfortunately it grows, as you've said - 23MB for 2 mln. :|
And I've found out why. Change the definition of tfold to

tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f xs

and memory stays low (things are going much slower, though). By always doing a (pairwise f xs), you're getting the composites faster, but at the expense of pretty huge thunks (I think, if the merging isn't lazy enough, you get some pretty long lists). You can make a compromise by using the above tfold (which is no longer a tree-fold) and grouping (and merging) the multiples in a slower-growing manner, e.g.

compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip ps)))

tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f xs

nwise k f xs = let (ys,zs) = splitAt k xs
               in rfold f ys : nwise (k+1) f zs

rfold f [x]    = x
rfold f (x:xs) = x `f` rfold f xs

memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a bit faster than the version with the original tfold.
PQ stays at just 2MB.
No doubling the size of merge trees, no big thunks, just occasionally add a filter to the PQ. Memory requirements are O(pi(sqrt n)) [off the top of my head, may be wrong again],

On Monday 04 January 2010 22:25:28, Daniel Fischer wrote:
compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip ps)))
tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f xs
nwise k f xs = let (ys,zs) = splitAt k xs in rfold f ys : nwise (k+1) f zs
rfold f [x]    = x
rfold f (x:xs) = x `f` rfold f xs
memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a bit faster than the version with the original tfold.
Not for larger inputs (but not so large that the tree-fold dies OOM). Fix rfold:

rfold f [x] = x
rfold f xs  = rfold f (pairwise f xs)

and it's faster also for those.

Daniel Fischer
On Monday 04 January 2010 22:25:28, Daniel Fischer wrote:
memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a bit faster than the version with the original tfold.
Not for larger inputs (but not so large that the tree-fold dies OOM). Fix rfold:
rfold f [x] = x
rfold f xs  = rfold f (pairwise f xs)
and it's faster also for those.
Niiice!!!! This is just great! :) I tried a two-step feed BTW (that's three separate sets of lists), with the original structure. It ran with same speed as your new version (10..20% faster) but with the memory of the previous one :) (80M for 8 mil primes vs the new one's 10M). But your new structure is just great! I hoped there is something better, that's why I posted it here in the first place. 'pairwise' puts odd leaves higher on the right. It might be better if it was so on the left, for the frequency of production is higher. Thanks a lot for your comments!

On Wednesday 06 January 2010 00:09:07, Will Ness wrote:
Daniel Fischer
writes: On Monday 04 January 2010 22:25:28, Daniel Fischer wrote:
memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a bit faster than the version with the original tfold.
Not for larger inputs (but not so large that the tree-fold dies OOM). Fix rfold:
rfold f [x] = x
rfold f xs  = rfold f (pairwise f xs)
and it's faster also for those.
Niiice!!!! This is just great! :)
I tried a two-step feed BTW (that's three separate sets of lists) , with the original structure. It ran with same speed as your new version (10..20% faster) but with the memory of the previous one :) (80M for 8 mil primes vs the new one's 10M).
The memory is almost completely due to the tree-merging of the multiples for the fastest runner. While it produces faster than flat merging, the exponential growth of the trees makes a bad memory citizen. With the nwise and rfold, a two-step (or even three-step) feeding doesn't gain nearly as much (I found about 1% speedup).
But your new structure is just great! I hoped there is something better, that's why I posted it here in the first place.
I have two more goodies :)

1. Now that the trees don't grow so fast, don't use lazy patterns in the definition of tfold:

   tfold f (a:b:c:xs) = (a `f` (b `f` c)) `f` tfold f xs

   gains something like 6-7% here (and uses a little less memory).

2. Now we have a big wheel, 5760 differences per period. Then dropping a few thousand elements from the wheel after calculating how many in rollFrom is slow:

   rollFrom n = go ((n-17) `rem` 30030) wheel13
     where
       go r xxs@(x:xs)
         | r < x     = roll n xxs
         | otherwise = go (r-x) xs

   gains another couple of percent for large targets (~1% for the 10M prime, ~2% for 20M, I expect that to stay in the region of 1.5-3% for larger targets).
'pairwise' puts odd leaves higher on the right. It might be better if it was so on the left, for the frequency of production is higher.
Maybe. But how would you do it? I tried passing the length to rfold, so when there was an odd number of trees in the list, it would move the first out of the recursion. Any possible gains in production have been more than eaten up by the control code (not a big difference, but it was there).
Thanks a lot for your comments!

Daniel Fischer
On Wednesday 06 January 2010 00:09:07, Will Ness wrote:
Daniel Fischer
writes: On Monday 04 January 2010 22:25:28, Daniel Fischer wrote: Fix rfold:
rfold f [x] = x
rfold f xs  = rfold f (pairwise f xs)
and it's faster also for those.
The memory is almost completely due to the tree-merging of the multiples for the fastest runner. While it produces faster than flat merging, the exponential growth of the trees makes a bad memory citizen.
Isn't the number of nodes the same in any two trees with the same number of leaves? BTW using

compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ map pmults ps

instead of

compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ pairwise mergeSP $ map pmults ps

brings down memory consumption further by 25%..45% on 1..8 mln primes produced, while slowing it down by about 0%..2% (that's after eliminating the lazy pattern in tfold as per your advice).
'pairwise' puts odd leaves higher on the right. It might be better if it was so on the left, for the frequency of production is higher.
Maybe. But how would you do it? I tried passing the length to rfold, so when there was an odd number of trees in the list, it would move the first out of the recursion. Any possible gains in production have been more than eaten up by the control code (not a big difference, but it was there).
yes I've seen this too now. BTW, at a price of further slowing down, memory can be lowered yet more with

compos ps = fst $ tfold mergeSP $ nwise 1 0.4 mergeSP $ map pmults ps

nwise k d f xs = let (ys,zs) = splitAt (round k) xs
                 in rfold f ys : nwise (k+d) d f zs

It really looks like the nearer the structure is to a linear list, the lower the memory consumption becomes. Of course using 0.0 in place of 0.4 would make it into a plain list.

On Wednesday 06 January 2010 19:13:01, Will Ness wrote:
Daniel Fischer
writes: On Wednesday 06 January 2010 00:09:07, Will Ness wrote:
Daniel Fischer
writes: On Monday 04 January 2010 22:25:28, Daniel Fischer wrote: Fix rfold:
rfold f [x] = x rfold f xs = rfold f (pairwise f xs)
and it's faster also for those.
The memory is almost completely due to the tree-merging of the multiples for the fastest runner. While it produces faster than flat merging, the exponential growth of the trees makes a bad memory citizen.
Isn't the number of nodes the same in any two trees with the same number of leaves?
Sure. And I don't know why it takes *that much* memory, but since a flat merge doesn't consume much memory, it must be the trees.
BTW using
compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ map pmults ps
instead of
compos ps = fst $ tfold mergeSP $ nwise 1 mergeSP $ pairwise mergeSP $ map pmults ps
brings down memory consumption further by 25%..45% on 1..8 mln primes produced, while slowing it down by about 0%..2% (that's after eliminating the lazy pattern in tfold as per your advice).
So much? Wow. I forgot the numbers, but I had tried that too, I thought the memory gain wasn't worth the speed loss. Another thing that reduces memory usage is

compos :: [a] -> [a]
compos ps = case pairwise mergeSP (multip ps) of
              ((a,b):cs) -> a ++ funMerge b (nwise 1 mergeSP $ pairwise mergeSP cs)

funMerge b (x:y:zs) = let (c,d) = mergeSP x y in mfun b c d zs

mfun u@(x:xs) w@(y:ys) d l = case compare x y of
    LT -> x:mfun xs w d l
    EQ -> x:mfun xs ys d l
    GT -> y:mfun u ys d l
mfun u [] d l = funMerge (merge u d) l

That's about the same speed as the latter of the above here and uses about 25% less memory. Removing one or two 'pairwise's brings memory down further, but makes it slower.

Daniel Fischer
So we must make sure that the list of composites that primes' consumes is not the same as that which primes'' consumes.
yes that is what I had done too. Duplicated everything. Turns out, it works exactly as you told it would when using the compiler switch, -fno-cse, thanks!
I used the switch; it didn't help at all. The only thing I can see is
wrong. I didn't. When I did, it worked.
Unfortunately it grows, as you've said - 23MB for 2 mln. :|
And I've found out why. Change the definition of tfold to
tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f xs
and memory stays low (things are going much slower, though).
(forced by gmane poster to delete unusually many of your comments today...) Interesting... As for the structure, I chose it trying to minimize the estimated average cost of a composite production, Sum (1/p)*depth.
You can make a compromise by using the above tfold (which is no longer a tree-fold) and grouping (and merging) the multiples in a slower-growing manner, .......
memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a bit faster than the version with the original tfold.
Great! :)

On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought. The huge performance differences I had must have been due to cache misses (unless you use a segmented sieve, starting at the square reduces the number of cache misses significantly).

Daniel Fischer
On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought.
I do not understand. Turner's sieve is

primes = sieve [2..]
  where
    sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]

and the Postponed Filters is

primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
      where (h,~(_:t)) = span (< p*p) xs

Are you saying they both exhibit same complexity? I was under impression that the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes produced).

On Wednesday 30 December 2009 20:46:57, Will Ness wrote:
Daniel Fischer
writes: On Tuesday 29 December 2009 20:16:59, Daniel Fischer wrote:
especially the claim that going by primes squares is "a pleasing but minor optimization",
Which it is not. It is a major optimisation. It reduces the algorithmic complexity *and* reduces the constant factors significantly.
D'oh! Thinko while computing sum (takeWhile (<= n) primes) without paper. It doesn't change the complexity, and the constant factors are reduced far less than I thought.
I do not understand. Turner's sieve is
primes = sieve [2..]
  where
    sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
and the Postponed Filters is
primes = 2: 3: sieve (tail primes) [5,7..]
  where
    sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
      where (h,~(_:t)) = span (< p*p) xs
Are you saying they both exhibit same complexity?
No. They don't. But if you're looking at an imperative (mutable array) sieve (that's simpler to analyse because you don't have to take the book-keeping costs of your priority queue, heap or whatever into account), if you start crossing out the multiples of p with 2*p, you have

sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]

crossings-out, that is Theta(bound*log (log bound)). If you eliminate multiples of some small primes a priori (wheel), you can reduce the constant factor significantly, but the complexity remains the same (you drop a few terms from the front of the sum and multiply the remaining terms with phi(n)/n, where n is the product of the excluded primes).

If you start crossing out at p^2, the number is

sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].

The difference is basically sum (takeWhile (<= sqrt bound) primes), which I stupidly - I don't remember how - believed to cancel out the main term. It doesn't, it's O(bound/log bound), so the complexity is the same.

Now if you take a stream of numbers from which you remove composites, having a priority queue of multiples of primes, things are a little different. If you start crossing out at 2*p, when you are looking at n, you have more multiples in your PQ than if you start crossing out at p^2 (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more expensive. But updating the PQ is O(log size), I believe, and log pi(n) is O(log pi(sqrt n)), so I think it shouldn't change the complexity here either. I think this would have complexity O(bound*log bound*log (log bound)).
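To put numbers to that difference, a small counting sketch (my own helper names, using the crudest possible prime list):

-- crude trial-division primes, good enough for counting
smallPrimes :: Int -> [Int]
smallPrimes bound = takeWhile (\p -> p * p <= bound) ps
  where
    ps        = 2 : filter isPrime [3, 5 ..]
    isPrime n = all (\p -> n `mod` p /= 0) (takeWhile (\p -> p * p <= n) ps)

crossFrom2p, crossFromSquare :: Int -> Int
crossFrom2p     bound = sum [ bound `div` p - 1       | p <- smallPrimes bound ]
crossFromSquare bound = sum [ bound `div` p - (p - 1) | p <- smallPrimes bound ]

Evaluating both at a concrete bound shows the gap is exactly the sum-of-primes term mentioned above - noticeable, but a lower-order contribution.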
I was under impression that the first shows O(n^2) approx., and the second one O(n^1.5) (for n primes produced).
In Turner/postponed filters, things are really different. Actually, Ms. O'Neill is right, it is a different algorithm. In the above, we match each prime only with its multiples (plus the next composite in the PQ version). In Turner's algorithm, we match each prime with all smaller primes (and each composite with all primes <= its smallest prime factor, but that's less work than the primes). That gives indeed a complexity of O(pi(bound)^2). In the postponed filters, we match each prime p with all primes <= sqrt p (again, the composites contribute less). I think that gives a complexity of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes produced.

Daniel Fischer
On Tuesday 29 December 2009 04:38:21, Will Ness wrote:
Now _this_, when tested as interpreted code in GHCi, runs about 2.5x times faster than Priority Queue based code from Melissa O'Neill's ZIP package mentioned at the haskellwiki/Prime_Numbers page, with about half used memory reported, in producing 10,000 to 300,000 primes.
It is faster than BayerPrimes.hs from the ZIP package too, in the tested range, at about 35 lines of code in total.
That's nice. However, the important criterion is how compiled code (-O2)
fares. Do the relations continue to hold? How does it compare to a bitsieve?
OK, I've tested it now. For some reason it runs about 1.3x times slower than Melissa O'Neill's code when compiled, while taking about the same amount of memory (going slightly worse actually, 0.97..1.05..1.08), in the range of 100,000..1,000,000..2,000,000 primes produced. The local asymptotic behavior was about the same, again, in both versions, about O(n^1.20..1.25) - worsening slightly for the merging version (the ratio of run times going 1.25..1.29..1.32). I guess that makes the two versions (almost) operationally equivalent in producing of up to a million primes or two.
participants (5):
- Daniel Fischer
- Emil Axelsson
- Eugene Kirpichov
- Heinrich Apfelmus
- Will Ness