
On Friday, 08 January 2010, 19:45:47, Will Ness wrote:
Daniel Fischer writes:
mergeSP :: Integral a => People a -> People a -> People a
mergeSP p1@(P a _) p2 = P (a ++ vips mrgd) (dorks mrgd)
  where
    mrgd = spMerge (dorks p1) (vips p2) (dorks p2)
    spMerge l1 [] l3 = P [] (merge l1 l3)
    spMerge ~l1@(x:xs) l2@(y:ys) l3 = case compare x y of
        LT -> celebrate x (spMerge xs l2 l3)
        EQ -> celebrate x (spMerge xs ys l3)
        GT -> celebrate y (spMerge l1 ys l3)
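[For reference, mergeSP relies on a few definitions that don't appear in the quote. The following is an assumed reading, not taken verbatim from the thread; names and details may differ in the actual program:

    -- A sorted stream split into a known-good prefix (vips) and the
    -- not-yet-inspected rest (dorks).  Assumed definition:
    data People a = P { vips :: [a], dorks :: [a] }

    -- Prepend a confirmed element to the vips.  Assumed definition:
    celebrate :: a -> People a -> People a
    celebrate x p = P (x : vips p) (dorks p)

    -- Ordered merge of two sorted lists, keeping one copy of duplicates.
    merge :: Ord a => [a] -> [a] -> [a]
    merge xs@(x:xt) ys@(y:yt) = case compare x y of
        LT -> x : merge xt ys
        EQ -> x : merge xt yt
        GT -> y : merge xs yt
    merge xs ys = xs ++ ys
]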
----------------------------------------------------------------------
Hi Daniel,
Is it so that you went back to my fold structure?
Yes.
Was it better for really big numbers of primes?
Yes, it is slower for <= 20 million primes (and some beyond), but faster for >= 50 million (and some before), according to the few tests I made.
I had the following for ages (well, at least two weeks), but I thought it was slower and took more memory (that was _before_ the 'no-share' and 'feeder' stuff). The only difference I can see is that you've re-written spMerge in a tail-recursive style with explicitly deconstructed parts;
It's not tail-recursive, the recursive call is inside a celebrate.

As it turns out, the important things are

1. a feeder and separate lists of multiples for the feeder and the runner, for the reasons detailed earlier (two-step feeding and a larger wheel are pleasing but minor optimisations),

2. a strict pattern in tfold,

3. moving the merge inside spMerge (you can have

       mergeSP (a,b) ~(c,d) = (a ++ bc, m)
         where
           (bc,m) = spMerge b c
           spMerge u [] = ([], merge u d)
           ...

   without exploding memory, but it's *much* slower than letting spMerge take three arguments).

Now, I have to admit, the only point I _really_ understand is 1. (and why the three-argument spMerge is faster than the two-argument one taking the merge-partner from mergeSP's binding :).

Why does

    mergeSP (a,b) ~(c,d) = let (bc,b') = spMerge b c in (a ++ bc, merge b' d)

have a memory leak, but

    mergeSP (a,b) ~(c,d) = let (bc,m) = spMerge' b c d in (a ++ bc, m)

not?

Well, looking at the core for mergeSP, the fog clears somewhat. The former is translated roughly to

    mergeSP (a,b) pair =
        let sm = spMerge b (fst pair)
        in (a ++ fst sm, merge (snd sm) (snd pair))

It holds on to the pair until the result of the merge is demanded, that is, until the entire (a ++ fst sm) is consumed. Only then is the pair released and can be collected. On top of that, as soon as a is consumed and (fst sm) [or bc] is demanded, spMerge forces the evaluation of (fst pair) [c]. After a few levels, the evaluated list will take more space than the thunk. It cannot yet be collected, because pair is still alive. The elements have to be duplicated for (fst sm), because they're intertwined with those of b. On the next level of merging, they have to be duplicated again.

The latter is translated roughly to

    mergeSP (a,b) pair =
        let sm = spMerge' b (fst pair) (snd pair)
        in (a ++ fst sm, snd sm)

The pair is released as soon as the result of the spMerge' is demanded, that is, when a is consumed. Then the elements of (fst pair) need not be duplicated and they can be discarded almost immediately after they have been produced [for small primes; multiples of larger primes live a little longer, but those are fewer].

So: no duplication, shorter lifespans => no leak.

Having seen that, the question is, why can't the compiler see that deconstructing the pair when the first component is needed is better? The first component of the pair is used in no other place, so keeping the pair can't have any advantage, can it?

And why does

    tfold f (a: ~(b: ~(c:xs))) = ...

leak, but not

    tfold f (a:b:c:xs) = ...

?

I guess it's similar.

    tfold f (a: ~(b: ~(c:xs))) = (a `f` (b `f` c)) `f` tfold f ([pairwise f] xs)

is

    tfold f (a:zs) = (a `f` (head zs `f` (head $ tail zs))) `f` tfold f (pairwise f $ drop 2 zs)

and the latter part holds on to the beginning of zs, leading again to data duplication and too-long lifespans.
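[To make point 2 concrete, here is a minimal sketch of the strict-pattern fold. The names tfold and pairwise follow the discussion above, but the second clause of tfold is an assumed fallback; in the sieve the argument is an infinite list of multiple-streams, so only the first clause ever fires:

    -- Strict pattern: the first three cells are demanded up front, so the
    -- recursive call keeps no reference to the beginning of the tail.
    tfold :: (a -> a -> a) -> [a] -> a
    tfold f (a : b : c : xs) = (a `f` (b `f` c)) `f` tfold f (pairwise f xs)
    tfold f xs               = foldr1 f xs   -- assumed fallback, never reached on infinite input

    -- Combine adjacent elements; each pass deepens the fold tree by one level.
    pairwise :: (a -> a -> a) -> [a] -> [a]
    pairwise f (x : y : ys) = f x y : pairwise f ys
    pairwise _ xs           = xs
]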
Mine was relying on the compiler (8-L) to de-couple the two pipes and recognize that the second just passes along the final result, unchanged.
The two versions seem to me to be _exactly_ operationally equivalent.
Well, they're not. The main difference is what we told the compiler _when_ to deconstruct the pairs. You said "at the latest possible moment", I said "as soon as we need the first component". The difference isn't entirely obvious, but, as is frequently said, understanding the space (and time) behaviour of lazy evaluation isn't always easy.
All this searching for the code better understood by the compiler is _*very*_ frustrating, as it doesn't reflect on the semantics of the code, or even the operational semantics of the code. :-[
Weren't the P's supposed to disappear completely in the compiled code?
No. Constructors for data types can't disappear completely, only newtype constructors can (and do).
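[A small illustration of that distinction, standard Haskell rather than anything from the thread:

    -- N is erased after type checking: an (N x) has the same runtime
    -- representation as x, and matching on N forces nothing.
    newtype N a = N a

    -- D is a real heap constructor: a (D x) is an extra box around x,
    -- and matching on D forces the value to weak head normal form.
    data D a = D a
]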
Aren't types just _behavioral_ definitions??? Aren't we supposed to be able to substitute equals for equals, dammit??
We can. But substituting equals for equals can alter space and time complexity.

    fromInteger 0 == fromInteger (let xs = [1 .. 10^8] in (product xs + sum xs) `mod` 10)

Both sides denote 0, so the equation holds, but evaluating the right-hand side takes an enormous amount of time and memory.
Is this the state of our _best_ Haskell compiler????
Yes. It's still a "do what I tell you to" compiler, even if a pretty slick one, not a "do what I mean" compiler. Sometimes, what you tell the compiler isn't what you wanted. It's easier to predict when you give detailed step-by-step instructions.