V.I.P.s and the associativity of merge'

Ok, after mulling over the issues that Will Ness has brought up in the last few days [1], I think I have a partial explanation for the apparent tension between Will's observations and Heinrich Apfelmus's Implicit Heaps article [2], which both concern the implementation of mergeAll [3]. The merge' function takes two ordered lists, with the head of the first list less than the head of the second, and merges their contents: merge' [] ys = ys merge' (x:xs) ys = x : merge xs ys The nice thing about this function is we can merge an infinite number of lists by folding right, if assume that the heads of those lists are appropriately ordered. This appears in Richard Bird's code at the end of Melissa O'Neill's "Genuine Sieve of Eratosthenes", though undoubtedly this observation has been independently made by many people. Now, in an ordinary sense, merge' *is* an associative operator, in that given three fully defined ordered lists with ordered heads, merge' xs (merge' ys zs) == merge' (merge' xs ys) zs. This allows us to merge an infinite number of lists using an arbitrary tree of merge' operations, without ever worrying that we will return an incorrect result. (However, we might get stuck in a non-productive loop, for example if we fold left over an infinite list of ordered lists) However, Heinrich's article uses a stronger sense of associativity which includes laziness properties and thus captures some operational characteristics. In this sense, merge' *is not* associative. To remedy this, Heinrich uses VIPs to strengthen the associativity property of merge'. This raises the question, is there some combination of the shape of the merge' tree and some input for which using VIPs dramatically changes the efficiency of a mergeAll operation? I suspect the answer is yes, though I don't know for sure at this point in time. However, I do tacitly believe that the current tree that mergeAll uses doesn't exhibit this property for any input, and so I have simplified the implementations of mergeAll and unionAll in the latest version of data-ordlist-0.4.4 by avoiding the use of VIPs. This has the nice side benefit of modestly improving performance when the elements of the result are highly biased towards the first list. Best, Leon [1] http://permalink.gmane.org/gmane.comp.lang.haskell.cafe/84666 [2] http://apfelmus.nfshost.com/articles/implicit-heaps.html [3] http://hackage.haskell.org/packages/archive/data-ordlist/0.4.4/doc/html/Data...

Leon Smith wrote:
Ok, after mulling over the issues that Will Ness has brought up in the last few days [1], I think I have a partial explanation for the apparent tension between Will's observations and Heinrich Apfelmus's Implicit Heaps article [2], which both concern the implementation of mergeAll [3].
[1] http://permalink.gmane.org/gmane.comp.lang.haskell.cafe/84666 [2] http://apfelmus.nfshost.com/articles/implicit-heaps.html [3] http://hackage.haskell.org/packages/archive/data-ordlist/0.4.4/doc/html/Data...
[...]
This raises the question, is there some combination of the shape of the merge' tree and some input for which using VIPs dramatically changes the efficiency of a mergeAll operation? I suspect the answer is yes, though I don't know for sure at this point in time.
However, I do tacitly believe that the current tree that mergeAll uses doesn't exhibit this property for any input, and so I have simplified the implementations of mergeAll and unionAll in the latest version of data-ordlist-0.4.4 by avoiding the use of VIPs. This has the nice side benefit of modestly improving performance when the elements of the result are highly biased towards the first list.
Will Ness
For those who remember the discussion about this about a year ago, it turns out there was a simpler version after all lurking somewhere in there (or is it _out_?).
primes = 2: primes' where primes' = 3: 5: [7,9..] `minus` tfold [ [p*p,p*p+2*p..] | p <- primes' ] tfold ((x:xs):t) = x : xs `union` tfold (pairs t) pairs ((x:xs):ys:t) = (x: union xs ys) : pairs t
Unfortunately, it turns out that this program is clear, shorter ... and subtly wrong. :) Here an example where the VIP merge would give a different result bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : error "bad" We have ghci> bad [1,2*** Exception: bad but the VIP version would give at least ghci> bad [1,2,3,4*** Exception: bad / Prelude: undefined In other words, this new program already tries to compare the number 3 to the fourth list when it is still clear that only the first three lists are relevant. Note that this doesn't necessarily mean that the program does not work for prime numbers, but *proving* correctness is now considerably more difficult because you need estimates about the growth and distribution of prime numbers. The VIP version always works as long as there are infinitely many primes. Also, since unnecessary comparisons are performed, it is no longer clear whether the time and space complexity stays the same. (Which is not as bad as it sounds, since we didn't know them well in the first place anyway). More worryingly, changing the tree shape now affects correctness. Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Heinrich Apfelmus
Leon Smith wrote:
[1] http://permalink.gmane.org/gmane.comp.lang.haskell.cafe/84666 [2] http://apfelmus.nfshost.com/articles/implicit-heaps.html [3]
http://hackage.haskell.org/packages/archive/data-ordlist/0.4.4/doc/html/Data...
List-Ordered.html#v:mergeAll
Will Ness
primes = 2: primes' where primes' = 3: 5: [7,9..] `minus` tfold [ [p*p,p*p+2*p..] | p <- primes' ] tfold ((x:xs):t) = x : xs `union` tfold (pairs t) pairs ((x:xs):ys:t) = (x: union xs ys) : pairs t
Unfortunately, it turns out that this program is clear, shorter ... and subtly wrong. :)
Here an example where the VIP merge would give a different result
bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : error "bad"
We have
ghci> bad [1,2*** Exception: bad
but the VIP version would give at least
ghci> bad [1,2,3,4*** Exception: bad / Prelude: undefined
The reason to *not* have the "lazy" patterns in foldTree for primes, as Daniel Fischer discovered back then, is that they give it a space leak. Case in point - http://ideone.com/DLHp2 : ------------- 1M primes: ---- 2M primes: --- 3M: --- ideone #: - no-VIPs: "smart" fold: 1.90s- 4.8MB 4.42s- 4.8MB 7.40s- 4.8MB r3bdL - VIPs: "smart" fold: 1.95s- 4.8MB 4.53s- 4.8MB 7.45s- 4.8MB 4ACpe simple fold: 2.04s- 4.8MB 4.76s- 4.8MB 7.86s- 4.8MB av9XR "lazy" pats: 2.44s-20.1MB 5.70s-21.1MB 9.85s-42.6MB DLHp2 Also, having tfold ((x:xs):t) = x : xs `merge` tfold (pairs t) where pairs ((x:xs):ys:t) = (x : merge xs ys) : pairs t hfold xs = serve . foldTree mergeP . map vip $ xs hfold' xs = serve . foldTree' mergeP . map vip $ xs foldTree f ~(x:xs) = x `f` foldTree f (pairs xs) where pairs ~(x: ~(y:ys)) = f x y : pairs ys foldTree' f (x:xs) = x `f` foldTree' f (pairs xs) where pairs (x: (y:ys)) = f x y : pairs ys and bad = (1:10:error "1") : (2:3:5:error "2") : (4:error "4") : error "bad" bad2 = (1:10:error "1") : (2:3:5:error "2") : (4:error "4") : (5:error "5") : (6:error "6") : (7:error "7") : error "bad2" we get *Main> hfold bad [1,2,3,4*** Exception: bad *Main> hfold' bad [1,2,3,4*** Exception: bad *Main> tfold bad [1,2*** Exception: bad *Main> hfold bad2 [1,2,3,4*** Exception: 4 *Main> hfold' bad2 [1,2,3,4*** Exception: bad2 *Main> tfold bad2 [1,2*** Exception: bad2 so hfold' too appears to be over-eager to-the-right, although still more productive than tfold.
Regards, Heinrich Apfelmus

Will Ness wrote:
Heinrich Apfelmus writes:
Here an example where the VIP merge would give a different result
bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : error "bad" We have ghci> bad [1,2*** Exception: bad but the VIP version would give at least ghci> bad [1,2,3,4*** Exception: bad / Prelude: undefined
The reason to *not* have the "lazy" patterns in foldTree for primes, as Daniel Fischer discovered back then, is that they give it a space leak. Case in point - http://ideone.com/DLHp2 :
[...]
tfold ((x:xs):t) = x : xs `merge` tfold (pairs t) where pairs ((x:xs):ys:t) = (x : merge xs ys) : pairs t
hfold xs = serve . foldTree mergeP . map vip $ xs hfold' xs = serve . foldTree' mergeP . map vip $ xs
foldTree f ~(x:xs) = x `f` foldTree f (pairs xs) where pairs ~(x: ~(y:ys)) = f x y : pairs ys
foldTree' f (x:xs) = x `f` foldTree' f (pairs xs) where pairs (x: (y:ys)) = f x y : pairs ys
[...]
so hfold' too appears to be over-eager to-the-right, although still more productive than tfold.
Leon Smith wrote:
At a glance, it appears to have to do with the irrefutable patterns that treefold uses. At the moment, it's not obvious to me how your example could be made to "work" without either giving up the ability to work correctly on finite cases, or giving up the tree and going back to foldr merge'. ;-)
[...]
Also, is there other examples that could distinguish a VIP implementation that works on finite cases, and a no-VIP implementation? Is it possible to have a VIP example that works on finite cases and is maximally lazy in this example?
Ah, the lazy patterns in foldTree are a different issue, sorry for my bad choice of example. While I still don't understand the trade-off that turns the lazy patterns into a space leak, there is no harm in allowing foldTree to see the complete spine of the list. What we do want to forbid is looking at the elements of that list too early. In other words, the example should read bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : repeat (error "bad" : undefined) i.e. the previously unknown tail is replaced with an infinite list of undefined elements. This example can properly distinguish between the not-so-correct tfold and proper VIP implementations (or other implementations that don't do unnecessary comparisons). Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Will Ness wrote:
Heinrich Apfelmus writes:
Here an example where the VIP merge would give a different result
bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : error "bad" We have ghci> bad [1,2*** Exception: bad but the VIP version would give at least ghci> bad [1,2,3,4*** Exception: bad / Prelude: undefined
The reason to *not* have the "lazy" patterns in foldTree for primes, as Daniel Fischer discovered back then, is that they give it a space leak. Case in point - http://ideone.com/DLHp2 :
[...]
tfold ((x:xs):t) = x : xs `merge` tfold (pairs t) where pairs ((x:xs):ys:t) = (x : merge xs ys) : pairs t
hfold xs = serve . foldTree mergeP . map vip $ xs hfold' xs = serve . foldTree' mergeP . map vip $ xs
foldTree f ~(x:xs) = x `f` foldTree f (pairs xs) where pairs ~(x: ~(y:ys)) = f x y : pairs ys
foldTree' f (x:xs) = x `f` foldTree' f (pairs xs) where pairs (x: (y:ys)) = f x y : pairs ys
[...]
so hfold' too appears to be over-eager to-the-right, although still more productive than tfold.
Leon Smith wrote:
At a glance, it appears to have to do with the irrefutable patterns that treefold uses. At the moment, it's not obvious to me how your example could be made to "work" without either giving up the ability to work correctly on finite cases, or giving up the tree and going back to foldr merge'. ;-)
[...]
Also, is there other examples that could distinguish a VIP implementation that works on finite cases, and a no-VIP implementation? Is it possible to have a VIP example that works on finite cases and is maximally lazy in this example?
Ah, the lazy patterns in foldTree are a different issue, sorry for my bad choice of example. While I still don't understand the trade-off that turns the lazy patterns into a space leak, there is no harm in allowing foldTree to see the complete spine of the list. What we do want to forbid is looking at the elements of that list too early. In other words, the example should read bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : repeat (error "bad" : undefined) i.e. the previously unknown tail is replaced with an infinite list of undefined elements. This example can properly distinguish between the not-so-correct tfold and proper VIP implementations (or other implementations that don't do unnecessary comparisons). Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Heinrich Apfelmus
Will Ness wrote:
Heinrich Apfelmus writes:
Here an example where the VIP merge would give a different result
bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : error "bad"
The reason to *not* have the "lazy" patterns in foldTree for primes, as
Fischer discovered back then, is that they give it a space leak. Case in
Daniel point -
[...]
tfold ((x:xs):t) = x : xs `merge` tfold (pairs t) where pairs ((x:xs):ys:t) = (x : merge xs ys) : pairs t
hfold xs = serve . foldTree mergeP . map vip $ xs hfold' xs = serve . foldTree' mergeP . map vip $ xs
foldTree f ~(x:xs) = x `f` foldTree f (pairs xs) where pairs ~(x: ~(y:ys)) = f x y : pairs ys
foldTree' f (x:xs) = x `f` foldTree' f (pairs xs) where pairs (x: (y:ys)) = f x y : pairs ys
[...]
so hfold' too appears to be over-eager to-the-right, although still more productive than tfold.
Ah, the lazy patterns in foldTree are a different issue, sorry for my bad choice of example.
While I still don't understand the trade-off that turns the lazy patterns into a space leak, there is no harm in allowing foldTree to see the complete spine of the list. What we do want to forbid is looking at the elements of that list too early.
This turns out to be too optimistic a demand on data, in general.
In other words, the example should read
bad = tfold $ (1:10:undefined) : (2:3:5:undefined) : (4:undefined) : repeat (error "bad" : undefined)
i.e. the previously unknown tail is replaced with an infinite list of undefined elements. This example can properly distinguish between the not-so-correct tfold and proper VIP implementations (or other implementations that don't do unnecessary comparisons).
will have to think this over, but in the mean time, they *both* turn out to be *completely* and utterly *wrong* :) in a general case (although probably for different reasons). Here's where: *Main> mapM_ print $ take 5 $ map (take 10) [concatMap (replicate 3) [n,n+1..]|n<-[1..]] [1,1,1,2,2,2,3,3,3,4] [2,2,2,3,3,3,4,4,4,5] [3,3,3,4,4,4,5,5,5,6] [4,4,4,5,5,5,6,6,6,7] [5,5,5,6,6,6,7,7,7,8] *Main> take 20 $ hfold [concatMap (replicate 3) [n,n+1..]|n<-[1..]] [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7] *Main> take 20 $ tfold [concatMap (replicate 3) [n,n+1..]|n<-[1..]] [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6,7,7] when it should'a been [1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4] Cheers, :)
Regards, Heinrich Apfelmus

Will Ness
... they *both* turn out to be *completely* and utterly *wrong* :) in a general case (although probably for different reasons).
Sorry, my bad. Thought in terms of "merge", but the definiton used in VIP code was really an "union". When definition was changed to a real "merge", non-removing of duplicates, everything was as expected in that case, for both versions.
*Main> take 20 $ hfold [concatMap (replicate 3) [n,n+1..]|n<-[1..]] *Main> take 20 $ tfold [concatMap (replicate 3) [n,n+1..]|n<-[1..]]
[1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4]
participants (3)
-
Heinrich Apfelmus
-
Leon Smith
-
Will Ness