
With the urging and assistance of Omar Antolín Camarena, I will be adding two functions to data-ordlist: mergeAll and unionAll, which merge (or union) a potentially infinite list of potentially infinite ordered lists, under the assumption that the heads of the non-empty lists appear in a non-decreasing sequence. Union takes two sorted lists and produces a new sorted list; an element occurs in the result as many times as the maximum number of occurrences in either list. The unionAll function generalizes this behavior to an infinite number of lists. A reasonable implementation of mergeAll is:
import Data.List.Ordered(merge, union)
mergeAll :: Ord a => [[a]] -> [a] mergeAll = foldr (\(x:xs) ys -> x : merge xs ys) []
However, for many inputs, we can do better; the library implementation of mergeAll is based on H. Apfelmus's article "Implicit Heaps", which presents a simplification of Dave Bayer's "venturi" algorithm. The difference is that the foldr version uses a line of comparisons, whereas "venturi" uses a tree of comparisons. http://apfelmus.nfshost.com/articles/implicit-heaps.html http://www.mail-archive.com/haskell-cafe@haskell.org/msg27612.html However, as Omar pointed out to me, the following implementation of unionAll has a flaw:
unionAll :: Ord a => [[a]] -> [a] unionAll = foldr (\(x:xs) ys -> x : union xs ys) []
Namely unionAll [[1,2],[1,2]] should return [1,2], whereas it actually returns [1,1,2]. After some work, I believe I have generalized H. Apfelmus's algorithm to handle this; however it seems a bit complicated. I would love feedback, especially with regard to simplifications, bugs, testing strategies, and optimizations:
unionAll' :: Ord a => [[a]] -> [a] unionAll' = unionAllBy compare
data People a = VIP a (People a) | Crowd [a]
unionAllBy :: (a -> a -> Ordering) -> [[a]] -> [a] unionAllBy cmp xss = loop [ (VIP x (Crowd xs)) | (x:xs) <- xss ] where loop [] = [] loop ( VIP x xs : VIP y ys : xss ) = case cmp x y of LT -> x : loop ( xs : VIP y ys : xss ) EQ -> loop ( VIP x (union' xs ys) : unionPairs xss ) GT -> error "Data.List.Ordered.unionAll: assumption violated!" loop ( VIP x xs : xss ) = x : loop (xs:xss) loop [Crowd xs] = xs loop (xs:xss) = loop (unionPairs (xs:xss))
unionPairs [] = [] unionPairs [x] = [x] unionPairs (x:y:zs) = union' x y : unionPairs zs
union' (VIP x xs) (VIP y ys) = case cmp x y of LT -> VIP x (union' xs (VIP y ys)) EQ -> VIP x (union' xs ys) GT -> error "Data.List.Ordered.unionAll: assumption violated!" union' (VIP x xs) (Crowd ys) = VIP x (union' xs (Crowd ys)) union' (Crowd []) ys = ys union' (Crowd xs) (Crowd ys) = Crowd (unionBy cmp xs ys) union' xs@(Crowd (x:xt)) ys@(VIP y yt) = case cmp x y of LT -> VIP x (union' (Crowd xt) ys) EQ -> VIP x (union' (Crowd xt) yt) GT -> VIP y (union' xs yt)
-- Leon

Leon Smith wrote:
With the urging and assistance of Omar Antolín Camarena, I will be adding two functions to data-ordlist: mergeAll and unionAll, which merge (or union) a potentially infinite list of potentially infinite ordered lists, under the assumption that the heads of the non-empty lists appear in a non-decreasing sequence.
However, as Omar pointed out to me, the following implementation of unionAll has a flaw:
unionAll :: Ord a => [[a]] -> [a] unionAll = foldr (\(x:xs) ys -> x : union xs ys) []
Namely unionAll [[1,2],[1,2]] should return [1,2], whereas it actually returns [1,1,2]. After some work, I believe I have generalized H. Apfelmus's algorithm to handle this; however it seems a bit complicated. I would love feedback, especially with regard to simplifications, bugs, testing strategies, and optimizations:
unionAll' :: Ord a => [[a]] -> [a] unionAll' = unionAllBy compare
data People a = VIP a (People a) | Crowd [a]
unionAllBy :: (a -> a -> Ordering) -> [[a]] -> [a] unionAllBy cmp xss = loop [ (VIP x (Crowd xs)) | (x:xs) <- xss ] where loop [] = [] loop ( VIP x xs : VIP y ys : xss ) = case cmp x y of LT -> x : loop ( xs : VIP y ys : xss ) EQ -> loop ( VIP x (union' xs ys) : unionPairs xss ) GT -> error "Data.List.Ordered.unionAll: assumption violated!" loop ( VIP x xs : xss ) = x : loop (xs:xss) loop [Crowd xs] = xs loop (xs:xss) = loop (unionPairs (xs:xss))
unionPairs [] = [] unionPairs [x] = [x] unionPairs (x:y:zs) = union' x y : unionPairs zs
union' (VIP x xs) (VIP y ys) = case cmp x y of LT -> VIP x (union' xs (VIP y ys)) EQ -> VIP x (union' xs ys) GT -> error "Data.List.Ordered.unionAll: assumption violated!" union' (VIP x xs) (Crowd ys) = VIP x (union' xs (Crowd ys)) union' (Crowd []) ys = ys union' (Crowd xs) (Crowd ys) = Crowd (unionBy cmp xs ys) union' xs@(Crowd (x:xt)) ys@(VIP y yt) = case cmp x y of LT -> VIP x (union' (Crowd xt) ys) EQ -> VIP x (union' (Crowd xt) yt) GT -> VIP y (union' xs yt)
I see no obvious deficiencies. :) Personally, I'd probably structure it like http://www.haskell.org/haskellwiki/Prime_numbers#Implicit_Heap so that your code becomes unionAll = serve . foldTree union' . map vip Your loop function is a strange melange of many different concerns (building a tree, union', adding and removing the VIP constructors). Note that it's currently unclear to me whether the lazy pattern match in pairs ~(x: ~(y:ys)) = f x y : pairs ys is beneficial or not; you used a strict one unionPairs (x:y:zs) = union' x y : unionPairs zs Daniel Fischer's experiments suggest that the strict one is better http://www.mail-archive.com/haskell-cafe@haskell.org/msg69807.html If you're really concerned about time & space usage, it might even be worth to abandon the lazy tree altogether and use a heap to achieve the same effect, similar to Melissa O'Neils prime number code. It's not as "neat", but much more predictable. :) Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

I see no obvious deficiencies. :) Personally, I'd probably structure it like
http://www.haskell.org/haskellwiki/Prime_numbers#Implicit_Heap
This variant, based on the wiki article, is cleaner, slightly simpler, appears to be just as fast, and allocates slightly less memory:
import GHC.Exts(inline) import Data.List.Ordered(unionBy)
union' :: People Int -> People Int -> People Int union' (VIP x xt) ys = VIP x (union' xt ys) union' (Crowd xs) (Crowd ys) = Crowd (inline unionBy compare xs ys) union' xs@(Crowd (x:xt)) ys@(VIP y yt) = case compare x y of LT -> VIP x (union' (Crowd xt) ys) EQ -> VIP x (union' (Crowd xt) yt) GT -> VIP y (union' xs yt)
foldTree :: (a -> a -> a) -> [a] -> a foldTree f xs = case xs of [] -> [] xs -> loop xs where loop [x] = x loop (x:xs) = x `f` loop (pairs xs)
pairs (x:y:ys) = f x y : pairs ys pairs xs = xs
unions xss = serve $ inline foldTree union' [ VIP x (Crowd xs) | (x:xs) <- xss ] where serve (VIP x xs) = x:serve xs serve (Crowd xs) = xs
One of the differences is that I started with a slightly different "foldTree", one that was taken directly from Data.List.sort. The only problem is that it has the same problem as I mentioned: unionAll [[1,2],[1,2]] == [1,1,2] whereas unionAll is intended to be a generalization of "foldr union []" to an infinite number of lists, and should thus return [1,2]. But I should be able to fix this without much difficulty.
Your loop function is a strange melange of many different concerns (building a tree, union', adding and removing the VIP constructors).
Note that it's currently unclear to me whether the lazy pattern match in
pairs ~(x: ~(y:ys)) = f x y : pairs ys
is beneficial or not; you used a strict one
unionPairs (x:y:zs) = union' x y : unionPairs zs
Well, as the library implementation must work on finite cases as well, the lazy pattern seems out of the question.
If you're really concerned about time & space usage, it might even be worth to abandon the lazy tree altogether and use a heap to achieve the same effect, similar to Melissa O'Neils prime number code. It's not as "neat", but much more predictable. :)
Well, it is intended as a high quality, generally useful implementation, so of course I care about time and space usage. :) Dave Bayer's original algorithm does slightly better, but was much larger in terms of both source code and object size. Omar implemented something along these lines, but it didn't perform so well. I did not dig into the reasons why, though; it might not have had anything to do with the fact an explicit heap was used. Incidentally, I tried implementing something like implicit heaps once upon a time; but it had a severe performance problem, taking a few minutes to produce 20-30 elements. I didn't have a pressing reason to figure out why though, and didn't pursue it further. Best, Leon

Leon Smith wrote:
Heinrich Apfelmus wrote:
I see no obvious deficiencies. :) Personally, I'd probably structure it like
http://www.haskell.org/haskellwiki/Prime_numbers#Implicit_Heap
This variant, based on the wiki article, is cleaner, slightly simpler, appears to be just as fast, and allocates slightly less memory:
import GHC.Exts(inline) import Data.List.Ordered(unionBy)
union' :: People Int -> People Int -> People Int union' (VIP x xt) ys = VIP x (union' xt ys) union' (Crowd xs) (Crowd ys) = Crowd (inline unionBy compare xs ys) union' xs@(Crowd (x:xt)) ys@(VIP y yt) = case compare x y of LT -> VIP x (union' (Crowd xt) ys) EQ -> VIP x (union' (Crowd xt) yt) GT -> VIP y (union' xs yt)
foldTree :: (a -> a -> a) -> [a] -> a foldTree f xs = case xs of [] -> [] xs -> loop xs where loop [x] = x loop (x:xs) = x `f` loop (pairs xs)
pairs (x:y:ys) = f x y : pairs ys pairs xs = xs
unions xss = serve $ inline foldTree union' [ VIP x (Crowd xs) | (x:xs) <- xss ] where serve (VIP x xs) = x:serve xs serve (Crowd xs) = xs
One of the differences is that I started with a slightly different "foldTree", one that was taken directly from Data.List.sort.
The only problem is that it has the same problem as I mentioned:
unionAll [[1,2],[1,2]] == [1,1,2]
whereas unionAll is intended to be a generalization of "foldr union []" to an infinite number of lists, and should thus return [1,2]. But I should be able to fix this without much difficulty.
Ah, I meant to use the union' from your previous message, but I think that doesn't work because it doesn't have the crucial property that the case union (VIP x xs) ys = ... does not pattern match on the second argument. The easiest solution is simply to define unionAll = nub . mergeAll where -- specialized definition of nub nub = map head . groupBy (==) But you're probably concerned that filtering for duplicates afterwards will be less efficient. After all, the (implicit) tree built by mergeAll might needlessly compare a lot of equal elements. Fortunately, it is straightforward to fuse nub into the tree merging: nub . serve . foldTree union' = serve . nubP . foldTree union' = serve . foldTree (nub' . union') with appropriate definitions of nubP and nub' . In particular, the definition -- remove duplicate VIPs nub' (Crowd xs) = Crowd xs nub' (VIP x xs) = VIP x (guard x xs) where guard x (VIP y ys) | x == y = nub' ys | otherwise = VIP y (guard y ys) guard x (Crowd (y:ys)) | x == y = Crowd ys | otherwise = Crowd (y:ys) takes advantage of the facts that * the left and right arguments of union' can now be assumed to not contain duplicates * crowds do not contain duplicates thanks to the call to unionBy Whether nub' saves more comparisons than it introduces is another question. If you want, you can probably fuse nub' and union' as well, but I guess the result won't be pretty.
Incidentally, I tried implementing something like implicit heaps once upon a time; but it had a severe performance problem, taking a few minutes to produce 20-30 elements. I didn't have a pressing reason to figure out why though, and didn't pursue it further.
Yeah, they're tricky to get right. One pattern match too strict and it's sucked into a black hole, two pattern matches too lazy and it will leak space like the big bang. :) Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

The easiest solution is simply to define
unionAll = nub . mergeAll where -- specialized definition of nub nub = map head . groupBy (==)
Talking about the easiest solution, I guess this is a quite easy way of defining unionAll as well: http://gist.github.com/306782 I, of course, do not claim that it is more efficient or better. But I don't think it'd be rubbish :) -- Ozgur Akgun

Am Mittwoch 17 Februar 2010 17:46:38 schrieb Ozgur Akgun:
The easiest solution is simply to define
unionAll = nub . mergeAll where -- specialized definition of nub nub = map head . groupBy (==)
Talking about the easiest solution, I guess this is a quite easy way of defining unionAll as well: http://gist.github.com/306782 I, of course, do not claim that it is more efficient or better. But I don't think it'd be rubbish :)
let next = minimum (map head xs') doesn't work if you have infinitely many lists :(

Ooops I thought the inner lists are possibly of infinite size.
On 17 February 2010 17:16, Daniel Fischer
Am Mittwoch 17 Februar 2010 17:46:38 schrieb Ozgur Akgun:
The easiest solution is simply to define
unionAll = nub . mergeAll where -- specialized definition of nub nub = map head . groupBy (==)
Talking about the easiest solution, I guess this is a quite easy way of defining unionAll as well: http://gist.github.com/306782 I, of course, do not claim that it is more efficient or better. But I don't think it'd be rubbish :)
let next = minimum (map head xs')
doesn't work if you have infinitely many lists :( _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
-- Ozgur Akgun

On Wed, Feb 17, 2010 at 6:58 AM, Heinrich Apfelmus
Ah, I meant to use the union' from your previous message, but I think that doesn't work because it doesn't have the crucial property that the case
union (VIP x xs) ys = ...
does not pattern match on the second argument.
Ahh yes, the funny thing is that I tested the code in my previous message, and it worked in the infinite case. Then I replaced the union' to pattern match on the second argument as well, and tested it on only finite cases, and then released it. Thus, unionAll in data-ordlist-0.4.1 doesn't work on an infinite number of lists. So my original unionAll in data-ordlist-0.4 appears to work ok, my revised and simplified unionAll doesn't work at all.
The easiest solution is simply to define
unionAll = nub . mergeAll where -- specialized definition of nub nub = map head . groupBy (==)
Incidentally, data-ordlist has a (slightly different) version of nub that does exactly what you want in this particular case. Check out the documentation for "nub" and "nubBy"
But you're probably concerned that filtering for duplicates afterwards will be less efficient. After all, the (implicit) tree built by mergeAll might needlessly compare a lot of equal elements.
Well, yes and no. Efficiency is good, but this implementation does
not match my intention. For example:
unionAll [[1,1,2,2,2],[1,1,1,2]] == foldr union [] [...] == [1,1,1,2,2,2]
The "union" function preserves strictly ascending lists, but it also
works on multisets as well, returning an element as many times as the
maximum number of times in either list. Thus, on an infinite number
of lists, unionAll should return a particular element as many times
as the maximum number of times it appears in any single list.
On Wed, Feb 17, 2010 at 1:18 PM, Daniel Fischer
Am Mittwoch 17 Februar 2010 18:59:42 schrieb Ozgur Akgun:
Ooops I thought the inner lists are possibly of infinite size.
Both, I think.
Yes, both the inner and outer lists of an input to unionAll might be infinite. It's just that foldr union [] works fine if the inner lists are infinite, but gets stuck in an infinite non-productive list if the outer list is infinite. Best, Leon

On Wed, Feb 17, 2010 at 6:58 AM, Heinrich Apfelmus
Ah, I meant to use the union' from your previous message, but I think that doesn't work because it doesn't have the crucial property that the case
union (VIP x xs) ys = ...
does not pattern match on the second argument.
Ahh yes, my original union' has a bit that looks like this union' (VIP x xs) (VIP y ys) = case cmp x y of LT -> VIP x (union' xs (VIP y ys)) EQ -> VIP x (union' xs ys) GT -> error "Data.List.Ordered.unionAll: assumption violated!" union' (VIP x xs) (Crowd ys) = VIP x (union' xs (Crowd ys)) For whatever reason, this works in the case of an infinite number of lists with my original version, but not the simplified version. By applying a standard transformation to make this lazier, we can rewrite these clauses as union' (VIP x xs) ys = VIP x $ case ys of Crowd _ -> union' xs ys VIP y yt -> case cmp x y of LT -> union' xs ys EQ -> union' xs yt GT -> error msg In the original case, we have this strictness property union' (VIP x xs) ⊥ == ⊥ The revised verison is a bit lazier: union' (VIP x xs) ⊥ == VIP x ⊥ And so the simplified unionAll now works again on an infinite number of lists. I've uploaded data-ordlist-0.4.2 to fix the bug introduced with data-ordlist-0.4.1, and added a regression test to the suite. Best, Leon

By purest coincidence I just wrote the exact same function (the simple mergeAll', not the VIP one). Well, extensionally the same... intensionally mine is 32 complicated lines and equivalent to the 3 line mergeAll'. I even thought of short solution by thinking that pulling the first element destroys the ascending lists property so it's equivalent to a normal sorted merge after that, and have no idea why I didn't just write it that way. Anyway, I'm dropping mine and downloading data-ordlist. Thanks for the library *and* the learning experience :)

BTW, I notice that your merges, like mine, are left-biased. This is a useful property (my callers require it), and doesn't seem to cost anything to implement, so maybe you could commit to it in the documentation? By left-biased I mean that when elements compare equal, pick the leftmost one, e.g. "mergeOn fst [(0, 'a')] [(0, 'b')] == [(0, 'a'), (0, 'b')]". And BTW again, here's something I've occasionally found useful: -- | Handy to merge or sort a descending list. reverse_compare :: (Ord a) => a -> a -> Ordering reverse_compare a b = case compare a b of LT -> GT EQ -> EQ GT -> LT

On Thu, Feb 18, 2010 at 8:07 AM, Evan Laforge
And BTW again, here's something I've occasionally found useful:
-- | Handy to merge or sort a descending list. reverse_compare :: (Ord a) => a -> a -> Ordering reverse_compare a b = case compare a b of LT -> GT EQ -> EQ GT -> LT
I wondered why there wasn't one of these in the standard library until someone pointed out to me that reverse_compare = flip compare which actually takes fewer characters to type :P

On Thu, Feb 18, 2010 at 3:07 AM, Evan Laforge
BTW, I notice that your merges, like mine, are left-biased. This is a useful property (my callers require it), and doesn't seem to cost anything to implement, so maybe you could commit to it in the documentation?
Also, I did briefly consider giving up left bias. GHC has an optimization strategy that seeks to reduce pattern matching, and due to interactions with this I could have saved a few kilobytes of -O2 object code size by giving up left-bias. For example: module MergeLeft where mergeBy :: (a -> a -> Ordering) -> [a] -> [a] -> [a] mergeBy cmp = loop where loop [] ys = ys loop xs [] = xs loop (x:xs) (y:ys) = case cmp x y of GT -> y : loop (x:xs) ys _ -> x : loop xs (y:ys) compiles ghc-6.12.1 -O2 to a 4208 byte object file for x64 ELF. By changing the very last line to: _ -> x : loop (y:ys) xs I get a 3336 byte object file instead, but of course this is no longer left- (or right-) biased. Repeat this strategy across the entire module, and you can save 3 kilobytes or so. However, in today's modern computing environment, left-bias is clearly a greater benefit to more people. If you are curious why, I suggest taking a look at GHC's core output for each of these two variants. The hackage package "ghc-core" makes this a little bit more pleasant, as it can pretty-print it for you. It's amazing to think that this library, at 55k (Optimized -O2 for x64), would take up most of the memory of my very first computer, a Commodore 64. Of course, I'm sure there are many others on this list who's first computers had a small fraction of 64k of memory to play with. :-)

On Thu, Feb 18, 2010 at 5:22 PM, Leon Smith
On Thu, Feb 18, 2010 at 3:07 AM, Evan Laforge
wrote: BTW, I notice that your merges, like mine, are left-biased. This is a useful property (my callers require it), and doesn't seem to cost anything to implement, so maybe you could commit to it in the documentation?
Ohh, I see it now, I just wasn't looking at the module doc.
Also, I did briefly consider giving up left bias. GHC has an optimization strategy that seeks to reduce pattern matching, and due to interactions with this I could have saved a few kilobytes of -O2 object code size by giving up left-bias.
Interesting... but left bias is so useful I think it's worth a few extra k.
If you are curious why, I suggest taking a look at GHC's core output for each of these two variants. The hackage package "ghc-core" makes this a little bit more pleasant, as it can pretty-print it for you.
I can see there's one extra case in the first one, and I can tell the last case is the 'loop' case including the case on Ordering, but I admit I don't understand what the previous cases are doing. Core is really hard for me to read.
It's amazing to think that this library, at 55k (Optimized -O2 for x64), would take up most of the memory of my very first computer, a Commodore 64. Of course, I'm sure there are many others on this list who's first computers had a small fraction of 64k of memory to play with. :-)
It's not even that much assembly. I intended to write a small quick program... then I did it in haskell, and then I linked in the GHC API (fatal blow). Now the stripped optimized binary is 22MB (optimization doesn't seem to have an effect on size). The non-haskell UI part is 367k...

On Thu, Feb 18, 2010 at 2:32 AM, Evan Laforge
By purest coincidence I just wrote the exact same function (the simple mergeAll', not the VIP one). Well, extensionally the same... intensionally mine is 32 complicated lines and equivalent to the 3 line mergeAll'. I even thought of short solution by thinking that pulling the first element destroys the ascending lists property so it's equivalent to a normal sorted merge after that, and have no idea why I didn't just write it that way.
Well, the three line version wasn't my first implementation, by any stretch of the imagination. I know I had tried to implement mergeAll at least once, if not two or three times before coming up with the foldr-based implementation. However, I can't find any of them; they may well be lost to the sands of time. Incidentally, that implementation also appears in Melissa O'Neill's "Genuine Sieve of Eratosthenes", in an alternate prime sieve by Richard Bird that appears at the end.
Anyway, I'm dropping mine and downloading data-ordlist. Thanks for the library *and* the learning experience :)
Thanks!
BTW, I notice that your merges, like mine, are left-biased. This is a useful property (my callers require it), and doesn't seem to cost anything to implement, so maybe you could commit to it in the documentation?
Yes, the description of the module explicitly states that all functions are left-biased; if you have suggestions about how to improve the documentation in content or organization, I am interested in hearing them. Best, Leon

Leon Smith wrote:
On Wed, Feb 17, 2010 at 6:58 AM, Heinrich Apfelmus
wrote: Ah, I meant to use the union' from your previous message, but I think that doesn't work because it doesn't have the crucial property that the case
union (VIP x xs) ys = ...
does not pattern match on the second argument.
Ahh yes, my original union' has a bit that looks like this
union' (VIP x xs) (VIP y ys) = case cmp x y of LT -> VIP x (union' xs (VIP y ys)) EQ -> VIP x (union' xs ys) GT -> error "Data.List.Ordered.unionAll: assumption violated!" union' (VIP x xs) (Crowd ys) = VIP x (union' xs (Crowd ys))
For whatever reason, this works in the case of an infinite number of lists with my original version, but not the simplified version. By applying a standard transformation to make this lazier, we can rewrite these clauses as
union' (VIP x xs) ys = VIP x $ case ys of Crowd _ -> union' xs ys VIP y yt -> case cmp x y of LT -> union' xs ys EQ -> union' xs yt GT -> error msg
Oops, I missed this simple rewrite, mainly because the GT case did not start with the VIP x constructor. :D Regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com
participants (6)
-
Ben Millwood
-
Daniel Fischer
-
Evan Laforge
-
Heinrich Apfelmus
-
Leon Smith
-
Ozgur Akgun