
Peter, See below...
Thanks Steve. I think I understand your point. It seems like external-sort needs patching to make block comparison more efficient for everyone and that my case is just particularly extreme/degenerate. Just right now, I'm not yet confident enough about my Haskell abilities to go and create a new type for 'block' with more efficient Ord and then modify the rest of the external-sort code accordingly. Posted program was the 2nd Haskell prog I've every written. I'll get there in time, though :).
Laziness and space analysis make up a significant part of Haskell's learning curve, I believe, so you're doing well. Here's how I'm suggesting you change kMerge
kMerge :: (Ord a) => [[a]] -> [a] kMerge [] = [] kMerge l = let h = Splay.fromSeq l in kM (Splay.minElem h) (Splay.deleteMin h) where kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a] kM l h | h == Splay.empty = l | otherwise = let next = Splay.minElem h (f, b) = span (\x -> x <= head next) l in f ++ (kM next (if null b then Splay.deleteMin h else (Splay.insert b $ Splay.deleteMin h)))
becomes something like... (with lots of wrapping and unwrapping)...
newtype HeadCompList a = HeadCompList [a]
instance Ord a => Ord (HeadCompList a) where HeadCompList (x:_) `compare` HeadCompList (y:_) = x `compare` y -- compare head elts only
kMerge :: (Ord a) => [[a]] -> [a] kMerge [] = [] kMerge l = let wrappedL = map HeadCompList l h = Splay.fromSeq wrappedL in kM (Splay.minElem h) (Splay.deleteMin h) where kM :: (Ord a) => HeadCompList a -> Splay.Heap (HeadCompList a) -> [a] kM (HeadCompList l) h | h == Splay.empty = l | otherwise = let (HeadCompList next) = Splay.minElem h (f, b) = span (\x -> x <= head next) l in f ++ (kM next (if null b then Splay.deleteMin h else (Splay.insert (HeadCompList b) $ Splay.deleteMin h)))
I wonder if there is another way of making my code work better with the existing externalSort function:
I am sorting [[Int]], an example of which might be:
[[8, 11, 7, 10], [8, 11, 13, 10], [12, 11, 7, 10], [12, 11, 13, 10], [1, 1, 3, 4], . . [3, 9, 6, 4]]
I need numerical lex sorting, so I can't just map directly to ByteString as this would result in "12" collating before "1", etc.
I know that each sublist I am sorting never contains an integer greater than 20. I could encode each sublist as a ByteString where 1 maps to a, 2 maps to B, 3, maps to C, etc.
So would have [1,1,3,4] -> "aacd" (ByteString), etc.
Hopefully Ord comparisons between ByteStrings are much less expensive than comparisons between instances of [Int] and I will see nicer behaviour since external-sort does work as advertised on lists of millions of Ints?
Well, if I'm right, then the problem is that kMerge is evaluating too much of each of its ... in your case .. [[Int]] values. You're proposing to change it to [ByteString]. Again, if I'm right, that would improve it by exactly the factor of space usage by ByteString / [Int], without solving the fundamental problem.
So I could sort these, then map back to underlying Int data when I output. Schwartzian transform type stuff.
Anyway, I'll give it a go.
I may be wrong! Steve