Re: [Haskell-beginners] Unexpected Space Leak (despite using external-sort)

4 Jan 2010

      Peter,

See below...
...
Thanks Steve. I think I understand your point. It seems like
external-sort needs patching to make block comparison more efficient for
everyone and that my case is just particularly extreme/degenerate. Just
right now, I'm not yet confident enough about my Haskell abilities to go
and create a new type for 'block' with more efficient Ord and then
modify the rest of the external-sort code accordingly. Posted program
was the 2nd Haskell prog I've every written. I'll get there in time,
though :).
Laziness and space analysis make up a significant part of Haskell's
learning curve, I believe, so you're doing well.

Here's how I'm suggesting you change kMerge
...
kMerge :: (Ord a) => [[a]] -> [a]
kMerge [] = []
kMerge l =
    let h = Splay.fromSeq l in
    kM (Splay.minElem h) (Splay.deleteMin h)
    where
    kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
    kM l h
        | h == Splay.empty = l
        | otherwise =
            let next = Splay.minElem h
                (f, b) = span (\x -> x <= head next) l
            in
            f ++ (kM next (if null b then Splay.deleteMin h
                           else (Splay.insert b $ Splay.deleteMin h)))
becomes something like... (with lots of wrapping and unwrapping)...
...
newtype HeadCompList a = HeadCompList [a]
instance Ord a => Ord (HeadCompList a) where
    HeadCompList (x:_) `compare` HeadCompList (y:_) = x `compare` y
-- compare head elts only
kMerge :: (Ord a) => [[a]] -> [a]
kMerge [] = []
kMerge l =
    let wrappedL = map HeadCompList l
         h = Splay.fromSeq wrappedL in
    kM (Splay.minElem h) (Splay.deleteMin h)
    where
    kM :: (Ord a) => HeadCompList a -> Splay.Heap (HeadCompList a) -> [a]
    kM (HeadCompList l) h
        | h == Splay.empty = l
        | otherwise =
            let (HeadCompList next) = Splay.minElem h
                (f, b) = span (\x -> x <= head next) l
            in
            f ++ (kM next (if null b then Splay.deleteMin h
                           else (Splay.insert (HeadCompList b) $
Splay.deleteMin h)))
...
I wonder if there is another way of making my code work better with the
existing externalSort function:
I am sorting [[Int]], an example of which might be:
[[8, 11, 7, 10],
[8, 11, 13, 10],
[12, 11, 7, 10],
[12, 11, 13, 10],
[1, 1, 3, 4],
.
.
[3, 9, 6, 4]]
I need numerical lex sorting, so I can't just map directly to ByteString
as this would result in "12" collating before "1", etc.
I know that each sublist I am sorting never contains an integer greater
than 20. I could encode each sublist as a ByteString where 1 maps to a,
2 maps to B, 3, maps to C, etc.
So would have [1,1,3,4] -> "aacd" (ByteString), etc.
Hopefully Ord comparisons between ByteStrings are much less expensive
than comparisons between instances of  [Int] and I will see nicer
behaviour since external-sort does work as advertised on lists of
millions of Ints?
Well, if I'm right, then the problem is that kMerge is evaluating too
much of each of its ... in your case .. [[Int]] values.  You're
proposing to change it to [ByteString].  Again, if I'm right, that would
improve it by exactly the factor of space usage by ByteString / [Int],
without solving the fundamental problem.
...
So I could sort these, then map back to underlying Int data when I
output. Schwartzian transform type stuff.
Anyway, I'll give it a go.
I may be wrong!

Steve

Re: [Haskell-beginners] Unexpected Space Leak (despite using external-sort)

Stephen Blackheath [to Haskell-Beginners]