
Heinrich, thanks for some great help and food for thought! (More on performance of your solution below)
Thanks for describing the background of this problem in detail! I was mainly asking because I'm always looking for interesting Haskell topics that can be turned into a tutorial of sorts, and this problem makes a great example.
Another interesting problem is starting from a file of single wagers and trying to compress them (i.e. the inverse of 'explosion'). I believe this problem is analogous to Set Cover and therefore NP-Complete; heuristics are the order of the day here. (OT, but it might also make an interesting multi-part tutorial topic: a simpler, fairly unrelated problem is doing exact k-means clustering of a list of n items (i.e. one-dimensional data) via the k-segmentations of the list, sorted by the values one is clustering on. The number of k-segmentations of an n-list is (n-1) choose (k-1). One generates all the k-segmentations, each set of k segments constituting a candidate clustering solution, and selects the candidate which minimises the 'goodness of fit' criterion used in k-means clustering (easily found in the literature); see the sketch below. Needless to say, this is only feasible for smallish n and for values of k close to the LHS and RHS of Pascal's triangle. K-means clustering in more than one dimension is NP-Complete.)
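To make that aside concrete, here is a minimal sketch of the enumeration step (the name and layout are mine; the scoring criterion is left out):

-- All ways to split a list into k contiguous, non-empty segments;
-- for a list of length n there are (n-1) `choose` (k-1) of them.
-- Assumes 1 <= k <= length xs.
segmentations :: Int -> [a] -> [[[a]]]
segmentations 1 xs = [[xs]]
segmentations k xs =
    [ seg : rest
    | i <- [1 .. length xs - k + 1]   -- leave room for k-1 more segments
    , let (seg, xs') = splitAt i xs
    , rest <- segmentations (k-1) xs'
    ]

For example, segmentations 2 [1,5,6] yields [[[1],[5,6]], [[1,5],[6]]]; exact 1-D k-means then scores each candidate with the usual within-cluster sum of squares and keeps the best.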
Concerning optimization, the background also reveals some additional information, namely
* The single wager combinations are short lists
* Each list element is small, i.e. `elem` [1..20]
* There are no duplicate single wagers
* You are free to choose another ordering than lexicographic
I need to explode these compressed wagers back to single combinations because I need to (eventually) do set comparisons on collections of single combinations, i.e. given File A and File B of compressed wagers, I need to be able to answer questions like:
(1) How many single combinations are common to File A and File B?
(2) How many single combinations in File A are missing from File B?

... etc., i.e. a few of the common set comparison operations.
And the dumbest, quickest-and-dirtiest, most idiot-proof way of doing this as a baseline approach, before I start using maps, tries, etc., is to explode Files A and B into lex-sorted single-combination order and then use diff -y with judicious grepping for '>' and '<'.
Looks good to me, sorted lists are a fine data structure for set operations.
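For example, once both files are exploded and sorted, intersection and difference are single merge-like passes; a sketch, assuming duplicate-free sorted inputs:

-- Intersection and difference of two sorted, duplicate-free lists,
-- each a single O(n+m) merge-like pass.
isect, minus :: Ord a => [a] -> [a] -> [a]
isect xs@(x:xs') ys@(y:ys') = case compare x y of
    LT ->     isect xs' ys
    EQ -> x : isect xs' ys'
    GT ->     isect xs  ys'
isect _ _ = []

minus xs@(x:xs') ys@(y:ys') = case compare x y of
    LT -> x : minus xs' ys
    EQ ->     minus xs' ys'
    GT ->     minus xs  ys'
minus xs _ = xs

These answer questions (1) and (2) directly, via length (isect as bs) and length (minus as bs).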
I'm probably going to go with an external sort for starters to keep memory usage down. For most use-cases, I can get away with my current in-memory sort - just have to beware edge cases.
Using an out-of-the box external sort library is probably the path of least resistance, provided the library works as it should.
Seems to be some issue with the external-sort package. As it currently exists, it blows up on my [[Int]] data and exhibits more memory usage than vanilla sort! I'm keen to look into fixing this, as it might be one small way I can give back, but I'm still taking baby steps with Haskell.
However, your problem has a very special structure; you can interleave explode and sort and turn the algorithm into one that is partially on-line, which will save you a lot of memory. I've tried to show how to do this in my previous post; the resulting code will look something like this:
{-# LANGUAGE NoMonomorphismRestriction #-}

import qualified Data.Map
import Control.Arrow (second)
newtype Map a b = Map { unMap :: [(a,b)] } deriving (Eq, Show)
instance Functor (Map a) where
    fmap f = Map . (map . second) f . unMap
hylo :: (Map a b -> b) -> (c -> Map a c) -> (c -> b)
hylo f g = f . fmap (hylo f g) . g
type Row a = [a]
headsOut :: Map a [Row a] -> [Row a]
headsOut (Map []) = [[]]
headsOut m        = [x:row | (x,rows) <- unMap m, row <- rows]
explode1 :: [Row [a]] -> Map a [Row [a]]
explode1 rows = Map [(x,[row]) | (xs:row) <- rows, x <- xs]
sort1 :: Ord a => Map a [b] -> Map a [b]
sort1 = Map . Data.Map.toList . Data.Map.fromListWith (++) . unMap
sortExplode = hylo headsOut (sort1 . explode1)
example = [[[8,12],[11],[7,13],[10]], [[1,2,3],[1,9],[3,6],[4]]]
test    = sortExplode example
main = interact (unlines . toCSV . sortExplode . fromCSV . lines)
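Here fromCSV and toCSV are left undefined since they depend on your file format, which I can only guess at. A minimal placeholder, assuming one compressed wager per line with comma-separated fields and '/' separating the alternatives within a field:

import Data.List (intercalate)   -- (goes at the top of the module)

-- Format assumption: "8/12,11,7/13,10" stands for [[8,12],[11],[7,13],[10]].
fromCSV :: [String] -> [Row [Int]]
fromCSV = map (map (map read . splitOn '/') . splitOn ',')

toCSV :: [Row Int] -> [String]
toCSV = map (intercalate "," . map show)

splitOn :: Char -> String -> [String]
splitOn c s = case break (== c) s of
    (w, [])     -> [w]
    (w, _:rest) -> w : splitOn c rest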
In other words, sortExplode will "stream" the sorted single wagers on demand, unlike the monolithic sort . explode which has to produce all single wagers before sorting them.
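For instance, since the result is produced lazily, demanding a prefix only forces the matching subtrees:

take 3 (sortExplode example)
-- => [[1,1,3,4],[1,1,6,4],[1,9,3,4]]

The top level still has to inspect every wager's first field (to know which head is smallest), but the remaining combinations are never fully expanded.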
Thank you *very* much for this code! I will try to get my head around it. I understand the broad outline you posted elsewhere, but it will take me a while to fully grasp the above as I'm only up to ~p200 in Real World Haskell :).

As for performance of your code above on my file of compressed wagers, which expands to 3.7M single wagers:

(My original version posted here, using vanilla sort)

  541 MB total memory in use (5 MB lost due to fragmentation)

  INIT  time    0.00s  (  0.01s elapsed)
  MUT   time   13.98s  ( 15.82s elapsed)
  GC    time    8.69s  (  9.64s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   22.67s  ( 25.47s elapsed)

(Your much improved version)

  10 MB total memory in use (1 MB lost due to fragmentation)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    7.61s  (  9.38s elapsed)
  GC    time    3.48s  (  3.58s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   11.08s  ( 12.95s elapsed)

Very impressive and thanks again!
However, later on, I'm keen to do everything with set-theoretic operations and avoid writing large files of single combinations to disk.
Note that there's also the possibility of not expanding the compressed wagers at all, and performing set operations directly. For instance, it is straightforward to intersect two such sets of size n in O(n^2) time. Since n ~ 5000, n^2 is in about the same ballpark as the number of exploded single wager combinations.
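The key fact (my gloss) is that the intersection of two cartesian products is again a cartesian product: (A1 x ... x Ak) ∩ (B1 x ... x Bk) = (A1 ∩ B1) x ... x (Ak ∩ Bk). So two compressed wagers can be intersected field by field, without exploding either. A sketch, assuming all wagers have the same number of fields:

import Data.List (intersect)

-- Field-wise intersection of two compressed wagers; the result is
-- empty as soon as any single field intersection is empty.
intersect1 :: Eq a => [[a]] -> [[a]] -> Maybe [[a]]
intersect1 xs ys
    | any null zs = Nothing
    | otherwise   = Just zs
  where zs = zipWith intersect xs ys

-- All pairwise intersections of two files, O(n^2) pairs; the result
-- is again a set of compressed wagers (not re-compressed).
intersectCompressed :: Eq a => [[[a]]] -> [[[a]]] -> [[[a]]]
intersectCompressed as bs =
    [ zs | a <- as, b <- bs, Just zs <- [intersect1 a b] ]

-- Number of single wagers a compressed wager expands to.
size1 :: [[a]] -> Int
size1 = product . map length

Given that neither file contains duplicate single wagers, the pairwise intersections are mutually disjoint, so sum (map size1 (intersectCompressed as bs)) counts the common single combinations without any explosion.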
Not sure I quite understand you here. In my mind, set elements *are* single combinations. It is possible for two quite different-looking files of compressed wagers to contain exactly the same single wager elements. So I'm not sure how to compare without some notion of explosion to single combinations.
So, I guess the things I'd like to get a feel for are:
(1) Is it realistic to expect to do in-memory set comparisons on sets with ~1M elements where each element is a (however-encoded) list of (say) 6 integers? I mean realistic in terms of execution time and space usage, of course.
(2) What would be a good space-efficient encoding for these single combinations? I know that there will never be more than 8 integers in a combination, and none of these values will be < 1 or > 20. So perhaps I should map them to ByteString library Word8 strings? Presumably sorting a big list of ByteStrings is going to be faster than sorting a big list of lists of int?
The overhead for lists and sets is quite high; it's not uncommon for 1M elements to occupy 10-100 MB of memory.
Not to mention that your elements are, in fact, small lists, introducing another factor of 10-100. But this can indeed be ameliorated considerably by representing your elements in a packed format like ByteStrings or a Word64.
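For instance, each value fits in 5 bits (20 < 2^5) and there are at most 8 of them, so a whole combination fits in 40 bits of a Word64. A sketch (names mine):

import Data.Bits ((.|.), (.&.), shiftL, shiftR)
import Data.Word (Word64)

-- Pack a combination of at most 8 values from [1..20] into a Word64,
-- 5 bits per value (values are >= 1, so 0 can mark "no value").
pack :: [Int] -> Word64
pack = foldl (\acc x -> (acc `shiftL` 5) .|. fromIntegral x) 0

unpack :: Word64 -> [Int]
unpack = reverse . go
  where
    go 0 = []
    go w = fromIntegral (w .&. 31) : go (w `shiftR` 5)

For combinations of equal length, the ordering on the packed Word64s coincides with the lexicographic ordering on the original lists, so sorting the packed values sorts the wagers; mixed lengths would need a slightly more careful encoding.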
I haven't tried packing ByteStrings yet in this context, but *did* get some improvement in external-sort performance when using ByteStrings instead of [[Int]]. Presumably because it somewhat ameliorated the effects of a bug in external-sort where blocks are fully compared rather than just the heads of blocks. I'll be interested to play around with your version more and see if it benefits much from ByteString. It's already very good now in terms of space usage, but perhaps can speed up the I/O (output dominates, of course). I'll be learning from your post for a long time. So thanks again!