
Peter Green wrote:
Luke Palmer wrote:
Yes, by a lot. Sorting requires keeping the entire list in memory. And Haskell lists, unfortunately, are not that cheap in terms of space usage (I think [Int] uses 3 words per element).
but my questions are: (1) Am I doing anything terribly stupid/naive here? (2) If so, what can I do to improve space efficiency?
I think I should re-state the problem and provide more background info. In answer to the poster who suggested using a Trie: there *is* a real-world application for what I'm attempting to do. I'll also hint at that below:
{ Begin Aside on Why I Would Want to do Such a Silly Thing:
'Compressed lists' are in fact compressed wager combinations for multi-leg exotic horse racing events. 'Exploded lists' are the single wager combinations covered by the grouped combinations.
[...]
Thanks for describing the background of this problem in detail! I was mainly asking because I'm always looking for interesting Haskell topics that can be turned into a tutorial of sorts, and this problem makes a great example. Concerning optimization, the background also reveals some additional information, namely:

* The single wager combinations are short lists.
* Each list element is small, i.e. `elem` [1..20].
* There are no duplicate single wagers.
* You are free to choose another ordering than lexicographic.
I need to explode these compressed wagers back to single combinations because I need to (eventually) do set comparisons on collections of single combinations, i.e. given File A and File B of compressed wagers, I need to be able to answer questions like:
(1) How many single combinations are common to File A and File B?
(2) How many single combinations in File A are missing from File B?
... etc., i.e. a few of the common set comparison operations.
And the dumbest, quickest, dirtiest, and most idiot-proof way of doing this as a baseline approach, before I start using maps, tries, etc., is to explode Files A and B into lex-sorted single combinations and then use diff -y with judicious grepping for '>' and '<'.
Looks good to me; sorted lists are a fine data structure for set operations.
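For instance, the answers to questions (1) and (2) above fall out of a single merge-like pass over two lex-sorted lists, in linear time and streaming the results lazily. A minimal sketch (the function names are mine, not from any library):

-- Intersection of two sorted lists (duplicates assumed absent).
intersectAsc :: Ord a => [a] -> [a] -> [a]
intersectAsc xs@(x:xt) ys@(y:yt) = case compare x y of
    LT ->     intersectAsc xt ys
    EQ -> x : intersectAsc xt yt
    GT ->     intersectAsc xs yt
intersectAsc _ _ = []

-- Elements of the first sorted list missing from the second.
differenceAsc :: Ord a => [a] -> [a] -> [a]
differenceAsc xs@(x:xt) ys@(y:yt) = case compare x y of
    LT -> x : differenceAsc xt ys
    EQ ->     differenceAsc xt yt
    GT ->     differenceAsc xs yt
differenceAsc xs _ = xs

Then length (intersectAsc as bs) answers question (1) and length (differenceAsc as bs) answers question (2).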
I'm probably going to go with an external sort for starters to keep memory usage down. For most use cases I can get away with my current in-memory sort; I just have to beware of edge cases.
Using an out-of-the-box external sort library is probably the path of least resistance, provided the library works as it should. However, your problem has a very special structure; you can interleave explode and sort and turn the algorithm into one that is partially on-line, which will save you a lot of memory. I've tried to show how to do this in my previous post; the resulting code will look something like this:

{-# LANGUAGE NoMonomorphismRestriction #-}

import qualified Data.Map
import Control.Arrow (second)

-- A finite map represented as a list of key/value pairs.
newtype Map a b = Map { unMap :: [(a,b)] } deriving (Eq, Show)

instance Functor (Map a) where
    fmap f = Map . (map . second) f . unMap

-- Hylomorphism: unfold with g, then fold with f.
hylo :: (Map a b -> b) -> (c -> Map a c) -> (c -> b)
hylo f g = f . fmap (hylo f g) . g

type Row a = [a]

-- Prepend each key to every row grouped under it.
headsOut :: Map a [Row a] -> [Row a]
headsOut (Map []) = [[]]
headsOut m        = [x:row | (x,rows) <- unMap m, row <- rows]

-- Explode the first leg of each compressed row.
explode1 :: [Row [a]] -> Map a [Row [a]]
explode1 rows = Map [(x,[row]) | (xs:row) <- rows, x <- xs]

-- Sort by key, merging the rows that share a key.
sort1 :: Ord a => Map a [b] -> Map a [b]
sort1 = Map . Data.Map.toList . Data.Map.fromListWith (++) . unMap

sortExplode = hylo headsOut (sort1 . explode1)

example = [[[8,12],[11],[7,13],[10]]
          ,[[1,2,3],[1,9],[3,6],[4]]]
test    = sortExplode example

main = interact (unlines . toCSV . sortExplode . fromCSV . lines)
    -- with toCSV and fromCSV as before (not shown here)

In other words, sortExplode will "stream" the sorted single wagers on demand, unlike the monolithic sort . explode, which has to produce all single wagers before sorting them.
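For what it's worth, evaluating test on the example above should yield the sixteen single wagers already in lexicographic order (output reformatted here for readability):

*Main> test
[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4]
,[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4]
,[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4]
,[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]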
However, later on, I'm keen to do everything with set-theoretic operations and avoid writing large files of single combinations to disk.
Note that there's also the possibility of not expanding the compressed wagers at all and performing set operations on them directly. For instance, it is straightforward to intersect two such sets of size n in O(n^2) time. Since n ~ 5000, n^2 is in the same ballpark as the number of exploded single wager combinations.
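A minimal sketch of that direct intersection, assuming all wagers in a file have the same number of legs and, as you said, no single wager is covered twice within one file (the function names are mine):

import Data.List (intersect)

-- A compressed wager is a cartesian product of legs, so two of them
-- overlap in the product of their leg-wise intersection sizes.
overlap :: Eq a => [[a]] -> [[a]] -> Int
overlap legsA legsB =
    product (zipWith (\xs ys -> length (xs `intersect` ys)) legsA legsB)

-- Number of single combinations common to two files of compressed
-- wagers: a sum over all O(n^2) pairs. This is correct only because
-- no single wager appears twice within one file.
commonCount :: Eq a => [[[a]]] -> [[[a]]] -> Int
commonCount fileA fileB = sum [ overlap a b | a <- fileA, b <- fileB ]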
So, I guess the things I'd like to get a feel for are:
(1) Is it realistic to expect to do in-memory set comparisons on sets with ~1M elements where each element is a (however-encoded) list of (say) 6 integers? I mean realistic in terms of execution time and space usage, of course.
(2) What would be a good space-efficient encoding for these single combinations? I know that there will never be more than 8 integers in a combination, and none of these values will be < 1 or > 20. So perhaps I should map them to ByteStrings of Word8s? Presumably sorting a big list of ByteStrings is going to be faster than sorting a big list of lists of Int?
The overhead for lists and sets is quite high; it's not uncommon for 1M elements to occupy 10MB-100MB of memory. Not to mention that your elements are, in fact, small lists, introducing another factor of 10-100. But this can indeed be ameliorated considerably by representing your elements in a packed format like ByteStrings or a single Word64.

Regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com
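PS: A minimal sketch of the Word64 packing, assuming (as you said) at most 8 elements per combination, each in [1..20] and hence fitting in 5 bits; the function names are mine:

import Data.Bits (shiftL, shiftR, (.|.), (.&.))
import Data.List (foldl')
import Data.Word (Word64, Word8)

-- Pack up to 8 elements, 5 bits each, most significant first.
-- Missing trailing elements are padded with 0; since real elements
-- are >= 1, the Word64 ordering then agrees with the lexicographic
-- ordering on the original lists.
pack :: [Word8] -> Word64
pack xs = foldl' step 0 (take 8 (xs ++ repeat 0))
  where step acc x = (acc `shiftL` 5) .|. fromIntegral x

unpack :: Word64 -> [Word8]
unpack w = [ x | i <- [7,6..0]
           , let x = fromIntegral ((w `shiftR` (5*i)) .&. 31)
           , x /= 0 ]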