
On 31/12/2009, at 6:38 PM, Luke Palmer wrote:
On Wed, Dec 30, 2009 at 9:39 PM, Peter Green wrote:
I can guess that there might be less laziness and more instantiation when sorting is introduced,
Yes, by a lot. Sorting requires keeping the entire list in memory. And Haskell lists, unfortunately, are not that cheap in terms of space usage (I think [Int] uses 3 words per element).
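(As a rough back-of-envelope check, assuming 8-byte words on a 64-bit machine and taking that 3-words-per-element figure at face value: a million exploded combinations of 6 Ints each is about 10^6 * 6 * 3 * 8 ~= 144 MB of list cells alone, before any overhead from sorting.)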
but my questions are: (1) Am I doing anything terribly stupid/naive here? (2) If so, what can I do to improve space efficiency?
TIA!
import Data.List (sort)
import Data.List.Split (splitOn)
-- Cartesian Product over a List of Lists
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
cp :: [[a]] -> [[a]]
cp []       = [[]]
cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
This cartesian product varies in its tail faster than its head, so every head gets its own unique tail. If you reverse the order of the bindings so that it varies in its head faster, then tails are shared. If my quick and dirty reasoning is correct, it improves the space usage by a factor of the number of sublists.
cp' []       = [[]]
cp' (xs:xss) = [y:ys | ys <- cp' xss, y <- xs]
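For instance, a quick check (assuming both definitions above and Data.List's sort are in scope) that cp' enumerates the same combinations as cp, just with the head varying fastest:

ghci> cp' [[1,2],[3],[4,5,6]]
[[1,3,4],[2,3,4],[1,3,5],[2,3,5],[1,3,6],[2,3,6]]
ghci> sort (cp [[1,2],[3],[4,5,6]]) == sort (cp' [[1,2],[3],[4,5,6]])
True

So when the result is going to be sorted anyway, cp' is a drop-in replacement for cp.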
But if you're serious, you can probably do better than just generating them all and passing them to sort. I get the impression that there is some structure here that can be taken advantage of.
-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
  where
    parseOneLine = map parseGroup . splitOn ","
      where
        parseGroup = map read . splitOn "+"
-- explode [[[1,2],[3],[4,5,6]], [[1,2],[14,15],[16]]] -->
-- [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6],
--  [1,14,16],[1,15,16],[2,14,16],[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode = concatMap cp
-- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] -->
-- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"]
toCSV :: (Show a) => [[a]] -> [String]
toCSV = map $ tail . init . show
-- toCSV = map (intercalate "," . map show)
main = interact (unlines . toCSV . sort . explode . fromCSV . lines)
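For example, a quick run over a single compressed list (a sketch; 'Explode.hs' is a hypothetical name for the module above):

$ echo "8+12,11,7+13,10" | runghc Explode.hs
8,11,7,10
8,11,13,10
12,11,7,10
12,11,13,10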
Thank you everyone for the very helpful suggestions so far! I think I should re-state the problem and provide more background info. In answer to the poster who suggested using a Trie: there *is* a real-world application for what I'm attempting to do. I'll also hint at that below.

I have text files containing 'compressed lists'. Compressed lists look like this:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

Sublists are comma-delimited, and sublist elements are separated by the '+' character. Sublists contain integers in the range 1..20. I parse these to look like so:

[[[8,12],[11],[7,13],[10]],
 [[1,2,3],[1,9],[3,6],[4]],
 .
 .
]

I need to explode these and produce a lex-sorted list of exploded lists:

[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],
 [2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
 [8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]

I then output this data as comma-delimited lists:

1,1,3,4
1,1,6,4
.
.
12,11,13,10

It is a property of the underlying problem 'structure' that none of these comma-delimited lists in the exploded data is repeated, i.e. we can never see two instances of (say) 1,1,3,4.

{ Begin Aside on Why I Would Want to do Such a Silly Thing:

'Compressed lists' are in fact compressed wager combinations for multi-leg exotic horse racing events. 'Exploded lists' are the single wager combinations covered by the grouped combinations.

Why use compressed combinations? Well, it's a lot easier to submit 5,000 compressed tickets (whether physically or electronically) than 1M single tickets for the individual combinations we have somehow managed to represent by the 5,000 compressed tickets. However, single combinations are the currency of the realm. These are the underlying wagers, and compression is merely a convenience/necessity to enable single wagers to be placed.

It's not uncommon to generate large numbers of single combinations: imagine 6 races with 15 runners in each race, and a wager requiring one to select the first-place getters in all 6 legs. The number of potential outcomes is 15^6 = 11,390,625. One might decide (somehow) that it makes sense, in a positive-expectation way, to wager (say) 500K of these potential outcomes.

Getting from theoretically desirable single combinations to optimally compressed wagers is actually NP-complete, and another story for another day. Suffice it to say that one might use some nasty hacks and heuristics to arrive at compressed wagers - but in the process do violence to one's precise coverage of the theoretically desirable single combinations. That's one reason why I need to recover (by 'exploding') the *actual* wagered single combinations from the compressed wagers.

I need to explode these compressed wagers back to single combinations because I need to (eventually) do set comparisons on collections of single combinations, i.e. given File A and File B of compressed wagers, I need to be able to answer questions like:

(1) How many single combinations are common to File A and File B?
(2) How many single combinations in File A are missing from File B?
.
.

etc. - i.e. a few of the common set comparison operations.

And the dumbest, quickest, dirtiest and most idiot-proof way of doing this as a baseline approach - before I start using maps, tries, etc. - is to explode Files A and B into lex-sorted single combinations and then use diff -y with judicious grepping for '>' and '<'.

End Aside }

I'm probably going to go with an external sort for starters to keep memory usage down.
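For the eventual set-theoretic version, here is a minimal sketch of that baseline comparison done in memory with Data.Set instead of diff (assuming Files A and B have already been exploded to one combination per line; the file names are hypothetical):

import qualified Data.Set as Set

main :: IO ()
main = do
  a <- readFile "fileA.txt"
  b <- readFile "fileB.txt"
  let setA = Set.fromList (lines a)
      setB = Set.fromList (lines b)
  -- (1) single combinations common to File A and File B
  print (Set.size (Set.intersection setA setB))
  -- (2) single combinations in File A missing from File B
  print (Set.size (Set.difference setA setB))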
For most use-cases, I can get away with my current in-memory sort - I just have to beware of edge cases. However, later on, I'm keen to do everything with set-theoretic operations and avoid writing large files of single combinations to disk.

So, I guess the things I'd like to get a feel for are:

(1) Is it realistic to expect to do in-memory set comparisons on sets with ~1M elements, where each element is a (however-encoded) list of (say) 6 integers? I mean realistic in terms of execution time and space usage, of course.

(2) What would be a good space-efficient encoding for these single combinations? I know that there will never be more than 8 integers in a combination, and none of these values will be < 1 or > 20. So perhaps I should map them to ByteString library Word8 strings? Presumably sorting a big list of ByteStrings is going to be faster than sorting a big list of lists of Int?
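On question (2), a minimal sketch of the Word8 encoding idea (assuming strict ByteStrings; since every element fits in 1..20, the fromIntegral conversion to Word8 is lossless, and lexicographic ByteString order agrees with lex order on the original Int lists):

import qualified Data.ByteString as B
import Data.List (sort)

-- Pack one combination (at most 8 Ints, each in 1..20) into 1 byte per element.
encode :: [Int] -> B.ByteString
encode = B.pack . map fromIntegral

decode :: B.ByteString -> [Int]
decode = map fromIntegral . B.unpack

-- Sorting the packed form gives the same order as sorting the lists:
main :: IO ()
main = print (map decode (sort (map encode [[8,11,7,10],[1,1,6,4],[1,1,3,4]])))
-- --> [[1,1,3,4],[1,1,6,4],[8,11,7,10]]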