[Haskell-cafe] Space Efficiency When Sorting a List of Many Lists

31 Dec 2009

      I'm a Haskell neophyte, so may be missing something obvious in the  
problem outlined below. I'm fairly proficient in Python + have some  
limited experience in OCaml and F#, so know just enough to be be  
dangerous, but not nearly enough to really know what I'm doing here.

OK, I have text files containing 'compressed lists' Compressed lists  
look like this:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

Sublists are comma-delimited, and sublist elements are separated by  
'+' character.

I parse these to look like so:

[[[8,12],[11],[7,13],[10]],
  [[1,2,3],[1,9],[3,6],[4]],
.
.
]

I need to explode these and produce a lex-sorted list of exploded lists:

[[1,1,3,4],[1,1,6,4],[1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4], 
[2,9,6,4],[3,1,3,4],[3,1,6,4],[3,9,3,4],[3,9,6,4],
  [8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10]]

I then output this data as comma-delimited lists:

1,1,3,4
1,1,6,4
.
.
12,11,13,10

I can do all this in my (no doubt fairly naive) program shown below.  
In real life, one of my test data files contains compressed lists  
which all contain 7 sublists. This data (correctly) explodes to a list  
containing ~3,700,000 exploded lists. All good as far as correctly  
transforming input to output goes. However, sorting the data uses a  
lot of memory:

Partial output from ./explode +RTS -sstderr:

540 MB total memory in use (4 MB lost due to fragmentation)

If I do not do any sorting on the exploded lists, i.e. modify the main  
function to be
	main = interact (unlines . toCSV . explode . fromCSV . lines)

I then see this partial output from ./explode +RTS --stderr:

2 MB total memory in use (0 MB lost due to fragmentation)

I can guess that there might be be less laziness and more  
instantiation when  sorting is introduced, but my questions are:
	(1) Am I doing anything terribly stupid/naive here?
	(2) If so, what can I do to improve space efficiency?

TIA!

import Data.List (sort)
import Data.List.Split (splitOn)

-- Cartesian Product over a List of Lists
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5], 
[2,3,6]]
cp          :: [[a]] -> [[a]]
cp []       =  [[]]
cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]

-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
     where parseOneLine = map parseGroup . splitOn ","
               where parseGroup = map read . splitOn "+"

-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] --> [[1,3,4], 
[1,3,5],
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16], 
[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode =  concatMap cp

-- toCSV [[8,11,7,10,12],[8,11,7,10,12],[8,11,7,10,12]] -->
-- ["8,11,7,10,12","8,11,7,10,12","8,11,7,10,12"]
toCSV :: (Show a) => [[a]] -> [String]
toCSV = map $ tail . init . show
--toCSV = map (intercalate "," . map show)

main = interact (unlines . toCSV . sort . explode . fromCSV . lines)

[Haskell-cafe] Space Efficiency When Sorting a List of Many Lists

Peter Green