Re: [Haskell-beginners] Unexpected Space Leak (despite using external-sort)

4 Jan 2010

      On 04/01/2010, at 10:43 AM, Stephen Blackheath [to Haskell-Beginners]  
wrote:
...
Peter,
I can't see anything in your code that stops it being lazy, and I also
read the source for external-sort.  It all looks OK except for one  
bit.
There's even a comment that says "It would be better if I changed Ord
for blocks to only check the first element."  Looks like it's  
comparing
the whole list instead of the first element when merging the lists,  
and
so it's a bit difficult to guess how much it is evaluating of each  
list.
If there are many repeated elements, it could be a lot.
You could try fixing that.  You could hack kMerge using a newtype  
and an
Ord instance (since Splay takes Ord instances).
Steve
Thanks Steve. I think I understand your point. It seems like external- 
sort needs patching to make block comparison more efficient for  
everyone and that my case is just particularly extreme/degenerate.  
Just right now, I'm not yet confident enough about my Haskell  
abilities to go and create a new type for 'block' with more efficient  
Ord and then modify the rest of the external-sort code accordingly.  
Posted program was the 2nd Haskell prog I've every written. I'll get  
there in time, though :).

I wonder if there is another way of making my code work better with  
the existing externalSort function:

I am sorting [[Int]], an example of which might be:

[[8, 11, 7, 10],
[8, 11, 13, 10],
[12, 11, 7, 10],
[12, 11, 13, 10],
[1, 1, 3, 4],
.
.
[3, 9, 6, 4]]

I need numerical lex sorting, so I can't just map directly to  
ByteString as this would result in "12" collating before "1", etc.

I know that each sublist I am sorting never contains an integer  
greater than 20. I could encode each sublist as a ByteString where 1  
maps to a, 2 maps to B, 3, maps to C, etc.

So would have [1,1,3,4] -> "aacd" (ByteString), etc.

Hopefully Ord comparisons between ByteStrings are much less expensive  
than comparisons between instances of  [Int] and I will see nicer  
behaviour since external-sort does work as advertised on lists of  
millions of Ints?

So I could sort these, then map back to underlying Int data when I  
output. Schwartzian transform type stuff.

Anyway, I'll give it a go.
...
Peter Green wrote:
...
I am using the external-sort package to sort my output in the program
below. I made this choice because my output dataset [[Int]] can be  
large
(e.g. >3M items, each [Int]).
What my program does:
(1) Reads in a file containing 'compressed lists' which look like so:
8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.
One compressed list per line. These compressed lists are parsed to
become [[[Int]]]
[[[8,12],[11],[7,13],[10]],
[[1,2,3],[1,9],[3,6],[4]],
.
.
]
Generally files of compressed lists have lengths of ~10,000 lines.
(2) Compressed lists are exploded to [[int]] via concatMap Cartesian
Product over [[[Int]]], so we end up with [[Int]]
[[8, 11, 7, 10],
[8, 11, 13, 10],
[12, 11, 7, 10],
[12, 11, 13, 10],
[1, 1, 3, 4],
.
.
[3, 9, 6, 4]]
These 'exploded lists are' *much* longer than the input lists. It's
common for them to have >100K [Int] members, or even >1M [Int]  
members.
(3) This [[Int]] data must be sorted in to lexicographic order and
output as CSV data: i.e. output should be:
1,1,3,4
3,9,6,4
8,11,7,10
.
.
12,11,13,10
There is no way to avoid the necessity of sorting the final  
product. I
can *not* take advantage of any structure in the input data to avoid
this sort stage. e.g. sorting the (smaller) input compressed lists  
*will
not* obviate the need to lex sort the (larger) final output.
The program listing below works correctly, but *is not* space- 
efficient.
In fact it is *less* space-efficient than the equivalent  program  
using
Prelude sort!
e.g. with one data set which explodes to ~3.7M I am seeing 639MB of
total memory use.
I am sure externalSort is not at fault because:
(a) I'm the Newbie.
(b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not  
small
because sorting a [[Int]] with 3.7M [Int] members.)
I'm sure it must be something I'm doing wrong elsewhere which is  
causing
the entire output data list to be read into memory.
What I really want to happen is:
(1) Data lazily read in from stdin and lazily parsed
(2) Data lazily sucked out to disk and strictly sorted by  
externalSort
(but no big deal since this happening on disk)
(3) Data lazily sucked back out of externalSort, lazily formatted for
output, and lazily written to stdout.
So memory usage should hopefully never go over (say 10MB). But in  
fact
I'm using 638MB. So somewhere in the code below I must be doing
something very wrong.
I hope someone can tell me what I am doing wrong here.
TIA!
import Algorithms.ExternalSort
import Data.List.Split (splitOn)
import System.IO
-- Cartesian Product over a List of Lists
-- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
-- cp [[1,2],[3],[4,5,6]] -->
[[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
cp          :: [[a]] -> [[a]]
cp []       =  [[]]
cp (xs:xss) =  [y:ys | y <- xs, ys <- cp xss]
-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
   where parseOneLine = map parseGroup . splitOn ","
             where parseGroup = map read . splitOn "+"
-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] -->
[[1,3,4],[1,3,5],
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16], 
[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode =  concatMap cp
-- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
--
[[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4], 
[1,1,6,4],
--  [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4], 
[3,1,3,4],
--  [3,1,6,4],[3,9,3,4],[3,9,6,4]]
toSingles :: String -> [[Int]]
toSingles = explode . fromCSV . lines
-- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
toCSV :: (Show a) => [a] -> String
toCSV = tail . init . show
main = do
 getContents >>= externalSort . toSingles >>= mapM_ (putStrLn .  
toCSV)
_______________________________________________
Beginners mailing list
Beginners@haskell.org
http://www.haskell.org/mailman/listinfo/beginners