
I am using the external-sort package to sort my output in the program below. I made this choice because my output dataset [[Int]] can be large (e.g. >3M items, each [Int]).

What my program does:

(1) Reads in a file containing 'compressed lists', which look like so:

8+12,11,7+13,10
1+2+3,1+9,3+6,4
.
.

One compressed list per line. These compressed lists are parsed to become [[[Int]]]:

[[[8,12],[11],[7,13],[10]],
 [[1,2,3],[1,9],[3,6],[4]],
 .
 . ]

Generally, files of compressed lists are ~10,000 lines long.

(2) Compressed lists are exploded to [[Int]] by concatMap-ing the Cartesian product over the [[[Int]]], so we end up with [[Int]]:

[[8, 11, 7, 10],
 [8, 11, 13, 10],
 [12, 11, 7, 10],
 [12, 11, 13, 10],
 [1, 1, 3, 4],
 .
 .
 [3, 9, 6, 4]]

These 'exploded lists' are *much* longer than the input lists. It's common for them to have >100K [Int] members, or even >1M [Int] members.

(3) This [[Int]] data must be sorted into lexicographic order and output as CSV data, i.e. the output should be:

1,1,3,4
3,9,6,4
8,11,7,10
.
.
12,11,13,10

There is no way to avoid the necessity of sorting the final product. I can *not* take advantage of any structure in the input data to avoid this sort stage, e.g. sorting the (smaller) input compressed lists *will not* obviate the need to lex-sort the (larger) final output.

The program listing below works correctly, but *is not* space-efficient. In fact, it is *less* space-efficient than the equivalent program using Prelude sort! E.g. with one data set which explodes to ~3.7M lists, I am seeing 639MB of total memory use.

I am sure externalSort is not at fault, because: (a) I'm the newbie. (b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not small, since it holds a [[Int]] with 3.7M [Int] members.)

I'm sure it must be something I'm doing wrong elsewhere which is causing the entire output data list to be read into memory. What I really want to happen is:

(1) Data lazily read in from stdin and lazily parsed.
(2) Data lazily sucked out to disk and strictly sorted by externalSort (no big deal, since this happens on disk).
(3) Data lazily sucked back out of externalSort, lazily formatted for output, and lazily written to stdout.

So memory usage should hopefully never go over, say, 10MB. But in fact I'm using 638MB, so somewhere in the code below I must be doing something very wrong.

I hope someone can tell me what I am doing wrong here. TIA!

import Algorithms.ExternalSort
import Data.List.Split (splitOn)
import System.IO

-- Cartesian product over a list of lists
-- http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
cp :: [[a]] -> [[a]]
cp [] = [[]]
cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]

-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] -->
-- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]]
fromCSV :: [String] -> [[[Int]]]
fromCSV = map parseOneLine
  where parseOneLine = map parseGroup . splitOn ","
          where parseGroup = map read . splitOn "+"

-- explode [[[1,2],[3],[4,5,6]], [[1,2],[14,15],[16]]] --> [[1,3,4],[1,3,5],
-- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16],[2,15,16]]
explode :: [[[a]]] -> [[a]]
explode = concatMap cp

-- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" -->
-- [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4],[1,1,6,4],
-- [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4],[3,1,3,4],
-- [3,1,6,4],[3,9,3,4],[3,9,6,4]]
toSingles :: String -> [[Int]]
toSingles = explode . fromCSV . lines

-- toCSV [8,11,7,10,12] --> "8,11,7,10,12"
toCSV :: (Show a) => [a] -> String
toCSV = tail . init . show

main :: IO ()
main = getContents >>= externalSort . toSingles >>= mapM_ (putStrLn . toCSV)
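For reference, the 'equivalent program using Prelude sort' mentioned above would look something like this sketch (reusing toSingles and toCSV from the listing, with its own main):

import Data.List (sort)

-- In-memory variant: sort on [[Int]] is already lexicographic.
main :: IO ()
main = getContents >>= mapM_ (putStrLn . toCSV) . sort . toSingles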

Peter,

I can't see anything in your code that stops it being lazy, and I also read the source for external-sort. It all looks OK except for one bit. There's even a comment that says "It would be better if I changed Ord for blocks to only check the first element." It looks like it's comparing whole lists instead of just their first elements when merging, so it's difficult to guess how much of each list it is evaluating. If there are many repeated elements, it could be a lot.

You could try fixing that. You could hack kMerge using a newtype and an Ord instance (since Splay takes Ord instances).

Steve
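P.S. To see the cost concretely: the derived Ord on lists forces both lists up to the first difference, so merging lists with long shared prefixes evaluates far more than one element per comparison. A tiny sketch:

main :: IO ()
main = print (compare [1,1,1,1,2 :: Int] [1,1,1,1,3])
-- LT, but only after forcing all five elements of each list;
-- a head-only comparison would force just the first elements.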

On 04/01/2010, at 10:43 AM, Stephen Blackheath [to Haskell-Beginners] wrote:
Peter,
I can't see anything in your code that stops it being lazy, and I also read the source for external-sort. It all looks OK except for one bit. There's even a comment that says "It would be better if I changed Ord for blocks to only check the first element." It looks like it's comparing whole lists instead of just their first elements when merging, so it's difficult to guess how much of each list it is evaluating. If there are many repeated elements, it could be a lot.
You could try fixing that. You could hack kMerge using a newtype and an Ord instance (since Splay takes Ord instances).
Steve
Thanks Steve. I think I understand your point. It seems like external-sort needs patching to make block comparison more efficient for everyone, and my case is just particularly extreme/degenerate. Just right now, I'm not yet confident enough about my Haskell abilities to go and create a new type for 'block' with a more efficient Ord and then modify the rest of the external-sort code accordingly. The posted program was the 2nd Haskell program I've ever written. I'll get there in time, though :).

I wonder if there is another way of making my code work better with the existing externalSort function. I am sorting [[Int]], an example of which might be:

[[8, 11, 7, 10],
 [8, 11, 13, 10],
 [12, 11, 7, 10],
 [12, 11, 13, 10],
 [1, 1, 3, 4],
 .
 .
 [3, 9, 6, 4]]

I need numerical lex sorting, so I can't just map directly to ByteString, as this would result in "12" collating before "2", etc. However, I know that each sublist I am sorting never contains an integer greater than 20, so I could encode each sublist as a ByteString where 1 maps to a, 2 maps to b, 3 maps to c, etc. So I would have [1,1,3,4] -> "aacd" (as a ByteString), etc.

Hopefully Ord comparisons between ByteStrings are much less expensive than comparisons between [Int] values, and I will see nicer behaviour, since external-sort does work as advertised on lists of millions of Ints? So I could sort the encoded lists, then map back to the underlying Int data on output. Schwartzian transform type stuff.

Anyway, I'll give it a go.
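A minimal sketch of that encoding (assuming every value really is in 1..20 so each fits in one letter; encodeList and decodeList are just illustrative names):

import qualified Data.ByteString.Char8 as B
import Data.Char (chr, ord)

-- [1,1,3,4] -> "aacd": one letter per Int, so ByteString's Ord
-- coincides with numeric lexicographic order on the original lists.
encodeList :: [Int] -> B.ByteString
encodeList = B.pack . map (\n -> chr (ord 'a' + n - 1))

-- Inverse mapping, applied after sorting, when writing the output.
decodeList :: B.ByteString -> [Int]
decodeList = map (\c -> ord c - ord 'a' + 1) . B.unpack

The sort would then run over [ByteString], with decodeList mapped over the results on output, which is the Schwartzian-transform step described above.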

Peter,

See below...
Thanks Steve. I think I understand your point. It seems like external-sort needs patching to make block comparison more efficient for everyone, and my case is just particularly extreme/degenerate. Just right now, I'm not yet confident enough about my Haskell abilities to go and create a new type for 'block' with a more efficient Ord and then modify the rest of the external-sort code accordingly. The posted program was the 2nd Haskell program I've ever written. I'll get there in time, though :).
Laziness and space analysis make up a significant part of Haskell's learning curve, I believe, so you're doing well. Here's how I'm suggesting you change kMerge:

kMerge :: (Ord a) => [[a]] -> [a]
kMerge [] = []
kMerge l = let h = Splay.fromSeq l
           in kM (Splay.minElem h) (Splay.deleteMin h)
  where
    kM :: (Ord a) => [a] -> Splay.Heap [a] -> [a]
    kM l h
      | h == Splay.empty = l
      | otherwise =
          let next   = Splay.minElem h
              (f, b) = span (\x -> x <= head next) l
          in f ++ kM next (if null b
                             then Splay.deleteMin h
                             else Splay.insert b (Splay.deleteMin h))
becomes something like... (with lots of wrapping and unwrapping)...
newtype HeadCompList a = HeadCompList [a]

-- Eq is a superclass of Ord, so an instance is needed for this to
-- compile; defining it via compare keeps it consistent with the
-- head-only ordering below.
instance Ord a => Eq (HeadCompList a) where
  x == y = compare x y == EQ

instance Ord a => Ord (HeadCompList a) where
  -- Compare head elements only. kMerge never stores an empty list
  -- in the heap, so the partial pattern match is safe there.
  HeadCompList (x:_) `compare` HeadCompList (y:_) = x `compare` y
kMerge :: (Ord a) => [[a]] -> [a]
kMerge [] = []
kMerge l = let wrappedL = map HeadCompList l
               h = Splay.fromSeq wrappedL
           in kM (Splay.minElem h) (Splay.deleteMin h)
  where
    kM :: (Ord a) => HeadCompList a -> Splay.Heap (HeadCompList a) -> [a]
    kM (HeadCompList l) h
      | h == Splay.empty = l
      | otherwise =
          let HeadCompList next = Splay.minElem h
              (f, b) = span (\x -> x <= head next) l
          in f ++ kM (HeadCompList next)  -- re-wrapped so the types line up
                     (if null b
                        then Splay.deleteMin h
                        else Splay.insert (HeadCompList b) (Splay.deleteMin h))
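As a quick sanity check of the head-only ordering (assuming the definitions above are in scope):

main :: IO ()
main = print (compare (HeadCompList [1, undefined]) (HeadCompList [2 :: Int]))
-- Prints LT after forcing only the two head elements; the undefined
-- second element is never touched, which is exactly the evaluation
-- saving we want during the merge.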
I wonder if there is another way of making my code work better with the existing externalSort function:
I am sorting [[Int]], an example of which might be:
[[8, 11, 7, 10], [8, 11, 13, 10], [12, 11, 7, 10], [12, 11, 13, 10], [1, 1, 3, 4], . . [3, 9, 6, 4]]
I need numerical lex sorting, so I can't just map directly to ByteString, as this would result in "12" collating before "2", etc.
I know that each sublist I am sorting never contains an integer greater than 20, so I could encode each sublist as a ByteString where 1 maps to a, 2 maps to b, 3 maps to c, etc. So I would have [1,1,3,4] -> "aacd" (as a ByteString), etc.
Hopefully Ord comparisons between ByteStrings are much less expensive than comparisons between [Int] values, and I will see nicer behaviour, since external-sort does work as advertised on lists of millions of Ints?
Well, if I'm right, then the problem is that kMerge is evaluating too much of each of its values (in your case, [[Int]]). You're proposing to change the element type to [ByteString]. Again, if I'm right, that would improve things only by the constant factor of ByteString's space usage versus [Int]'s, without solving the fundamental problem.
So I could sort these, then map back to underlying Int data when I output. Schwartzian transform type stuff.
Anyway, I'll give it a go.
I may be wrong! Steve

On Monday, 4 January 2010 at 02:35:04, Peter Green wrote:
(2) Compressed lists are exploded to [[Int]] by concatMap-ing the Cartesian product over the [[[Int]]], so we end up with [[Int]]:
[[8, 11, 7, 10], [8, 11, 13, 10], [12, 11, 7, 10], [12, 11, 13, 10], [1, 1, 3, 4], . . [3, 9, 6, 4]]
I have a different idea. You want to consume the data in order, don't you? You can interleave sorting and exploding then:

1. Read in the data (~10,000 lists of lists. In your examples, each line contains a list of four short lists, the nested lists containing 1-3 elements. I hope that's not too far from reality.)

2. map (\l -> ([],l)) to get

   [([],[[8,12],[11],[7,13],[10]]),
    ([],[[1,2,3],[1,9],[3,6],[4]]),
    ...]

3. Split the first sublist off the second component and append its elements to the first component (the so-far-empty accumulator):

   concatMap (\(acc, toSplit:rest) -> [(acc ++ [x], rest) | x <- toSplit])

   applied to the list above gives

   [([8],[[11],[7,13],[10]]),
    ([12],[[11],[7,13],[10]]),
    ([1],[[1,9],[3,6],[4]]),
    ([2],[[1,9],[3,6],[4]]),
    ([3],[[1,9],[3,6],[4]]),
    ...]

   a list of ~30,000(?) ([Int],[[Int]]) pairs.

4. sortBy (comparing fst)   -- import Data.Ord (comparing)

5. groupBy ((==) `on` fst)  -- import Data.Function (on)

   You now have something like

   [ [([1],[[1,9],[3,6],[4]]), ([1],[[?],[?],[?]]), ...],
     [([2],[[1,9],[3,6],[4]]), ([2],[[?],[?],[?]]), ...],
     ...,
     [([131],[[?],[?],[?]]), ([131],[?]), ...]
   ]

6. If necessary, repeat steps 3 to 5 on each of the groups to get

   [ [ [([1,1],[[3,6],[4]]), ([1,1],[[?],[?]]), ...],
       [([1,2],[[?],[?]]), ([1,2],[[?],[?]]), ...],
       ... ] ]

   and concat to get

   [ [([1,1],...), ...],
     [([1,2],...), ...],
     ... ]

   Iterate until exploding in step 8 produces something manageable.

7. map (\g@((h,_):_) -> (h, map snd g))

   You now have something of type [([Int],[[[Int]]])]:

   [([1], [[[1,9],[3,6],[4]], [[?],[?],[?]], ...]),
    ([2], [[[1,9],[3,6],[4]], [[?],[?],[?]], ...]),
    ...,
    ([131],[[[?],[?],[?]], ...])]

   or

   [([1,1],[[[3,6],[4]], [[?],[?]], ...]),
    ([1,2],[[[?],[?]], [[?],[?]], ...]),
    ...]

8. Explode the second components of each pair, sort each, append to the first component, and consume. Thanks to laziness, the [[[Int]]] of the second group will only be exploded after the first group has been consumed.

One caveat: if the lines do not always contain lists of equal length, you must not repeat steps 3 to 5 more often than the shortest length, or modify the function in step 3 to handle that, e.g.

partialExplode (acc, toSplit:rest) = [(acc ++ [x], rest) | x <- toSplit]
partialExplode (acc, [])           = [(acc, [])]
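One round of steps 3 to 7 might be sketched like this (splitSortGroup and collapse are illustrative names, not from any library):

import Data.Function (on)
import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

-- Step 3, ragged-safe as in the caveat: move each choice from the
-- next sublist into the accumulated prefix.
partialExplode :: ([Int], [[Int]]) -> [([Int], [[Int]])]
partialExplode (acc, toSplit:rest) = [(acc ++ [x], rest) | x <- toSplit]
partialExplode (acc, [])           = [(acc, [])]

-- Steps 3-5: split, sort on the prefix, group equal prefixes.
splitSortGroup :: [([Int], [[Int]])] -> [[([Int], [[Int]])]]
splitSortGroup =
  groupBy ((==) `on` fst) . sortBy (comparing fst) . concatMap partialExplode

-- Step 7: each group shares a prefix; keep it once with all suffixes.
collapse :: [([Int], [[Int]])] -> ([Int], [[[Int]]])
collapse g@((prefix, _):_) = (prefix, map snd g)
collapse []                = error "groupBy yields non-empty groups"

Starting from step 2, map collapse (splitSortGroup (map (\l -> ([], l)) input)) gives the [([Int], [[[Int]]])] of step 7, and each second component can then be exploded and sorted group by group.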
-- Cartesian product over a list of lists
-- http://www.cs.nott.ac.uk/~gmh/sudoku.lhs
-- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]]
cp :: [[a]] -> [[a]]
cp [] = [[]]
cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
change that to

cp (xs:xss) = [y:ys | ys <- cp xss, y <- xs]

to get something more memory friendly: with the recursive call as the outer generator, cp xss is consumed lazily one ys at a time, rather than being recomputed (or retained in full) for every y drawn from xs. The results arrive in a different order, which is harmless here because the output is sorted afterwards anyway; see the sketch below.
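For illustration (cp' being an ad-hoc name for the swapped version), both variants enumerate the same lists, just in different orders:

cp' :: [[a]] -> [[a]]
cp' [] = [[]]
cp' (xs:xss) = [y:ys | ys <- cp' xss, y <- xs]

-- cp  [[1,2],[3,4]] --> [[1,3],[1,4],[2,3],[2,4]]
-- cp' [[1,2],[3,4]] --> [[1,3],[2,3],[1,4],[2,4]]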