
Am Montag 04 Januar 2010 02:35:04 schrieb Peter Green:
I am using the external-sort package to sort my output in the program below. I made this choice because my output dataset [[Int]] can be large (e.g. >3M items, each [Int]).
What my program does:
(1) Reads in a file containing 'compressed lists' which look like so:
8+12,11,7+13,10 1+2+3,1+9,3+6,4 . .
One compressed list per line. These compressed lists are parsed to become [[[Int]]]
[[[8,12],[11],[7,13],[10]], [[1,2,3],[1,9],[3,6],[4]], . . ]
Generally files of compressed lists have lengths of ~10,000 lines.
(2) Compressed lists are exploded to [[int]] via concatMap Cartesian Product over [[[Int]]], so we end up with [[Int]]
[[8, 11, 7, 10], [8, 11, 13, 10], [12, 11, 7, 10], [12, 11, 13, 10], [1, 1, 3, 4], . . [3, 9, 6, 4]]
I have a different idea. You want to consume the data in order, don't you? You can interleave sorting and exploding then: 1. read in data (~10,000 lists of lists. In your examples, each line contains a list of four short lists, the nested lists containing 1-3 elements. I hope that's not too far from reality.) 2. map (\l -> ([],l)) to get [([],[[8,12],[11],[7,13],[10]]), ([],[[1,2,3],[1,9],[3,6], [4]]), ... ] 3. split first sublists of second component and append to the first component (empty list) concatMap (\(acc,(toSplit:rest)) -> [(acc ++ [x],rest) | x <- toSplit]) [([],[[8,12],[11],[7,13],[10]]), ([],[[1,2,3],[1,9],[3,6],[4]]), . . ] ~> [([8],[[11],[7,13],[10]]), ([12],[[11],[7,13],[10]]), ([1],[[1,9],[3,6],[4]]), ([2],[[1,9],[3,6],[4]]), ([3],[[1,9],[3,6],[4]]), . . ] A list of ~30,000(?) ([Int],[[Int]]) pairs 4. sortBy (comparing fst) {-# import Data.Ord (comparing) #} 5. groupBy ((==) `on` fst) {-# import Data.Function (on) #-} You now have something like [ [([1],[[1,9],[3,6],[4]]), ([1],[[?],[?],[?]]), . . ], [([2],[[1,9],[3,6],[4]]), ([2],[[?],[?],[?]]), . . ], . . [([131],[[?],[?],[?]])], ([131],[?]), . . ] ] 6. If necessary, repeat 3. to 5. on each of the groups to get [ [ [([1,1],[[3,6],[4]]), ([1,1],[[?],[?]]), . . ], [([1,2],[[?],[?]]), ([1,2],[[?],[?]]), . . ], . . ] ] concat to get [ [([1,1],...), . . ], [([1,2],...), . . ], . . ] , iterate until exploding in step 8. produces something manageable. 7. map (\g@((h,_):_) -> (h, map snd g)) You now have something of type [([Int],[[[Int]]])], [([1],[ [[1,9],[3,6],[4]], [[?],[?],[?]],...]), ([2],[[[1,9],[3,6],[4]],[[?],[?],[?]],...]), . . ([131],[[[?],[?],[?]],...]) ] or [([1,1],[ [[3,6],[4]], [[?],[?]],...]), ([1,2],[ [[?],[?]], [[?],[?]],...]), . . ] 8. explode second components in each of the pairs, sort, append to first component, consume. Thanks to laziness, the [[[Int]]] of the second group will only be exploded after the first group has been consumed. One caveat: if the lines do not always contain lists of equal length, you must not repeat steps 3.-6. more often than the shortest length or modify the function in 3. to handle that, e.g. partialExplode (acc,toSplit:rest) = [(acc + [x],rest) | x <- toSplit] partialExplode (acc,[]) = [(acc,[])]
-- Cartesian Product over a List of Lists -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs -- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5], [2,3,6]] cp :: [[a]] -> [[a]] cp [] = [[]] cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
change that to cp (xs:xss) = [y:ys | ys <- cp xss, y <- xs] to get something more memory friendly.