
On 04/01/2010, at 10:43 AM, Stephen Blackheath [to Haskell-Beginners] wrote:
Peter,
I can't see anything in your code that stops it being lazy, and I also read the source for external-sort. It all looks OK except for one bit. There's even a comment that says "It would be better if I changed Ord for blocks to only check the first element." Looks like it's comparing the whole list instead of the first element when merging the lists, and so it's a bit difficult to guess how much it is evaluating of each list. If there are many repeated elements, it could be a lot.
You could try fixing that. You could hack kMerge using a newtype and an Ord instance (since Splay takes Ord instances).
Steve
Thanks Steve. I think I understand your point. It seems like external- sort needs patching to make block comparison more efficient for everyone and that my case is just particularly extreme/degenerate. Just right now, I'm not yet confident enough about my Haskell abilities to go and create a new type for 'block' with more efficient Ord and then modify the rest of the external-sort code accordingly. Posted program was the 2nd Haskell prog I've every written. I'll get there in time, though :). I wonder if there is another way of making my code work better with the existing externalSort function: I am sorting [[Int]], an example of which might be: [[8, 11, 7, 10], [8, 11, 13, 10], [12, 11, 7, 10], [12, 11, 13, 10], [1, 1, 3, 4], . . [3, 9, 6, 4]] I need numerical lex sorting, so I can't just map directly to ByteString as this would result in "12" collating before "1", etc. I know that each sublist I am sorting never contains an integer greater than 20. I could encode each sublist as a ByteString where 1 maps to a, 2 maps to B, 3, maps to C, etc. So would have [1,1,3,4] -> "aacd" (ByteString), etc. Hopefully Ord comparisons between ByteStrings are much less expensive than comparisons between instances of [Int] and I will see nicer behaviour since external-sort does work as advertised on lists of millions of Ints? So I could sort these, then map back to underlying Int data when I output. Schwartzian transform type stuff. Anyway, I'll give it a go.
Peter Green wrote:
I am using the external-sort package to sort my output in the program below. I made this choice because my output dataset [[Int]] can be large (e.g. >3M items, each [Int]).
What my program does:
(1) Reads in a file containing 'compressed lists' which look like so:
8+12,11,7+13,10 1+2+3,1+9,3+6,4 . .
One compressed list per line. These compressed lists are parsed to become [[[Int]]]
[[[8,12],[11],[7,13],[10]], [[1,2,3],[1,9],[3,6],[4]], . . ]
Generally files of compressed lists have lengths of ~10,000 lines.
(2) Compressed lists are exploded to [[int]] via concatMap Cartesian Product over [[[Int]]], so we end up with [[Int]]
[[8, 11, 7, 10], [8, 11, 13, 10], [12, 11, 7, 10], [12, 11, 13, 10], [1, 1, 3, 4], . . [3, 9, 6, 4]]
These 'exploded lists are' *much* longer than the input lists. It's common for them to have >100K [Int] members, or even >1M [Int] members.
(3) This [[Int]] data must be sorted in to lexicographic order and output as CSV data: i.e. output should be:
1,1,3,4 3,9,6,4 8,11,7,10 . . 12,11,13,10
There is no way to avoid the necessity of sorting the final product. I can *not* take advantage of any structure in the input data to avoid this sort stage. e.g. sorting the (smaller) input compressed lists *will not* obviate the need to lex sort the (larger) final output.
The program listing below works correctly, but *is not* space- efficient. In fact it is *less* space-efficient than the equivalent program using Prelude sort!
e.g. with one data set which explodes to ~3.7M I am seeing 639MB of total memory use.
I am sure externalSort is not at fault because: (a) I'm the Newbie. (b) ExternalSort.bin is 'only' 228MB in size. (Expected to be not small because sorting a [[Int]] with 3.7M [Int] members.)
I'm sure it must be something I'm doing wrong elsewhere which is causing the entire output data list to be read into memory.
What I really want to happen is:
(1) Data lazily read in from stdin and lazily parsed (2) Data lazily sucked out to disk and strictly sorted by externalSort (but no big deal since this happening on disk) (3) Data lazily sucked back out of externalSort, lazily formatted for output, and lazily written to stdout.
So memory usage should hopefully never go over (say 10MB). But in fact I'm using 638MB. So somewhere in the code below I must be doing something very wrong.
I hope someone can tell me what I am doing wrong here.
TIA!
import Algorithms.ExternalSort import Data.List.Split (splitOn) import System.IO
-- Cartesian Product over a List of Lists -- Http://www.cs.nott.ac.uk/~gmh/sudoku.lhs -- cp [[1,2],[3],[4,5,6]] --> [[1,3,4],[1,3,5],[1,3,6],[2,3,4],[2,3,5],[2,3,6]] cp :: [[a]] -> [[a]] cp [] = [[]] cp (xs:xss) = [y:ys | y <- xs, ys <- cp xss]
-- fromCSV ["8+12,11,7+13,10", "1+2+3,1+9,3+6,4"] --> -- [[[8,12],[11],[7,13],[10]],[[1,2,3],[1,9],[3,6],[4]]] fromCSV :: [String] -> [[[Int]]] fromCSV = map parseOneLine where parseOneLine = map parseGroup . splitOn "," where parseGroup = map read . splitOn "+"
-- explode [[[1,2],[3],[4,5,6]], [[1, 2], [14,15], [16]]] --> [[1,3,4],[1,3,5], -- [1,3,6],[2,3,4],[2,3,5],[2,3,6],[1,14,16],[1,15,16],[2,14,16], [2,15,16]] explode :: [[[a]]] -> [[a]] explode = concatMap cp
-- toSingles "8+12,11,7+13,10\n1+2+3,1+9,3+6,4" --> -- [[8,11,7,10],[8,11,13,10],[12,11,7,10],[12,11,13,10],[1,1,3,4], [1,1,6,4], -- [1,9,3,4],[1,9,6,4],[2,1,3,4],[2,1,6,4],[2,9,3,4],[2,9,6,4], [3,1,3,4], -- [3,1,6,4],[3,9,3,4],[3,9,6,4]] toSingles :: String -> [[Int]] toSingles = explode . fromCSV . lines
-- toCSV [8,11,7,10,12] --> "8,11,7,10,12" toCSV :: (Show a) => [a] -> String toCSV = tail . init . show
main = do getContents >>= externalSort . toSingles >>= mapM_ (putStrLn . toCSV)
_______________________________________________ Beginners mailing list Beginners@haskell.org http://www.haskell.org/mailman/listinfo/beginners