
Dean Herington wrote:
"Ketil Z. Malde" wrote:
-- | add data from a file to the histogram
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return (addHist fm x)
I changed this to read x strictly, but it turned out that wasn't quite enough. See below.
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
  where add1 f w = case lookupFM f w of
                     Just n  -> addToFM f w (n+1)
                     Nothing -> addToFM f w 1
The string won't be evaluated until the finite map is demanded. None of the code you've shown so far makes that demand.
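For example (a sketch, assuming the definitions above plus the old Data.FiniteMap imports; the demo wrapper is only illustrative), nothing gets read until something like the print below actually inspects the map:

    demo :: IO ()
    demo = do
        fm <- addFile emptyFM "foo"           -- builds only a thunk so far
        print (sum (map snd (fmToList fm)))   -- summing the counts forces the
                                              -- whole map, and hence the file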
[I'll leave my ramblings here, skip to the ===== for the solution]

Right. But if I do f <- addFile emptyFM "foo" then no IO should be performed either, right? And when I later do print (head (fmToList f)) the whole FM needs to be computed, no? If this is unsound (too lazy) in any way, I don't see it.

How much memory can I expect an FM to consume, anyway? I would expect something like n*sizeof(branch node) + n*sizeof(key+element), which should be roughly equivalent to n*sizeof(key+element); in my case (building a frequency count for words) that is roughly the number of different words times their average length (times 8 bytes per character in a list). Is this far off?

Experiments show that after doing x <- addDir emptyFM dir on a directory containing 13MB of files, the process grows to about 400MB. Surprisingly, asking for the head requires a lot more memory and takes considerable time.

=====

So, what's really happening? Well, it seems the files were read strictly, but the FMs weren't constructed yet. So the result was a lot of files residing in memory, stored as expansive [Char]s. Changing addFile to:
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    h <- openFile name ReadMode
    x <- hGetContents h
    let f = addHist fm x
    hClose (f `seq` h)          -- thanks to Peter Thiemann
    return f
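(The trick, roughly, is that f `seq` h forces the new histogram to weak head normal form before yielding the handle, so the lazy hGetContents has consumed the whole file by the time hClose actually closes it. A more explicit, roughly equivalent spelling of the last three lines would be:)

    let f = addHist fm x
    f `seq` hClose h    -- force the histogram first, then close the handle
    return f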
With this, I can at least plough through the 13MB of files comfortably (after fixing a bug in my overlong-word elimination code), but a directory with 50MB is still out of reach.

What is the limit on open files, and why? I think it'd be nice to just schedule a huge number of IO operations, and have them all be performed when required (i.e. when the FM is first accessed). Apparently, my addDir took the trouble to open the files, but not to generate the FMs -- any idea why?

Finally, the nice thing about Haskell is that the frequency-counting code was written and tested in less than five minutes. The downside is that getting the IO and strictness right took around five hours. :-/

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
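As for the remaining memory trouble, a further sketch (untested; addHist' is just an illustrative name, same old FiniteMap API): forcing each count as it is inserted keeps long chains of unevaluated (n+1) thunks from piling up inside the map, which may be part of what the larger directory is paying for:

    addHist' :: FiniteMap String Int -> String -> FiniteMap String Int
    addHist' fm = foldl add1 fm . words
      where add1 f w = case lookupFM f w of
                         Just n  -> let n' = n + 1 in n' `seq` addToFM f w n'
                         Nothing -> addToFM f w 1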