
Yes, getting the right amount of strictness--and in the right places--can be tricky. For your program, it seems you should process each file completely (reflecting its contents strictly in your histogram, so the contents can be dropped and the file descriptor closed) before moving on to the next file. This would reduce the number of file descriptors needed to one and limit memory requirements to O(m), where m is the maximum size of a file. In addition, you should process each file incrementally, reducing the memory requirements to O(1). Specific comments follow.

"Ketil Z. Malde" wrote:
Dean Herington writes:

"Ketil Z. Malde" wrote:
-- | add data from a file to the histogram
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return (addHist fm x)
I changed this to read x strictly, but it turned out that wasn't quite enough. See below.
Why read a file's contents strictly? What's important is to tabulate its contribution to the histogram strictly. Perhaps return $! addHist fm x might be enough to achieve that (*).
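For concreteness, that change would look something like this (a sketch only; everything except the last line is your addFile as quoted above):

addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return $! addHist fm x    -- force the updated map (to WHNF) before returning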
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
    where add1 f w = case lookupFM f w of
                       Just n  -> addToFM f w (n+1)
                       Nothing -> addToFM f w 1
You should do the counting strictly (note that a case with a plain variable pattern does not force the scrutinee; use seq or $! instead):

    Just n -> addToFM f w $! (n+1)

Your current limitation on the size of a directory that your program can handle may be due to these unevaluated thunks representing the (n+1) calculations.
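A further tightening that might matter (an untested sketch): foldl itself also builds a chain of map thunks, one per word, so you may want the fold over the words to be strict in the accumulated map as well -- foldl' from Data.List, or a hand-rolled strict fold if your library lacks it:

import Data.List (foldl')

addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl' add1 fm . words
    where add1 f w = case lookupFM f w of
                       Just n  -> addToFM f w $! (n+1)
                       Nothing -> addToFM f w 1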
The string won't be evaluated until the finite map is demanded. None of the code you've shown so far makes that demand.
[I'll leave my ramblings here, skip to the ==== for the solution]
Right. But if I do
f <- addFile emptyFM "foo"
then no IO should be performed either, right?
Right, except that, as Simon M. mentioned, the file is opened so that any opening exceptions are triggered.
And when I later do
print (head (fmToList f))
the whole FM needs to be computed, no?
Perhaps. You're only demanding the head of the list. Conceivably, the FM logic might be able to determine the lowest key/element pair without evaluating the entire map.
If this is unsound (too lazy) in any way, I don't see it. How much memory can I expect an FM to consume, anyway? I would expect something like
(log n * sizeof(branch node)) + n*sizeof(key+element)
Which should be roughly equivalent to n*sizeof(key+element), which in my case (building a frequency count for words) should be roughly equivalent to the number of different words times their average length (times 8 bytes per character in a list). Is this far off?
Experiments show that when I do x <- addDir emptyFM on a directory containing 13Mb of files, the process grows to about 400Mb. Surprisingly, asking for the head requires a lot more memory and takes considerable time.
=====
So, what's really happening? Well, it seems the files were read strictly, but the FMs weren't constructed yet. So the result was a lot of files residing in memory, stored as expansive [Char]s. Changing addFile to:
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    h <- openFile name ReadMode
    x <- hGetContents h
    let f = addHist fm x
    hClose (f `seq` h)    -- thanks to Peter Thiemann
    return f
I find the above approach a bit risky, as you are closing the file after having only shallowly demanded the result of addHist. My earlier suggestion, return $! addHist fm x, makes exactly the same shallow demand, but if that demand is insufficient, loses performance but not correctness. I would recommend letting the runtime system close the file and just being careful to read to end of file before moving on to the next file.
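Concretely, one way to follow that advice (an untested sketch, building on the return $! version above): force the tabulation, then make sure you have actually read to the end of the file before moving on, and let the runtime reclaim the handle on its own.

addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    f <- return $! addHist fm x    -- shallow demand on the new map
    length x `seq` return f        -- ensure the whole file has been read before moving on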
With this, I can at least plough through the 13Mb of files comfortably (after fixing a bug in my overlong-word elimination code), but a directory with 50Mb is still out of reach.
When you get the strictness "right", you should suffer no such limit.
What is the limit on open files, and why? I think it'd be nice to just schedule a huge number of IO operations and have them all be performed when required (i.e. when the FM is first accessed). Apparently, my addDir took the trouble to open the files, but not to generate the FMs -- any idea why?
Finally, the nice thing about Haskell is that the frequency counting code was written and tested in less than five minutes. The downside is that getting the IO and strictness right took around five hours. :-/
(*) I've had problems myself with the use of finite maps without sufficient strictness. Perhaps the FiniteMap interface should provide some explicit support for strictness. Anybody have thoughts on that?

-- Dean
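P.S. To make the idea in (*) concrete, here is a rough, untested sketch of the kind of operation I have in mind -- a hypothetical strict update, built only from lookupFM and addToFM, that forces the new element before storing it, so no thunk chain accumulates inside the map:

addToFM_S :: Ord key => (elt -> elt -> elt) -> FiniteMap key elt -> key -> elt -> FiniteMap key elt
addToFM_S combine fm k e =
    case lookupFM fm k of
      Just old -> addToFM fm k $! combine old e
      Nothing  -> addToFM fm k e

With something like that available, add1 in addHist would shrink to: add1 f w = addToFM_S (+) f w 1.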