RE: Yet another weakly defined bug report

The following code runs out of file handles, so I am seeking enlightenment as to why.
-- | add data from a file to the histogram
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return (addHist fm x)
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
Running addDir on a directory with a few hundred files triggers it. Shouldn't all of each file's contents be consumed by each addFile, and thus the handle be released?
It's not possible to tell from this code whether the readFile calls will be fully evaluated or not: it depends on how much evaluation addHist does, and to what extent the resulting FiniteMap is demanded, amongst other things. These things are always tricky to understand, which is why I recommend not using lazy I/O. File reading is not a pure operation: running out of file descriptors is a good counter-example.

Cheers, Simon
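For what it's worth, a minimal sketch of the non-lazy alternative being recommended here, assuming the addHist from the original post (the name addFileNoLazyIO and the line-by-line structure are my own, not part of the thread):

import System.IO
import Data.FiniteMap   -- FiniteMap, lookupFM, addToFM (GHC's finite-map library)

-- Read the file line by line through an explicit handle, folding each line
-- into the histogram as it is read, and close the handle before returning.
-- ($! only forces the map to WHNF; strictness of the counts themselves is a
-- separate concern, discussed further down the thread.)
addFileNoLazyIO :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFileNoLazyIO fm name = do
    h <- openFile name ReadMode
    let loop m = do eof <- hIsEOF h
                    if eof
                       then return m
                       else do l <- hGetLine h
                               loop $! addHist m l
    f <- loop fm
    hClose h
    return f

This keeps at most one descriptor open at a time, whatever the rest of the program demands.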

"Simon Marlow"
-- | add data from a file to the histogram
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return (addHist fm x)
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
It's not possible to tell from this code whether the readFile calls will be fully evaluated or not: it depends on how much evaluation addHist does, and to what extent the resulting FiniteMap is demanded, amongst other things.
Of course. I should never cut code, I suppose; I thought the monadic stuff would be where my understanding was weakest, not this:
addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
  where add1 f w = case lookupFM f w of
                     Just n  -> addToFM f w (n+1)
                     Nothing -> addToFM f w 1
I felt pretty sure that this would evaluate the string to the end. Am I wrong?
These things are always tricky to understand, which is why I recommend not using lazy I/O. File reading is not a pure operation: running out of file descriptors is a good counter-example.
Okay. Perhaps renaming "readFile" to "unsafeReadFile"? :-)

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

"Ketil Z. Malde" wrote:
"Simon Marlow"
writes: -- | add data from a file to the histogram addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int) addFile fm name = do x <- readFile name return (addHist fm x)
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
It's not possible to tell from this code whether the readFile calls will be fully evaluated or not: it depends on how much evaluation addHist does, and to what extent the resulting FiniteMap is demanded, amongst other things.
Of course. I should never cut code, I suppose; I thought the monadic stuff would be where my understanding was weakest, not this:
addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
  where add1 f w = case lookupFM f w of
                     Just n  -> addToFM f w (n+1)
                     Nothing -> addToFM f w 1
I felt pretty sure that this would evaluate the string to the end. Am I wrong?
The string won't be evaluated until the finite map is demanded. None of the code you've shown so far makes that demand. -- Dean
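To make the point concrete, a hedged sketch (the directory name "corpus" is illustrative; addDir and emptyFM are as in the thread, sizeFM is FiniteMap's size function):

main :: IO ()
main = do
    fm <- addDir emptyFM "corpus"   -- opens each file, but the counting is all thunks
    print (sizeFM fm)               -- first demand on the map: the files are actually
                                    -- read and counted only here, at the very end

Until something like the final print happens, nothing obliges the strings to be consumed, so the already-opened handles stay around.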

Dean Herington writes:
"Ketil Z. Malde" wrote:
-- | add data from a file to the histogram
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return (addHist fm x)
I changed this to read x strictly, but it turned out that wasn't quite enough. See below.
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
  where add1 f w = case lookupFM f w of
                     Just n  -> addToFM f w (n+1)
                     Nothing -> addToFM f w 1
The string won't be evaluated until the finite map is demanded. None of the code you've shown so far makes that demand.
[I'll leave my ramblings here, skip to the ==== for the solution]

Right. But if I do

    f <- addFile emptyFM "foo"

then no IO should be performed either, right? And when I later do

    print (head (fmToList f))

the whole FM needs to be computed, no? If this is unsound (too lazy) in any way, I don't see it.

How much memory can I expect an FM to consume, anyway? I would expect something like

    (log n * sizeof(branch node)) + n * sizeof(key+element)

which should be roughly equivalent to n * sizeof(key+element), which in my case (building a frequency count for words) should be roughly equivalent to the number of different words times their average length (times 8 bytes per character in a list). Is this far off?

Experiments show that after doing x <- addDir emptyFM on a directory containing 13Mb of files, the process grows to about 400Mb. Surprisingly, asking for the head requires a lot more memory and takes considerable time.

=====

So, what's really happening? Well, it seems the files were read strictly, but the FMs weren't constructed yet. So the result was a lot of files residing in memory, stored as expensive [Char]s. Changing addFile to:
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    h <- openFile name ReadMode
    x <- hGetContents h
    let f = addHist fm x
    hClose (f `seq` h)    -- thanks to Peter Thiemann
    return f
With this (and after fixing a bug in my overlong-word elimination code), I can at least plough through the 13Mb of files comfortably, but a directory with 50Mb is still out of reach. What is the limit on open files, and why? I think it'd be nice to just schedule a huge amount of IO operations, and have them all be performed when required (i.e. when the FM is first accessed). Apparently, my addDir took the trouble to open the files, but not to generate the FMs -- any idea why?

Finally, the nice thing about Haskell is that the frequency counting code was written and tested in less than five minutes. The downside is that getting the IO and strictness right took around five hours. :-/

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Yes, getting the right amount of strictness--and in the right places--can be tricky. For your program, it seems you should process each file completely (reflecting its contents strictly in your histogram so the contents can be dropped and the file descriptor closed) before moving on to the next file. This would reduce the number of file descriptors needed to one and limit memory requirements to O(m), where m is the maximum size of a file. In addition, you should process each file incrementally, reducing the memory requirements to O(1). Specific comments follow.

"Ketil Z. Malde" wrote:
Dean Herington writes:

"Ketil Z. Malde" wrote:
-- | add data from a file to the histogram
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return (addHist fm x)
I changed this to read x strictly, but it turned out that wasn't quite enough. See below.
Why read a file's contents strictly? What's important is to tabulate its contribution to the histogram strictly. Perhaps return $! addHist fm x might be enough to achieve that (*).
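Spelled out, that suggestion would be (a sketch; $! forces the map only to weak head normal form, so whether it is enough still depends on what addHist does with the string):

addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    return $! addHist fm x   -- force the resulting map (to WHNF) before returning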
-- | add data from all files in a directory to the histogram
addDir :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addDir fm dir = do
    dc <- getDirectoryContents dir
    fs <- filterM doesFileExist (map ((dir++"/")++) dc)
    foldM addFile fm fs
addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
  where add1 f w = case lookupFM f w of
                     Just n  -> addToFM f w (n+1)
                     Nothing -> addToFM f w 1
You should do the counting strictly:

    Just n -> case n+1 of n1 -> addToFM f w n1

Your current limitation on the size of a directory that your program can handle may be due to these unevaluated thunks representing (n+1) calculations.
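A hedged sketch of addHist with the counts forced; I've written it with $! rather than the bare variable pattern above, since in Haskell source a case with only a variable pattern does not by itself evaluate its scrutinee:

addHist :: FiniteMap String Int -> String -> FiniteMap String Int
addHist fm = foldl add1 fm . words
  where add1 f w = case lookupFM f w of
                     Just n  -> addToFM f w $! n+1   -- evaluate the count now, so no
                                                     -- chains of (n+1) thunks build up
                     Nothing -> addToFM f w 1

(The foldl can still accumulate thunks of intermediate maps; a stricter fold is a further possible refinement.)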
The string won't be evaluated until the finite map is demanded. None of the code you've shown so far makes that demand.
[I'll leave my ramblings here, skip to the ==== for the solution]
Right. But if I do
f <- addFile emptyFM "foo"
then no IO should be performed either, right?
Right, except that, as Simon M. mentioned, the file is opened so that any opening exceptions are triggered.
And when I later do
print (head (fmToList f))
the whole FM needs to be computed, no?
Perhaps. You're only demanding the head of the list. Conceivably, the FM logic might be able to determine the lowest key/element pair without evaluating the entire map.
If this is unsound (too lazy) in any way, I don't see it. How much memory can I expect an FM to consume, anyway? I would expect something like
(log n * sizeof(branch node)) + n*sizeof(key+element)
Which should be roughly equivalent to n*sizeof(key+element), which in my case (building a frequency count for words) should be roughly equivalent to the number of different words times their average length (times 8 bytes per character in a list). Is this far off?
Experiments show that after doing x <- addDir emptyFM on a directory containing 13Mb of files, the process grows to about 400Mb. Surprisingly, asking for the head requires a lot more memory and takes considerable time.
=====
So, what's really happening? Well, it seems the files were read strictly, but the FMs weren't constructed yet. So the result was a lot of files residing in memory, stored as expensive [Char]s. Changing addFile to:
addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    h <- openFile name ReadMode
    x <- hGetContents h
    let f = addHist fm x
    hClose (f `seq` h)    -- thanks to Peter Thiemann
    return f
I find the above approach a bit risky, as you are closing the file after having only shallowly demanded the result of addHist. My earlier suggestion, return $! addHist fm x, makes exactly the same shallow demand, but if that demand is insufficient, loses performance but not correctness. I would recommend letting the runtime system close the file and just being careful to read to end of file before moving on to the next file.
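A sketch of the "read to end of file" part of that recommendation, assuming readFile (on its own it still builds the map lazily, so it wants to be combined with the strictness fixes above):

addFile :: FiniteMap String Int -> String -> IO (FiniteMap String Int)
addFile fm name = do
    x <- readFile name
    let f = addHist fm x
    length x `seq` return f   -- force the whole string: end of file is reached, so
                              -- the runtime can close the descriptor, with no hClose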
With this (and after fixing a bug in my overlong-word elimination code), I can at least plough through the 13Mb of files comfortably, but a directory with 50Mb is still out of reach.
When you get the strictness "right", you should suffer no such limit.
What is the limit on open files, and why? I think it'd be nice to just schedule a huge amount of IO operations, and have them all be performed when required (i.e. when the FM is first accessed). Apparently, my addDir took the trouble to open the files, but not generate the FMs -- any idea why?
Finally, the nice thing about Haskell is that the frequency counting code was written and tested in less than five minutes. The downside is that getting the IO and strictness right took around five hours. :-/
(*) I've had problems myself with the use of finite maps without sufficient strictness. It seems that perhaps the FiniteMap interface should provide some explicit support for strictness. Anybody have thoughts on that? -- Dean
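One shape such support could take (a sketch only; addToFM_C' is a hypothetical name, defined here on top of the existing interface): an insert-with-combine that forces the element it stores.

-- Hypothetical strict variant of addToFM_C, built on the existing FiniteMap API.
addToFM_C' :: Ord key => (elt -> elt -> elt)
           -> FiniteMap key elt -> key -> elt -> FiniteMap key elt
addToFM_C' combine fm k x =
    case lookupFM fm k of
      Just old -> addToFM fm k $! combine old x   -- force the combined element
      Nothing  -> addToFM fm k $! x

The histogram update would then collapse to: add1 f w = addToFM_C' (+) f w 1.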

Dean Herington writes:
Yes, getting the right amount of strictness--and in the right places--can be tricky.
Tell me about it!
You should do the counting strictly:
Just n -> case n+1 of n1 -> addToFM f w n1
Thanks for the tip. Just performing this change didn't perceptibly change anything, though. It seems to be working all right now, but to my amazement, it may have been upgrading to 5.04.2 that did it?! (I had a lot of heapCensus trouble) Is this at all possible?
Right, except that, as Simon M. mentioned, the file is opened so that any opening exceptions are triggered.
Yeah. I keep forgetting the IO monad is imperative.
Perhaps. You're only demanding the head of the list. Conceivably, the FM logic might be able to determine the lowest key/element pair without evaluating the entire map.
I find the above approach a bit risky, as you are closing the file after having only shallowly demanded the result of addHist. My earlier suggestion, return $! addHist fm x, makes exactly the same shallow demand, but if that demand is insufficient, loses performance but not correctness.
..but then I could equally well use readFile, couldn't I?

Anyway, thanks for all the suggestions, all of you.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants

Just a quick status report, and to note a couple of lessons learned:

Things work adequately, as far as I can tell. I can now process heaps of data without blowing up anything. It appears to be faster than spam-stat.el, at least, although I haven't measured.

- I'm back to using "readFile" for file IO, and it works nicely, as long as I make sure all of the file is processed. I think this is a good way of processing large amounts of data (where the processing reduces the data size); reading the entire file into memory strictly would quickly become too costly (expanded to linked lists of unicode, ugh).

- Don't trust FiniteMap to evaluate anything. I have evidence that one of the major space leaks was the FM evaluating the strings used as keys only to the point where they were proved unique. (Is this right?) Strictifying the strings helped a lot.

One question, though, about hFlush. I print out the status by repeatedly putStr'ing "blah blah \r". With NoBuffering set, it works, but when following the putStr with 'hFlush stdout' it doesn't (it only outputs very sporadically). I guess I'm misunderstanding the function of hFlush -- anybody care to elaborate?

And a final lesson: unlike cockroaches, computer bugs hide in the light as well as in the darkness. One bug in the very trivial token parsing code caused a lot of words that should have been ignored to be included.

Thanks to everybody who helped out.

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants
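On the hFlush question, a minimal sketch of how it is normally paired with such a progress line (the function name and message are illustrative; with the default block buffering nothing appears until the buffer fills, hence the explicit flush -- why it behaved sporadically in the reported case I can't say):

import System.IO

-- Overwrite the same terminal line with a progress message.
progress :: Int -> IO ()
progress n = do
    putStr ("processed " ++ show n ++ " files\r")
    hFlush stdout   -- push the partial, un-newline-terminated line out now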

These things are always tricky to understand, which is why I recommend not using lazy I/O. File reading is not a pure operation: running out of file descriptors is a good counter-example.
Without saying whether I agree with lazy I/O or not, I suggest that this particular problem may also be due to the runtime implementation: if one tries to open a new file but the file descriptors are exhausted, the runtime system should first trigger a full GC, because that might free up some descriptors -- but maybe GHC is already doing this?

This brings me to another "issue". It would be nice if we could create foreign pointers with a resource estimate. That is, a foreign pointer to a bitmap might have a high resource estimate. This estimate could be used by the garbage collector to collect more aggressively as the "foreign" resources fill up. I haven't fleshed out a complete design yet, but it should somehow be possible to express that file descriptors have a rather high resource estimate, as they run out after only about 200 have been allocated.

All the best,
Daan.
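A user-level approximation of the first idea (a sketch only, not what the RTS itself does; it assumes GHC's System.Mem.performGC and the Haskell 98 IOError catch from the Prelude): on a failed open, run a GC in the hope that finalizers release stale descriptors, then retry once.

import System.IO
import System.Mem (performGC)

openFileRetryGC :: FilePath -> IOMode -> IO Handle
openFileRetryGC path mode =
    openFile path mode `catch` \_ -> do
        performGC                 -- may run finalizers that close stale Handles
        openFile path mode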
participants (4)
- Daan Leijen
- Dean Herington
- ketil@ii.uib.no
- Simon Marlow