
On Tuesday 15 February 2011 20:15:54, Andrew Coppin wrote:
I tried -O2 -fno-cse. No difference.
I also tried -O2 -fno-full-laziness. BIG DIFFERENCE.
See also the very old GHC ticket at http://hackage.haskell.org/trac/ghc/ticket/917
I don't know if that's the problem or not, but it might plausibly be.
Here's the smallest version of the program that I could come up with [which still misbehaves]:
module Main (main) where
import System.IO import System.Random
main = do file_batch "01-Uniform" random_byte_uniform
random_byte_uniform :: IO Int random_byte_uniform = randomRIO (0x00, 0xFF)
random_file :: String -> Int -> IO Int -> IO () random_file f n rnd = do putStrLn $ "Save: " ++ f ++ " [" ++ show n ++ " bytes]" h <- openFile f WriteMode hSetBinaryMode h True mapM_ (\ _ -> rnd >>= hPutChar h . toEnum) [1..n] hClose h
file_batch :: String -> IO Int -> IO () file_batch f rnd = mapM_ (\ k -> mapM_ (\ n -> random_file (f ++ "-" ++ show k ++ "x-" ++ [n]) (10 * 1024 * 1024 * k) rnd ) "ABCD" ) [1..4]
If main calls random_file directly, the program seems to work OK, so the problem seems to be file_batch. Maybe.
Or, one could say, the problem is the export list :) If you remove the export list, so that random_file is exported, the leak disappears (at least with 7.0.1, didn't test 6.12). If nothing but main is exported, GHC can be much more aggressive with inlining, and it is. The result is that the list [1 .. 10*1024*1024*k] from the penultimate line of random_file is shared between the four iterations of the inner loop in file_batch (for k = 1 .. 4). Oops. If random_file is exported, it is too big to inline it (at least if you don't specifically ask for it), so you get no sharing (even better, GHC rewrites mapM_ (\ _ -> rnd >>= hPutChar h . toEnum) [1..n] to a nice loop, the list isn't constructed at all).
I don't really know. I had a go at playing with -ddump-simpl, but that just generated a 8 KB file which is utterly incomprehensible. (Well, the -O0 variant is just about comprehensible. The -O2 variant isn't.
You have to look for interesting stuff (in this case the list [1 .. n]) and note its identifier (yes, coping with the identifiers in core is hard, especially when they are entirely compiler-generated and don't start with a source-code name), then see how it is used.
But it appears that *everything* gets inlined into main...)
That's kind of the point of module Main (main) where Sometimes that's good, other times not.
If anybody can figure out what's happening here, I'd be interested to know.
Due to the extensive inlining, GHC sees that some values are reused, so it decides to share those values instead of recomputing them. Unfortunately, those values are long lists. Making GHC look at smaller chunks of the code prevents that, as does turning off full-laziness (in both cases the let-binding of the list doesn't get floated out of random_file, that floating [more precisely, the resulting sharing] is what causes the leak). Which makes me wonder: unwanted sharing of lists [1 .. n] or similar is a frequent cause of space leaks, so would it be possible to teach GHC to not share such lists (unless they're bound to a name to indicate sharing is wanted)? In particular for enumerations [a .. b] of type [Int], [Integer] or similar, I'm pretty sure that the cost of recomputation is far outweighed by the memory consumption of sharing in almost all cases.