
My question on the ghc heap profiler on stack overflow: http://stackoverflow.com/questions/5306717/how-should-i-interpret-the-output... remains unanswered :-( Perhaps that's not the best forum. Is there someone here prepared to explain how the memory usage in the heap profiler relates to the "Live Bytes" count shown in the garbage collection statistics? Many thanks, Tim

On Mon, Mar 21, 2011 at 9:59 AM, Tim Docker wrote:
My question on the ghc heap profiler on stack overflow:
http://stackoverflow.com/questions/5306717/how-should-i-interpret-the-output...
remains unanswered :-( Perhaps that's not the best forum. Is there someone here prepared to explain how the memory usage in the heap profiler relates to the "Live Bytes" count shown in the garbage collection statistics?
I've long had the same questions and still don't know the answers, so I'm curious too. My guess was that the heap graph has the profiling overhead subtracted out, but I don't really know. The numbers don't line up with +RTS -s output, but they seem to go up and down by the same amount. Also, the time axis clearly isn't wall time. I'm guessing CPU time, but once again, it doesn't line up with getCPUTime output. It's actually kind of annoying, because to get a steady-state picture of heap usage the only thing I could think of was to write a several-second busy loop and look for the flat spot.
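In case it helps anyone, here is a minimal sketch of that busy-loop trick. The spinSeconds helper and its name are mine, not anything from GHC's libraries; it just burns CPU for roughly the requested number of seconds so the heap profile has an obvious flat region to read off.

import System.CPUTime (getCPUTime)

-- Spin for roughly the given number of CPU seconds.
-- getCPUTime reports picoseconds, hence the 1e12 factor.
spinSeconds :: Double -> IO ()
spinSeconds secs = do
    start <- getCPUTime
    let target = start + round (secs * 1e12)
        loop = do
            now <- getCPUTime
            if now < target then loop else return ()
    loop

Calling, say, spinSeconds 5 after building the data structure of interest should make its steady-state heap usage show up as a plateau in the profile.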

On Mon, Mar 21, 2011 at 9:59 AM, I wrote:
My question on the ghc heap profiler on stack overflow:
http://stackoverflow.com/questions/5306717/how-should-i-interpret-the-output...
remains unanswered :-( Perhaps that's not the best forum. Is there someone here prepared to explain how the memory usage in the heap profiler relates to the "Live Bytes" count shown in the garbage collection statistics?
I've made a little progress on this. I've simplified my program down to a simple executable that loads a bunch of data into an in-memory map, and then writes it out again. I've added calls to `seq` to ensure that laziness is not causing excessive memory consumption. When I run this on my sample data set, it takes ~7 cpu seconds and uses ~120 MB of vm. An equivalent python script takes ~2 secs and ~19 MB of vm :-(.

The code is below. I'm mostly concerned with the memory usage rather than performance at this stage. What is interesting is that when I turn on garbage collection statistics (+RTS -s), I see this:

  10,089,324,996 bytes allocated in the heap
     201,018,116 bytes copied during GC
      12,153,592 bytes maximum residency (8 sample(s))
      59,325,408 bytes maximum slop
             114 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 19226 collections,     0 parallel,  1.59s,  1.64s elapsed
  Generation 1:     8 collections,     0 parallel,  0.04s,  0.04s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.84s  (  5.96s elapsed)
  GC    time    1.63s  (  1.68s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.47s  (  7.64s elapsed)

  %GC time      21.8%  (22.0% elapsed)

  Alloc rate    1,726,702,840 bytes per MUT second

  Productivity  78.2% of total user, 76.5% of total elapsed

This seems strange. The maximum residency of 12MB sounds about correct for my data. But what's with the 59MB of "slop"? According to the ghc docs:

| The "bytes maximum slop" tells you the most space that is ever wasted
| due to the way GHC allocates memory in blocks. Slop is memory at the
| end of a block that was wasted. There's no way to control this; we
| just like to see how much memory is being lost this way.

There's this page also:

http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/Slop

but it doesn't really make things clearer for me.

Is the slop number above likely to be a significant contribution to net memory usage? Are there any obvious reasons why the code below could be generating so much?

The data file in question has 61k lines, and is <6MB in total.

Thanks,

Tim

-------- Map2.hs --------------------------------------------

module Main where

import qualified Data.Map as Map
import qualified Data.ByteString.Char8 as BS
import System.Environment
import System.IO

type MyMap = Map.Map BS.ByteString BS.ByteString

foldLines :: (a -> String -> a) -> a -> Handle -> IO a
foldLines f a h = do
    eof <- hIsEOF h
    if eof
      then return a
      else do
        l <- hGetLine h
        let a' = f a l
        a' `seq` foldLines f a' h

undumpFile :: FilePath -> IO MyMap
undumpFile path = do
    h <- openFile path ReadMode
    m <- foldLines addv Map.empty h
    hClose h
    return m
  where
    addv m "" = m
    addv m s  = let (k,v) = readKV s
                in k `seq` v `seq` Map.insert k v m

    readKV s = let (ks,vs) = read s in (BS.pack ks, BS.pack vs)

dump :: [(BS.ByteString,BS.ByteString)] -> IO ()
dump vs = mapM_ putV vs
  where
    putV (k,v) = putStrLn (show (BS.unpack k, BS.unpack v))

main :: IO ()
main = do
    args <- getArgs
    case args of
      [path] -> do
        v <- undumpFile path
        dump (Map.toList v)
        return ()

On Wednesday 23 March 2011 03:32:16, Tim Docker wrote:
On Mon, Mar 21, 2011 at 9:59 AM, I wrote:
My question on the ghc heap profiler on stack overflow:
http://stackoverflow.com/questions/5306717/how-should-i-interpret-the-output-of-the-ghc-heap-profiler
remains unanswered :-( Perhaps that's not the best forum. Is there someone here prepared to explain how the memory usage in the heap profiler relates to the "Live Bytes" count shown in the garbage collection statistics?
I've made a little progress on this. I've simplified my program down to a simple executable that loads a bunch of data into an in-memory map, and then writes it out again. I've added calls to `seq` to ensure that laziness is not causing excessive memory consumption. When I run this on my sample data set, it takes ~7 cpu seconds and uses ~120 MB of vm. An equivalent python script takes ~2 secs and ~19 MB of vm :-(.
The code is below. I'm mostly concerned with the memory usage rather than performance at this stage. What is interesting is that when I turn on garbage collection statistics (+RTS -s), I see this:
  10,089,324,996 bytes allocated in the heap
     201,018,116 bytes copied during GC
      12,153,592 bytes maximum residency (8 sample(s))
      59,325,408 bytes maximum slop
             114 MB total memory in use (1 MB lost due to fragmentation)
  Generation 0: 19226 collections,     0 parallel,  1.59s,  1.64s elapsed
  Generation 1:     8 collections,     0 parallel,  0.04s,  0.04s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.84s  (  5.96s elapsed)
  GC    time    1.63s  (  1.68s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.47s  (  7.64s elapsed)
%GC time 21.8% (22.0% elapsed)
Alloc rate 1,726,702,840 bytes per MUT second
Productivity 78.2% of total user, 76.5% of total elapsed
This seems strange. The maximum residency of 12MB sounds about correct for my data. But what's with the 59MB of "slop"? According to the ghc docs:

| The "bytes maximum slop" tells you the most space that is ever wasted
| due to the way GHC allocates memory in blocks. Slop is memory at the
| end of a block that was wasted. There's no way to control this; we
| just like to see how much memory is being lost this way.
There's this page also:
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/Slop
but it doesn't really make things clearer for me.
Is the slop number above likely to be a significant contribution to net memory usage?
Yes, absolutely.
Are there any obvious reasons why the code below could be generating so much?
I suspect packing a lot of presumably relatively short ByteStrings would generate (the lion's share of) the slop. I'm not familiar with the internals, though, so I don't know where GHC would put a newPinnedByteArray# (which is where your ByteString contents live), or what alignment requirements those have.
The data file in question has 61k lines, and is <6MB in total.
Thanks,
Tim
-------- Map2.hs --------------------------------------------
module Main where
import qualified Data.Map as Map
import qualified Data.ByteString.Char8 as BS
import System.Environment
import System.IO
type MyMap = Map.Map BS.ByteString BS.ByteString
foldLines :: (a -> String -> a) -> a -> Handle -> IO a
foldLines f a h = do
    eof <- hIsEOF h
    if eof
      then return a
      else do
        l <- hGetLine h
        let a' = f a l
        a' `seq` foldLines f a' h
undumpFile :: FilePath -> IO MyMap
undumpFile path = do
    h <- openFile path ReadMode
    m <- foldLines addv Map.empty h
    hClose h
    return m
  where
    addv m "" = m
    addv m s  = let (k,v) = readKV s
                in k `seq` v `seq` Map.insert k v m
readKV s = let (ks,vs) = read s in (BS.pack ks, BS.pack vs)
It might be better to read the file in one go and construct the map in pure code (foldl' addv Map.empty $ lines filecontents). Also, it will probably be better to do everything on ByteStrings.

The file format seems to be ("key","value") on each line, with possible whitespace and empty lines. If none of the keys or values may contain a '\"',

undumpFile path = do
    contents <- BS.readFile path
    return $! foldl' addv Map.empty (BS.lines contents)
  where
    addv m s
      | BS.null s = m
      | otherwise = case BS.split '"' s of
                      (_ : k : _ : v : _) -> Map.insert k v m
                      _                   -> error "malformed line"

should perform much better. If a key or value may contain '"', it's more complicated; using a regex library to split might be a good option then.
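(If anyone tries that version: foldl' comes from Data.List, and the BS functions are the ones from Data.ByteString.Char8 that the original module already imports.)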
dump :: [(BS.ByteString,BS.ByteString)] -> IO ()
dump vs = mapM_ putV vs
  where
    putV (k,v) = putStrLn (show (BS.unpack k, BS.unpack v))
main :: IO ()
main = do
    args <- getArgs
    case args of
      [path] -> do
        v <- undumpFile path
        dump (Map.toList v)
        return ()

On 22/03/11 05:33, Daniel Fischer wrote:
On Wednesday 23 March 2011 03:32:16, Tim Docker wrote:
Is the slop number above likely to be a significant contribution to net memory usage?
Yes, absolutely.
Are there any obvious reasons why the code below could be generating so much?
I suspect packing a lot of presumably relatively short ByteStrings would generate (the lion's share of) the slop. I'm not familiar with the internals, though, so I don't know where GHC would put a newPinnedByteArray# (which is where your ByteString contents live), or what alignment requirements those have.
Thanks. I'm aware that the code could be optimised, e.g. by sticking to ByteStrings and avoiding Strings and read; those were just to keep the example simple. I expected this would affect speed, though not memory usage.

I'm a bit shocked at the amount of wasted memory here. The sample data file has ~61k key/value pairs, hence ~122k ByteStrings; as you point out, many of these are very small (1500 of them are empty). Assuming it's the ByteStrings that are generating the slop, I am seeing ~500 bytes on average per ByteString!

Tim

On Tue, March 22, 2011 21:00:29 Tim Docker wrote:
I'm a bit shocked at the amount of wasted memory here. The sample data file has ~61k key/value pairs, hence ~122k ByteStrings; as you point out, many of these are very small (1500 of them are empty). Assuming it's the ByteStrings that are generating the slop, I am seeing ~500 bytes on average per ByteString!
It sounds like the space is allocated but consists of unused pages. Unless you have messed with some kernel memory manager settings, unused virtual pages consume no physical RAM. You could confirm this by using ps to check how much RSS is actually used, compared to VSZ allocated (VSZ - RSS shouldn't include any actual data unless your system is actively swapping stuff to disk). If it is just unused pages, it's not a problem.

Brandon
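For what it's worth, here is a rough sketch of checking those numbers from inside the program; this assumes a Linux system with /proc, and the helper name is my own. /proc/self/status reports VmSize (roughly the VSZ figure) and VmRSS (the RSS figure).

import Data.List (isPrefixOf)

-- Print the VmSize (virtual) and VmRSS (resident) lines for this process.
printMemStats :: IO ()
printMemStats = do
    status <- readFile "/proc/self/status"
    mapM_ putStrLn [ l | l <- lines status
                       , any (`isPrefixOf` l) ["VmSize", "VmRSS"] ]

Calling printMemStats just after the map is built, and again at the end of main, would show whether the resident figure really tracks the virtual one.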

On 22/03/2011 16:47, Brandon Moore wrote:
On Tue, March 22, 2011 21:00:29 Tim Docker wrote:
I'm a bit shocked at the amount of wasted memory here. The sample data file has ~61k key/value pairs, hence ~122k ByteStrings; as you point out, many of these are very small (1500 of them are empty). Assuming it's the ByteStrings that are generating the slop, I am seeing ~500 bytes on average per ByteString!
It sounds like the space is allocated but consists of unused pages. Unless you have messed with some kernel memory manager settings, unused virtual pages consume no physical RAM. You could confirm this by using ps to check how much RSS is actually used, compared to VSZ allocated (VSZ - RSS shouldn't include any actual data unless your system is actively swapping stuff to disk). If it is just unused pages, it's not a problem.
GHC never allocates more than 1MB above what it needs at any given time. If the memory usage of the program spikes, then unused pages are returned at the next GC. Cheers, Simon

On 22/03/11 10:47, Brandon Moore wrote:
It sounds like the space is allocated but consists of unused pages. Unless you have messed with some kernel memory manager settings, unused virtual pages consume no physical RAM. You could confirm this by using ps to check how much RSS is actually used, compared to VSZ allocated (VSZ - RSS shouldn't include any actual data unless your system is actively swapping stuff to disk). If it is just unused pages, it's not a problem.
Thanks. I've looked at this, and can confirm that the reported VSZ and RSS are almost the same (120MB and 116MB). I think this means that the observed memory usage is real. Tim

On Wed, Mar 23, 2011 at 9:32 AM, Tim Docker wrote:
Productivity 78.2% of total user, 76.5% of total elapsed
As a rule of thumb, GC time should be less than 10%.
This seems strange. The maximum residency of 12MB sounds about correct for my data. But what's with the 59MB of "slop"? According to the ghc docs:
| The "bytes maximum slop" tells you the most space that is ever wasted
| due to the way GHC allocates memory in blocks. Slop is memory at the
| end of a block that was wasted. There's no way to control this; we
| just like to see how much memory is being lost this way.
GHC requests memory from the OS in large blocks. This makes GC more efficient. The program might not end up using all the allocated memory in the end.
type MyMap = Map.Map BS.ByteString BS.ByteString
Try using HashMap from the unordered-containers package. It's typically 2-3x faster than Map for ByteString/Text keys.
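A minimal sketch of what that swap looks like, assuming the Data.HashMap.Strict module from that package; only the map type and the insert call change.

import qualified Data.ByteString.Char8 as BS
import qualified Data.HashMap.Strict as HM

type MyMap = HM.HashMap BS.ByteString BS.ByteString

-- Same shape as the Data.Map version: insert one parsed key/value pair.
addv :: MyMap -> (BS.ByteString, BS.ByteString) -> MyMap
addv m (k, v) = HM.insert k v m

Lookups and Map.toList have direct counterparts too (HM.lookup, HM.toList), so the rest of the program stays essentially the same.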
foldLines :: (a -> String -> a) -> a -> Handle -> IO a
foldLines f a h = do
    eof <- hIsEOF h
    if eof
      then return a
      else do
        l <- hGetLine h
        let a' = f a l
        a' `seq` foldLines f a' h
Your foldLines is not strict enough. Consider what happens if you call

    foldLines someF undefined someHandle

when the file is empty. If foldLines was strict in the accumulator you'd expect the program to crash (from evaluating undefined), but it doesn't, as 'return a' doesn't force 'a'. In addition, you'd like GHC to inline foldLines so the indirect function call to 'f' can be turned into a call to a known function. Here's a better definition:

foldLines :: (a -> String -> a) -> a -> Handle -> IO a
foldLines f a0 !h = go a0
  where
    go !a = do
        eof <- hIsEOF h
        if eof
          then return a
          else do
            l <- hGetLine h
            go (f a l)
{-# INLINE foldLines #-}

Also, as others have mentioned, String is no good. Use ByteString and Text. Both come with functions to read lines (if I recall correctly).

Johan
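Along those lines, here is a sketch of the same loop working directly on ByteString lines, avoiding String entirely. It uses BS.hGetLine from Data.ByteString.Char8; the name foldLinesBS is just for illustration.

{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Char8 as BS
import System.IO (Handle, hIsEOF)

foldLinesBS :: (a -> BS.ByteString -> a) -> a -> Handle -> IO a
foldLinesBS f a0 h = go a0
  where
    -- The bang keeps the accumulator evaluated on every iteration.
    go !a = do
        eof <- hIsEOF h
        if eof
          then return a
          else do
            l <- BS.hGetLine h
            go (f a l)
{-# INLINE foldLinesBS #-}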

On 23/03/2011 02:32, Tim Docker wrote:
On Mon, Mar 21, 2011 at 9:59 AM, I wrote:
My question on the ghc heap profiler on stack overflow:
http://stackoverflow.com/questions/5306717/how-should-i-interpret-the-output...
remains unanswered :-( Perhaps that's not the best forum. Is there someone here prepared to explain how the memory usage in the heap profiler relates to the "Live Bytes" count shown in the garbage collection statistics?
I've made a little progress on this. I've simplified my program down to a simple executable that loads a bunch of data into an in-memory map, and then writes it out again. I've added calls to `seq` to ensure that laziness is not causing excessive memory consumption. When I run this on my sample data set, it takes ~7 cpu seconds and uses ~120 MB of vm. An equivalent python script takes ~2 secs and ~19 MB of vm :-(.
The code is below. I'm mostly concerned with the memory usage rather than performance at this stage. What is interesting is that when I turn on garbage collection statistics (+RTS -s), I see this:
  10,089,324,996 bytes allocated in the heap
     201,018,116 bytes copied during GC
      12,153,592 bytes maximum residency (8 sample(s))
      59,325,408 bytes maximum slop
             114 MB total memory in use (1 MB lost due to fragmentation)
  Generation 0: 19226 collections,     0 parallel,  1.59s,  1.64s elapsed
  Generation 1:     8 collections,     0 parallel,  0.04s,  0.04s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.84s  (  5.96s elapsed)
  GC    time    1.63s  (  1.68s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.47s  (  7.64s elapsed)
%GC time 21.8% (22.0% elapsed)
Alloc rate 1,726,702,840 bytes per MUT second
Productivity 78.2% of total user, 76.5% of total elapsed
This seems strange. The maximum residency of 12MB sounds about correct for my data. But what's with the 59MB of "slop"? According to the ghc docs:
| The "bytes maximum slop" tells you the most space that is ever wasted
| due to the way GHC allocates memory in blocks. Slop is memory at the
| end of a block that was wasted. There's no way to control this; we
| just like to see how much memory is being lost this way.
There's this page also:
http://hackage.haskell.org/trac/ghc/wiki/Commentary/Rts/Storage/Slop
but it doesn't really make things clearer for me.
I think the slop figure might be inaccurate when there are lots of ByteStrings floating around, due to the way the garbage collector handles pinned objects (which ByteStrings are). I'll take a look at this sometime. Cheers, Simon

On 23/03/2011 02:32, Tim Docker wrote:
The code is below. I'm mostly concerned with the memory usage rather than performance at this stage. What is interesting is that when I turn on garbage collection statistics (+RTS -s), I see this:
  10,089,324,996 bytes allocated in the heap
     201,018,116 bytes copied during GC
      12,153,592 bytes maximum residency (8 sample(s))
      59,325,408 bytes maximum slop
             114 MB total memory in use (1 MB lost due to fragmentation)
  Generation 0: 19226 collections,     0 parallel,  1.59s,  1.64s elapsed
  Generation 1:     8 collections,     0 parallel,  0.04s,  0.04s elapsed
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    5.84s  (  5.96s elapsed)
  GC    time    1.63s  (  1.68s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.47s  (  7.64s elapsed)
%GC time 21.8% (22.0% elapsed)
Alloc rate 1,726,702,840 bytes per MUT second
Productivity 78.2% of total user, 76.5% of total elapsed
This seems strange. The maximum residency of 12MB sounds about correct for my data. But what's with the 59MB of "slop"?
I made some changes to the storage manager in the runtime today, and fixed the slop problem with your program. Here it is after the changes:

  14,928,031,040 bytes allocated in the heap
     313,542,200 bytes copied during GC
      18,044,096 bytes maximum residency (7 sample(s))
         294,256 bytes maximum slop
              37 MB total memory in use (0 MB lost due to fragmentation)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.38s  (  6.39s elapsed)
  GC    time    1.26s  (  1.26s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.64s  (  7.64s elapsed)

I think this is with a different workload than the one you used above. Before the change I was getting

  15,652,646,680 bytes allocated in the heap
     312,402,760 bytes copied during GC
      17,913,816 bytes maximum residency (9 sample(s))
     111,424,792 bytes maximum slop
             142 MB total memory in use (0 MB lost due to fragmentation)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    8.01s  (  8.02s elapsed)
  GC    time   16.86s  ( 16.89s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   24.88s  ( 24.91s elapsed)

(GHC 7.0.3 on x86-64/Linux)

So, a pretty dramatic improvement. I'm validating the patch right now, it should be in 7.2.1.

Cheers,
Simon

On 14/04/2011, at 6:24 PM, Simon Marlow wrote:
I made some changes to the storage manager in the runtime today, and fixed the slop problem with your program. Here it is after the changes:
  14,928,031,040 bytes allocated in the heap
     313,542,200 bytes copied during GC
      18,044,096 bytes maximum residency (7 sample(s))
         294,256 bytes maximum slop
              37 MB total memory in use (0 MB lost due to fragmentation)
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.38s  (  6.39s elapsed)
  GC    time    1.26s  (  1.26s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.64s  (  7.64s elapsed)
I think this is with a different workload than the one you used above. Before the change I was getting
  15,652,646,680 bytes allocated in the heap
     312,402,760 bytes copied during GC
      17,913,816 bytes maximum residency (9 sample(s))
     111,424,792 bytes maximum slop
             142 MB total memory in use (0 MB lost due to fragmentation)
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    8.01s  (  8.02s elapsed)
  GC    time   16.86s  ( 16.89s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   24.88s  ( 24.91s elapsed)
(GHC 7.0.3 on x86-64/Linux)

So, a pretty dramatic improvement. I'm validating the patch right now, it should be in 7.2.1.
This looks really promising. Let me know when the patch is available, and I'll try it out on my real code. Thanks, Tim

On 19/04/2011 14:41, Tim Docker wrote:
On 14/04/2011, at 6:24 PM, Simon Marlow wrote:
I made some changes to the storage manager in the runtime today, and fixed the slop problem with your program. Here it is after the changes:
  14,928,031,040 bytes allocated in the heap
     313,542,200 bytes copied during GC
      18,044,096 bytes maximum residency (7 sample(s))
         294,256 bytes maximum slop
              37 MB total memory in use (0 MB lost due to fragmentation)
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.38s  (  6.39s elapsed)
  GC    time    1.26s  (  1.26s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.64s  (  7.64s elapsed)
I think this is with a different workload than the one you used above. Before the change I was getting
  15,652,646,680 bytes allocated in the heap
     312,402,760 bytes copied during GC
      17,913,816 bytes maximum residency (9 sample(s))
     111,424,792 bytes maximum slop
             142 MB total memory in use (0 MB lost due to fragmentation)
  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    8.01s  (  8.02s elapsed)
  GC    time   16.86s  ( 16.89s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   24.88s  ( 24.91s elapsed)
(GHC 7.0.3 on x86-64/Linux)

So, a pretty dramatic improvement. I'm validating the patch right now, it should be in 7.2.1.
This looks really promising. Let me know when the patch is available, and I'll try it out on my real code.
The change is already in:

http://hackage.haskell.org/trac/ghc/changeset/cc2ea98ac4a15e40a15e89de9e47f3...

You can build GHC yourself from the git repositories, download a snapshot, or wait for 7.2.1.

Cheers,
Simon
participants (6)

- Brandon Moore
- Daniel Fischer
- Evan Laforge
- Johan Tibell
- Simon Marlow
- Tim Docker