Is evacuate for StgMutArrPtrs and StgArrPtrs expensive to GC?

Hi,

When I benchmark Data.HashMap.insert from unordered-containers (inserting the keys [0..10000]) the runtime is dominated by GC:

$ cat Test.hs
module Main where

import Control.DeepSeq
import Control.Exception
import Control.Monad
import qualified Data.HashMap.Strict as HM
import Data.List (foldl')

main = do
    let ks = [0..10000] :: [Int]
    evaluate (rnf ks)
    forM_ ([0..1000] :: [Int]) $ \ x -> do
        evaluate $ HM.null $ foldl' (\ m k -> HM.insert k x m) HM.empty ks

$ perf record -g ./Test +RTS -s
   6,187,678,112 bytes allocated in the heap
   3,309,887,128 bytes copied during GC
       1,299,200 bytes maximum residency (1002 sample(s))
         118,816 bytes maximum slop
               5 MB total memory in use (0 MB lost due to fragmentation)

                                    Tot time (elapsed)  Avg pause  Max pause
  Gen  0     11089 colls,     0 par    1.31s    1.30s     0.0001s    0.0005s
  Gen  1      1002 colls,     0 par    0.49s    0.51s     0.0005s    0.0022s

  INIT    time    0.00s  (  0.00s elapsed)
  MUT     time    1.02s  (  1.03s elapsed)
  GC      time    1.80s  (  1.80s elapsed)
  EXIT    time    0.00s  (  0.00s elapsed)
  Total   time    2.82s  (  2.84s elapsed)

  %GC     time      63.7%  (63.5% elapsed)

  Alloc rate    6,042,264,963 bytes per MUT second

  Productivity  36.3% of total user, 36.1% of total elapsed

$ perf report
    41.46%  Test  Test          [.] evacuate
    15.47%  Test  Test          [.] scavenge_block
    11.04%  Test  Test          [.] s3cN_info
     8.74%  Test  Test          [.] s3aZ_info
     3.59%  Test  Test          [.] 0x7ff5
     2.83%  Test  Test          [.] scavenge_mut_arr_ptrs
     2.69%  Test  libc-2.15.so  [.] 0x147fd9
     2.51%  Test  Test          [.] allocate
     2.00%  Test  Test          [.] s3oo_info
     0.91%  Test  Test          [.] todo_block_full
     0.87%  Test  Test          [.] hs_popcnt64
     0.80%  Test  Test          [.] s3en_info
     0.62%  Test  Test          [.] s3el_info

Is GC'ing StgMutArrPtrs and StgArrPtrs, which I create a lot of, more expensive than GC'ing normal heap objects (i.e. for standard data types)?

-- Johan

The code for 'allocate' in rts/sm/Storage.c doesn't seem that expensive: an extra branch compared to inline allocation, and the allocation is done in the next nursery block (risking fragmentation?).
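
(To make concrete what is hitting that path: each insert into the HAMT ends up copying the small node arrays along the path it touches, and every one of those copies allocates a fresh array closure of the kind mentioned above. A rough model of the per-node copy, written here against Data.Primitive.Array purely for illustration; unordered-containers has its own array wrapper, so this is not its actual code:

import Control.Monad.ST (runST)
import Data.Primitive.Array

-- Illustrative only: build a new array one slot larger and copy the old
-- elements around the insertion point. Every call allocates a fresh mutable
-- array and then freezes it, i.e. the kind of array closures under
-- discussion here.
insertAt :: Int -> a -> Array a -> Array a
insertAt i x old = runST $ do
  let n = sizeofArray old
  new <- newArray (n + 1) x              -- every slot initialised to x
  copyArray new 0       old 0 i          -- copy the prefix  [0, i)
  copyArray new (i + 1) old i (n - i)    -- copy the suffix  [i, n)
  unsafeFreezeArray new                  -- slot i still holds x

Doing that for ~10000 inserts per map, over 1000 iterations, is a lot of array traffic, which presumably is also why 'allocate' shows up in the profile at all.)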
-- Johan

It's typical for benchmarks that allocate a large data structure to spend a
lot of time in the GC. The data gets copied twice - once in the young
generation and then again when promoted to the old generation. You can
make this kind of benchmark much faster by just using a bigger allocation
area.
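
(For example, something along these lines; the exact value is machine- and program-dependent, so treat it as a starting point rather than a recommendation:

$ ./Test +RTS -s -A64m

This raises the allocation area from its 512k default, so far fewer nursery collections run and far less live data gets copied per unit of allocation.)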
There's nothing inherently costly about StgMutArrPtrs compared to other
objects, except that they are variable size and therefore we can't unroll
the copy loop, but I don't think that's a big effect. The actual copying
is the major cost.
The way to improve this kind of benchmark would be to add some heuristics
for varying the nursery size based on the quantity of data retained, for
example. I think there's a lot of room for improvement here, but someone
needs to do some careful benchmarking and experimentation. Andrew Farmer
did some work on this and allegedly got good results but we never saw the
code (hint hint!).
Cheers,
Simon

I did indeed implement dynamic nursery sizing and did some preliminary
benchmarking. The headline figure: 15% speedup on the nofib/gc benchmarks,
though the variance was pretty large, and there were some slowdowns.
My scheme was very simple... I kept track of the size and rough collection
time of the previous three collections and did a sort of crude binary
search to find a minimum in the search space. I did it this way because it
was simple and required constant time and memory to make a decision. One
of the conclusions, though, was that collection time is a bad metric, due to
the way the RTS re-uses blocks. As Simon pointed out, tracking retention
or some other metric would probably be better, but I need to explore that.
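
(For readers who want the flavour of it, here is a hypothetical Haskell model of that kind of controller; the real thing is an RTS patch written in C, and the details below, the cost metric, step sizes and bounds, are made up for illustration:

-- Hypothetical sketch, not the actual patch: after each collection we record
-- (nursery size, collection cost) and nudge the size towards whichever recent
-- sample was cheaper, halving the step when a move makes things worse.
data Tuner = Tuner
  { curSize  :: Int             -- current nursery size, in blocks
  , stepSize :: Int             -- how far to move the size next time
  , history  :: [(Int, Double)] -- recent (size, cost) samples, newest first
  }

-- Called after every GC with the measured cost (time, bytes copied, ...) of
-- the collection that just ran with the nursery at 'curSize'.
nextNurserySize :: Double -> Tuner -> Tuner
nextNurserySize cost t =
  case hist of
    (s1, c1) : (s2, c2) : _
      | c1 <= c2  ->      -- the last move helped: keep going the same way
          t { curSize = clamp (s1 + signum (s1 - s2) * stepSize t)
            , history = hist }
      | otherwise ->      -- the last move hurt: reverse and shrink the step
          t { curSize  = clamp (s2 + signum (s2 - s1) * half)
            , stepSize = max 1 half
            , history  = hist }
    _ ->                  -- not enough samples yet: just grow
          t { curSize = clamp (curSize t + stepSize t), history = hist }
  where
    hist  = take 3 ((curSize t, cost) : history t)
    half  = stepSize t `div` 2
    clamp = max 128 . min 16384   -- arbitrary illustrative bounds, in blocks

The point is just that the decision after each GC is constant time and space: compare the last couple of (size, cost) samples, keep moving in the direction that helped, and back off and shrink the step when it stops helping.)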
Another result: the default size is almost always too small (at least for
the nofib programs). CPUs come with huge caches, and using the RTS flag -A
to set the allocation area to be roughly the size of the L3 cache usually
gave pretty decent speedups.
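
(Concretely, one quick experiment is to look up the L3 size and hand it to -A, e.g.:

$ lscpu | grep 'L3 cache'      # suppose this reports 8192K
$ ./Test +RTS -s -A8m

The numbers here are hypothetical; the right value depends on the machine and on how many capabilities share the cache.)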
I did this for a class project, and had to put it down to focus on other
things, and just haven't picked it back up. I still have a patch lying
around, and several pages of notes with ideas for improvement in both the
metric and the search. I'm hoping to pick it up again in a couple of months,
with an eye on a workshop paper and a real patch for 7.10.

Awesome!

Please let us know when some of the info is available publicly, perhaps so
other folks can help out with experimentation.

Definitely... I'm more or less fully occupied for the next two weeks, but I
should be able to dig it out then and organize/share it.

Wonderful!

Also, it was great meeting you at ICFP.
participants (4)

- Andrew Farmer
- Carter Schonwald
- Johan Tibell
- Simon Marlow