Reducing the cost of garbage collecting small arrays

Hi,

Now that we can efficiently copy small arrays, I've gone back to benchmarking/optimizing my hash array mapped trie data structure. The data structure's performance depends on how efficiently we can allocate/collect/copy Array#s and MutableArray#s of size <= 32. Speeding up copying helped quite a bit, but the HAMT is still slower than a Patricia tree for inserts.

If you look at the data below you can see that GC time totally dominates the runtime. Is there anything that can be done to make the GC cheaper in this case? If I'm reading the output below correctly, it's not allocating the space that takes most of the time, but later evacuating/scavenging it.

Benchmark
=========

{-# LANGUAGE BangPatterns #-}
module Main (main) where

import Data.List
import qualified Data.HashMap as HM

test :: HM.HashMap Int Int -> Int -> HM.HashMap Int Int
test !m 0 = m
test m n = test (HM.insert n n m) (n-1)
{-# NOINLINE test #-}

main = print $ HM.null $ test HM.empty 1000000

Hash array mapped trie
======================

$ time perf record -f ./test
False
[ perf record: Woken up 13 times to write data ]
[ perf record: Captured and wrote 1.672 MB perf.data (~73041 samples) ]

real    0m2.562s
user    0m0.020s
sys     0m0.010s

# Samples: 72980
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
    52.79%  test  ./test  [.] evacuate
     9.49%  test  ./test  [.] 0x000000000b4544
     8.47%  test  ./test  [.] scavenge_block
     6.27%  test  ./test  [.] s4UJ_info
     4.67%  test  ./test  [.] s1Od_info
     4.22%  test  ./test  [.] scavenge_mut_arr_ptrs
     3.68%  test  ./test  [.] s4N7_info
     0.72%  test  ./test  [.] todo_block_full
     0.55%  test  ./test  [.] GarbageCollect
     0.55%  test  ./test  [.] freeGroup

Patricia tree
=============

$ time perf record -f ./test
False
[ perf record: Woken up 11 times to write data ]
[ perf record: Captured and wrote 1.292 MB perf.data (~56470 samples) ]

real    0m1.984s
user    0m0.020s
sys     0m0.010s

# Samples: 56409
#
# Overhead  Command  Shared Object  Symbol
# ........  .......  .............  ......
#
    49.30%  test  ./test    [.] evacuate
    24.75%  test  ./test    [.] sa4B_info
    14.56%  test  ./test    [.] scavenge_block
     1.31%  test  ./test    [.] sa4k_info
     1.25%  test  ./test    [.] sa4m_info
     1.25%  test  ./test    [.] 0x00000000004108
     0.72%  test  [kernel]  [k] clear_page_c
     0.66%  test  ./test    [.] todo_block_full
     0.46%  test  [kernel]  [k] page_fault
     0.44%  test  ./test    [.] GarbageCollect

P.S. I also need to figure out what 0x000000000b4544 corresponds to in the first perf report. I thought my fix to the generated assembly files had taken care of the unknown-symbol problem.

Cheers,
Johan
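The copy cost under discussion comes from the persistent update pattern: every insert clones an entire node array just to replace one slot, so the old array becomes garbage immediately. A minimal sketch of that pattern — illustrative code, not the actual HAMT implementation — written against the primitive package's Data.Primitive.Array wrappers:

-- Illustrative sketch (not the actual HAMT code) of the copy-on-write
-- node update that every persistent insert performs.
module NodeUpdate (updateSlot) where

import Control.Monad.ST (runST)
import Data.Primitive.Array

-- Clone a node array, replacing the element at index i. The fresh
-- array is allocated in the nursery and the old one becomes garbage
-- immediately, which is why evacuate/scavenge dominate the profiles.
updateSlot :: Array a -> Int -> a -> Array a
updateSlot src i x = runST $ do
  let n = sizeofArray src
  dst <- newArray n x        -- allocate the new node, filled with x
  copyArray dst 0 src 0 n    -- bulk copy; compiles down to copyArray#
  writeArray dst i x         -- overwrite the single changed slot
  unsafeFreezeArray dst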

On Wed, Jun 22, 2011 at 12:33 PM, Johan Tibell wrote:
> Hi,
>
> Now that we can efficiently copy small arrays, I've gone back to benchmarking/optimizing my hash array mapped trie data structure. The data structure's performance depends on how efficiently we can allocate/collect/copy Array#s and MutableArray#s of size <= 32. Speeding up copying helped quite a bit, but the HAMT is still slower than a Patricia tree for inserts.
Is 5 the optimal number of bits to slice off at a time (i.e., the best fanout)? It sounds like node copy cost on insert argues for a slightly narrower fanout. You'll be evacuating/scanning more words in total, but the smaller new nodes may equate to less scanning overall (especially if this runs long enough for some nodes to get tenure).

-Jan
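To make the trade-off concrete — a back-of-envelope model of my own, not measurements from the thread: with b bits per level the fanout is 2^b, and an insert at full depth must copy one node per level, of which there are about ceil(hashBits/b):

-- Back-of-envelope model (an assumption, not measured): words copied
-- by an insert that descends the full depth of the trie.
wordsCopiedPerInsert :: Int -> Int -> Int
wordsCopiedPerInsert hashBits b = depth * 2 ^ b
  where depth = (hashBits + b - 1) `div` b  -- ceil(hashBits / b)

For a 32-bit hash this gives 128 words at b=4, 224 at b=5, and 384 at b=6, which at least matches the direction of the argument: narrower fanout means less copying per insert, at the price of a deeper trie.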

On Thu, Jun 23, 2011 at 12:31 AM, Jan-Willem Maessen wrote:
> On Wed, Jun 22, 2011 at 12:33 PM, Johan Tibell wrote:
>> Hi,
>>
>> Now that we can efficiently copy small arrays, I've gone back to benchmarking/optimizing my hash array mapped trie data structure. The data structure's performance depends on how efficiently we can allocate/collect/copy Array#s and MutableArray#s of size <= 32. Speeding up copying helped quite a bit, but the HAMT is still slower than a Patricia tree for inserts.
>
> Is 5 the optimal number of bits to slice off at a time (i.e., the best fanout)? It sounds like node copy cost on insert argues for a slightly narrower fanout. You'll be evacuating/scanning more words in total, but the smaller new nodes may equate to less scanning overall (especially if this runs long enough for some nodes to get tenure).
I'm experimenting with this. 6 is far too much, making inserts 4-5x slower. 4 doesn't seem to improve things much (which is a bit counter-intuitive given that 6 made things so much worse), but I need to experiment some more.

Cheers,
Johan

On Thu, Jun 23, 2011 at 8:27 AM, Johan Tibell wrote:
>> Is 5 the optimal number of bits to slice off at a time (i.e., the best fanout)? It sounds like node copy cost on insert argues for a slightly narrower fanout. You'll be evacuating/scanning more words in total, but the smaller new nodes may equate to less scanning overall (especially if this runs long enough for some nodes to get tenure).
>
> I'm experimenting with this. 6 is far too much, making inserts 4-5x slower. 4 doesn't seem to improve things much (which is a bit counter-intuitive given that 6 made things so much worse), but I need to experiment some more.
After some more testing I can improve the performance of insert by 30% by reducing the array size from 32 to 16. GC still dominates, though.

Johan
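A 16-element node means 4 bits per level, and the direction (though not the magnitude) of that 30% agrees with the illustrative cost model sketched earlier in the thread; for a 32-bit hash:

ghci> wordsCopiedPerInsert 32 4   -- 16-wide nodes
128
ghci> wordsCopiedPerInsert 32 5   -- 32-wide nodes
224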

On Thu, Jun 23, 2011 at 4:54 AM, Johan Tibell wrote:
> On Thu, Jun 23, 2011 at 8:27 AM, Johan Tibell wrote:
>>> Is 5 the optimal number of bits to slice off at a time (i.e., the best fanout)? It sounds like node copy cost on insert argues for a slightly narrower fanout. You'll be evacuating/scanning more words in total, but the smaller new nodes may equate to less scanning overall (especially if this runs long enough for some nodes to get tenure).
>>
>> I'm experimenting with this. 6 is far too much, making inserts 4-5x slower. 4 doesn't seem to improve things much (which is a bit counter-intuitive given that 6 made things so much worse), but I need to experiment some more.
>
> After some more testing I can improve the performance of insert by 30% by reducing the array size from 32 to 16. GC still dominates, though.
>
> Johan
Yes, I'd still expect that; internal-node churn with fat nodes exhausts the heap more quickly than usual. If large nodes become the norm, cranking up the GC nursery size might be in order.

It's great to see that fat nodes finally work well in GHC, though. When I first tried this in the 90's it was so bad (due to the extra level of indirection and write-barrier problems) that I gave up in frustration.

-Jan
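For anyone wanting to try the nursery suggestion on the benchmark above: the nursery (allocation area) size is set with the standard +RTS -A flag, assuming the binary is built with -rtsopts. For example:

$ ghc -O2 -rtsopts test.hs
$ ./test +RTS -A64m -s -RTS    # 64 MB nursery; -s prints GC statistics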

On Thu, Jun 23, 2011 at 3:27 PM, Jan-Willem Maessen wrote:
> Yes, I'd still expect that; internal-node churn with fat nodes exhausts the heap more quickly than usual. If large nodes become the norm, cranking up the GC nursery size might be in order.
>
> It's great to see that fat nodes finally work well in GHC, though. When I first tried this in the 90's it was so bad (due to the extra level of indirection and write-barrier problems) that I gave up in frustration.
I just tested reducing the fanout to 8, which actually makes things slower.

Johan

On 23/06/2011 09:54, Johan Tibell wrote:
> On Thu, Jun 23, 2011 at 8:27 AM, Johan Tibell wrote:
>>> Is 5 the optimal number of bits to slice off at a time (i.e., the best fanout)? It sounds like node copy cost on insert argues for a slightly narrower fanout. You'll be evacuating/scanning more words in total, but the smaller new nodes may equate to less scanning overall (especially if this runs long enough for some nodes to get tenure).
>>
>> I'm experimenting with this. 6 is far too much, making inserts 4-5x slower. 4 doesn't seem to improve things much (which is a bit counter-intuitive given that 6 made things so much worse), but I need to experiment some more.
>
> After some more testing I can improve the performance of insert by 30% by reducing the array size from 32 to 16. GC still dominates, though.
I will try to look at this when I have some time. It could just be the cost of copying the extra data, though. This is probably another one of those cases where generational GC isn't helping: long-lived data is getting copied multiple times as it progresses from the young generation to the old one. I want to look into some heuristics to try to detect when this is happening and vary the nursery size automatically.

Cheers,
Simon
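One way to probe this hypothesis directly (a suggestion, not something tried in the thread) is to run the benchmark with a single generation, which turns GHC's collector into a plain copying collector with no young-to-old promotion:

$ ./test +RTS -G1 -s -RTS    # -G1: one generation, so no promotion copies

If the generational copying is the problem, -G1 together with a large -A nursery should shrink the gap to the Patricia tree.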
participants (3):

- Jan-Willem Maessen
- Johan Tibell
- Simon Marlow