"+RTS -A" parameter and CPU cache size

Hello glasgow-haskell-users, "+RTS -A" parameter sets up size of memory allocation area. by default it set to 256kb and this means that after each 256 kb allocated minor GC occurs and data from this block that is still alive are moved to main heap area (what is scanned only on major GCs) while rest of space is freed. also on each minor GC all mutable variables (IORef/STRef) and mutavle arrays (IOArray/STArray) are scanned and therefore programs that use them heavily (including GHC itself) works faster when value of "-A" parameter is increased. For example, by setting this parameter to 10mb you can decrease count of minor GCs in 40 times. The speed change depends on concrete program, for GHC itself it may be 20-50% faster. that is the example of using this option to speed up compilation: ghc --make main.hs +RTS -A10m well, but what if we don't have many global variables and therefore don't need to increase -A? in this case it's better to have such value of this parameter that all "local data" hold in L1+L2 CPU caches. that is especially important when program generates large number of temporary data cells with very short lifetime. i've found a good example of such program in my own vGetLine-testing code - it reads millions of 20-char lines in one loop. my CPU has 128k total data cache and for this task the most efficient -A value was about 100kb, what allows this benchmark to run 2x faster that with standard -A256k value in general, data cache should be enough to hold: 1) "nursery area", whose size is set by -A 2) part of data from this nursery area what remains alive after minor GC (it may be anything between 0 kb and whole nursery area) 3) all data outside of nursery area what program access between two minor GCs (unpredictable value, in general) knowing this all, we can try to calculate best size of nursery area (i.e., "-A" value), depended on CPU cache size. although we can't find exact optimum because some ratios in list above are unknown, we still can say that optimization should be done for cases where amount of "external" (outside nursery) data accessed between GCs is no more than size of nursery itself and that in typical cases more than 50% of data in nursery will be freed by minor GC. based on these assumptions, we can make conclusion that nursery size should be about 1/2 of cache size. if nursery size is greater or equal to cpu cache size, then it will work inefficient, overfilling L2 cache and "trashing" memory - just like the program that uses little more memory than physically available trashes the swap disk now let's analyze effective data cache size of today CPUs: celeron - 128k-256k sempron, athlon, athlon xp - 192k-320k (64k of L1 + 128k-256k of L2 cache) celeron d, athlon 64, pentium4, pentium d - 512k-2048k as you can see, most of cpus, used in today systems, don't have cache large enough to work efficiently with 256k nursery size (according to my calculations above, it requires 512k or larger cache). this results in sub-optimal performance of ghc-compiled programs. according to my tests, every algorithm tested can be made faster either by increasing -A to 10mb, or decreasing to 1/2 of my cpu cache! the only problem is to automatically decide what to do for each particular algorithm - increase or decrease? :) moreover, when decreasing -A to 64k. speed loss for some algorithms can be also significant - up to 1.5 times. of course, this is the same algorithms that take advantage when -A is significantly increased (see first paragraph for explanations). this problem don't allow me to recommend automatic decreasing of -A for low-level cpus with GHC 6.4, but fortunately it should almost gone in ghc 6.6 taking this all into account, i propose the following: 1) GHC faqs already suggests using of RTS -A/-H option to speed up compilation. i propose to move this suggestion right to compiler itself and add the line char *ghc_rts_opts = "-A10m"; to GHC 6.4 sources 2) i propose to write "L2 cache size detection" code and use it in GHC 6.6 RTS to setup initial value of "-A" option. in order to allow program tune itself to any cpu architecture, with cache sizes ranging from 128kb to 4mb. this will allow low-level cpus to run significantly faster on some algorithms (up to 2x, as i said above) and can give 5-10% speedup for high-level cpus, that is also not so bad :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
taking this all into account, i propose the following:
1) GHC faqs already suggests using of RTS -A/-H option to speed up compilation. i propose to move this suggestion right to compiler itself and add the line
char *ghc_rts_opts = "-A10m";
to GHC 6.4 sources
Do you have some evidence that -A10m is a good default? Better than -A6m, or -A16m, for example? GHC currently runs with -H6m by default. I'm happy to change this (in 6.4 only, I suppose), if you have evidence that we're sitting in a bad place on the curve. It's a space/time tradeoff, as usual. Typically we've been quite conservative with space usage in the past, this is why the defaults tend to be quite low.
2) i propose to write "L2 cache size detection" code and use it in GHC 6.6 RTS to setup initial value of "-A" option. in order to allow program tune itself to any cpu architecture, with cache sizes ranging from 128kb to 4mb. this will allow low-level cpus to run significantly faster on some algorithms (up to 2x, as i said above) and can give 5-10% speedup for high-level cpus, that is also not so bad :)
That sounds like a good plan. I've been experimenting with PAPI recently (http://icl.cs.utk.edu/papi/) which can tell you the size of your caches, but it requires kernel patches. Cheers, Simon

Hello Simon, Friday, June 16, 2006, 3:48:06 PM, you wrote:
char *ghc_rts_opts = "-A10m"; Do you have some evidence that -A10m is a good default? Better than -A6m, or -A16m, for example? GHC currently runs with -H6m by default.
of course, "-a6m" is not worser than "-a10m". but "-h" is not a good alternative because after memory usage grows above 6 mb, ghc starts to do minor GCs after each 256 kb allocated. with "-A6m" number of minor GCs will be dramatically cut off i just quickly tested non-optimized compilation of my program with "-A6m" and with "-H6m" (ghc 6.4.2.20060609) - wall clock time of compilation was 63 seconds against 76 seconds, while memory usage was 70 mb against 68 mb btw, now i use "+RTS -c" in my compilations - it works fine. i'm happy :)
2) i propose to write "L2 cache size detection" code and use it in GHC 6.6 RTS to setup initial value of "-A" option. in order to allow program tune itself to any cpu architecture, with cache sizes ranging from 128kb to 4mb. this will allow low-level cpus to run significantly faster on some algorithms (up to 2x, as i said above) and can give 5-10% speedup for high-level cpus, that is also not so bad :)
That sounds like a good plan. I've been experimenting with PAPI recently (http://icl.cs.utk.edu/papi/) which can tell you the size of your caches, but it requires kernel patches.
for me, Windows compatibility is most important issue :) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 16 June 2006 13:27, Bulat Ziganshin wrote:
Friday, June 16, 2006, 3:48:06 PM, you wrote:
char *ghc_rts_opts = "-A10m"; Do you have some evidence that -A10m is a good default? Better than -A6m, or -A16m, for example? GHC currently runs with -H6m by default.
of course, "-a6m" is not worser than "-a10m". but "-h" is not a good alternative because after memory usage grows above 6 mb, ghc starts to do minor GCs after each 256 kb allocated. with "-A6m" number of minor GCs will be dramatically cut off
i just quickly tested non-optimized compilation of my program with "-A6m" and with "-H6m" (ghc 6.4.2.20060609) - wall clock time of compilation was 63 seconds against 76 seconds, while memory usage was 70 mb against 68 mb
What I'm getting at is what you think the best value for -A might be. I'm aware of the problem exposed by using -H on its own. So -A6m is no worse than -A10m? In that case, -A6m is clearly to be preferred; but is there a better point between 6m and 256k?
btw, now i use "+RTS -c" in my compilations - it works fine. i'm happy :)
Doesn't that increase GC time significantly? Cheers, Simon

Hello Simon, Friday, June 16, 2006, 4:50:15 PM, you wrote:
char *ghc_rts_opts = "-A10m"; Do you have some evidence that -A10m is a good default? Better than -A6m, or -A16m, for example? GHC currently runs with -H6m by default.
of course, "-a6m" is not worser than "-a10m". but "-h" is not a good alternative because after memory usage grows above 6 mb, ghc starts to do minor GCs after each 256 kb allocated. with "-A6m" number of minor GCs will be dramatically cut off
i just quickly tested non-optimized compilation of my program with "-A6m" and with "-H6m" (ghc 6.4.2.20060609) - wall clock time of compilation was 63 seconds against 76 seconds, while memory usage was 70 mb against 68 mb
What I'm getting at is what you think the best value for -A might be. I'm aware of the problem exposed by using -H on its own. So -A6m is no worse than -A10m? In that case, -A6m is clearly to be preferred; but is there a better point between 6m and 256k?
i think that when "-A" is larger than cpu cache, it's increasing should decrease GC times. there is no "best point", it's just memory/speed compromise. when -A is large enough, GC times should be a small percent of total execution time, so it should be no much difference between 6m or 16m settings. but 16m should be slightly faster than 6m, and 32m should be slightly faster than 16m i've just tested GHC with "-A6m" and "-A16m" and it was only 1 second faster with second setting (of total 60 seconds)
btw, now i use "+RTS -c" in my compilations - it works fine. i'm happy :)
Doesn't that increase GC time significantly?
of course, but it's much better than running out of physical memory. to be exact, i use "-A10m -M256m -c20" for my 256mb of RAM (i disable these options when i do above mentioned tests) -- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

On 16 June 2006 15:51, Bulat Ziganshin wrote:
i think that when "-A" is larger than cpu cache, it's increasing should decrease GC times. there is no "best point", it's just memory/speed compromise. when -A is large enough, GC times should be a small percent of total execution time, so it should be no much difference between 6m or 16m settings. but 16m should be slightly faster than 6m, and 32m should be slightly faster than 16m
Well yes, obviously.
i've just tested GHC with "-A6m" and "-A16m" and it was only 1 second faster with second setting (of total 60 seconds)
Thanks. Simon
participants (3)
-
Bulat Ziganshin
-
Simon Marlow
-
Simon Marlow