New subject: Re[2]: "+RTS -A" parameter and CPU cache size

14 Jun 2006

      Hello glasgow-haskell-users,

"+RTS -A" parameter sets up size of memory allocation area. by default
it set to 256kb and this means that after each 256 kb allocated minor
GC occurs and data from this block that is still alive are moved to
main heap area (what is scanned only on major GCs) while rest of space
is freed. also on each minor GC all mutable variables (IORef/STRef)
and mutavle arrays (IOArray/STArray) are scanned and therefore
programs that use them heavily (including GHC itself) works faster
when value of "-A" parameter is increased. For example, by setting
this parameter to 10mb you can decrease count of minor GCs in 40 times.
The speed change depends on concrete program, for GHC itself it may be
20-50% faster. that is the example of using this option to speed up
compilation:

ghc --make main.hs +RTS -A10m

well, but what if we don't have many global variables and therefore
don't need to increase -A? in this case it's better to have such value
of this parameter that all "local data" hold in L1+L2 CPU caches. that
is especially important when program generates large number of
temporary data cells with very short lifetime. i've found a good
example of such program in my own vGetLine-testing code - it reads
millions of 20-char lines in one loop. my CPU has 128k total data
cache and for this task the most efficient -A value was about 100kb,
what allows this benchmark to run 2x faster that with standard -A256k
value

in general, data cache should be enough to hold:

1) "nursery area", whose size is set by -A
2) part of data from this nursery area what remains alive after minor
GC (it may be anything between 0 kb and whole nursery area)
3) all data outside of nursery area what program access between two
minor GCs (unpredictable value, in general)

knowing this all, we can try to calculate best size of nursery area
(i.e., "-A" value), depended on CPU cache size. although we can't find
exact optimum because some ratios in list above are unknown, we still
can say that optimization should be done for cases where amount of
"external" (outside nursery) data accessed between GCs is no more than
size of nursery itself and that in typical cases more than 50% of data
in nursery will be freed by minor GC. based on these assumptions, we
can make conclusion that nursery size should be about 1/2 of cache
size. if nursery size is greater or equal to cpu cache size, then it
will work inefficient, overfilling L2 cache and "trashing" memory -
just like the program that uses little more memory than physically
available trashes the swap disk

now let's analyze effective data cache size of today CPUs:

celeron - 128k-256k
sempron, athlon, athlon xp - 192k-320k (64k of L1 + 128k-256k of L2 cache)
celeron d, athlon 64, pentium4, pentium d - 512k-2048k

as you can see, most of cpus, used in today systems, don't have cache
large enough to work efficiently with 256k nursery size (according to
my calculations above, it requires 512k or larger cache). this results
in sub-optimal performance of ghc-compiled programs. according to my
tests, every algorithm tested can be made faster either by increasing -A to
10mb, or decreasing to 1/2 of my cpu cache! the only problem is to
automatically decide what to do for each particular algorithm -
increase or decrease? :) 

moreover, when decreasing -A to 64k. speed loss for some algorithms
can be also significant - up to 1.5 times. of course, this is the same
algorithms that take advantage when -A is significantly increased (see
first paragraph for explanations). this problem don't allow me to
recommend automatic decreasing of -A for low-level cpus with GHC 6.4,
but fortunately it should almost gone in ghc 6.6

taking this all into account, i propose the following:

1) GHC faqs already suggests using of RTS -A/-H option to speed up
compilation. i propose to move this suggestion right to compiler itself
and add the line

char *ghc_rts_opts = "-A10m";

to GHC 6.4 sources

2) i propose to write "L2 cache size detection" code and use it in GHC
6.6 RTS to setup initial value of "-A" option. in order to allow program
tune itself to any cpu architecture, with cache sizes ranging from
128kb to 4mb. this will allow low-level cpus to run significantly
faster on some algorithms (up to 2x, as i said above) and can give
5-10% speedup for high-level cpus, that is also not so bad :)

-- 
Best regards,
 Bulat                          mailto:Bulat.Ziganshin@gmail.com

"+RTS -A" parameter and CPU cache size

Bulat Ziganshin

Simon Marlow

Bulat Ziganshin

Simon Marlow

Bulat Ziganshin

Simon Marlow

tags

participants (3)