
On Sat, 2006-08-12 at 16:58 +0400, Bulat Ziganshin wrote:
Hello skaller,
hi .. :)
The problem will occur if the 'stack' is aged: in that case the sweep can miss the mutation and reap a live object.
well, i don't know how updatable vars on the stack interoperate with GC. let someone else answer this question.
I know very little about Haskell, let alone GHC internals (I'm mainly using Ocaml at the moment, to implement my own programming language).

Haskell seems to be on the right track in many ways: I view the desirable upgrade path for programmers as roughly

    C --> C++ --> Felix --> ML --> Haskell

which represents a gradual move towards a more functional and declarative style. It is not just that this style is 'easier to reason about' for the human programmer, and thus provides better ways of ensuring correctness, but also that it is 'easier to reason about' for machine algorithms, and thus provides for better optimisation and much faster code.

The move to multi-processing on desktop computers seems to accelerate the need for better languages than C++ (don't even mention that J*** language .. :)
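To give one standard example of what 'easier to reason about for machine algorithms' buys you (my illustration, essentially the map fusion example from GHC's RULES documentation): because the functions involved cannot have side effects, the compiler is free to apply rewrites like this without changing the meaning of the program:

module MapFusion where

{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}

-- Two list traversals collapse into one (with -O, when the rule fires);
-- this is only sound because neither function can have side effects.
twice :: [Int] -> [Int]
twice xs = map (+1) (map (*2) xs)
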
but afaik GHC 6.5 supports multi-cpu execution of multiple Haskell threads (which makes it a better language for modern multi-core cpus), tail-call optimization and 2-generation GC (actually, it supports any number of generations). also, GHC supports mutable arrays.

you can search the GHC bug tickets for the one about optimizing GC with mutable vars: in ghc64 _each_ GC, including minor ones, needs to scan _every_ updatable reference (IORef/STRef) and _each_ element of _every_ updatable array (IOArray/STArray), and this significantly slows down many programs, including GHC itself. in ghc65 a better scheme is now used
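to make the cost concrete (just a sketch of mine, not the actual ghc code): any long-lived mutable cell can be overwritten to point at a freshly allocated object, i.e. an old-to-young pointer. a minor gc is only correct if it either rescans every such cell (the ghc64 behaviour) or records the writes somehow, e.g. via a write barrier / remembered set (presumably roughly what the new scheme does):

import Data.IORef

main :: IO ()
main = do
  -- 'cell' lives long enough to be promoted to an older generation ...
  cell <- newIORef (0 :: Int)
  let loop :: Int -> IO ()
      loop 0 = return ()
      loop n = do
        -- ... but keeps being overwritten with freshly allocated values,
        -- creating old-to-young pointers the minor GC must not miss
        writeIORef cell n
        loop (n - 1)
  loop 1000000
  readIORef cell >>= print   -- prints 1, the last value written
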
Cool. I guess what I'm saying is: there seems to be a contradiction between multi-processing and mutation, or perhaps more specifically shared state. The tail-rec and array optimisation stuff may map functional code onto a procedural model and obtain good performance on a uni-processor .. but perhaps it does so at the cost of bad performance on a multi-processor?

It may turn out that the underlying reason for preferring a lazy purely functional high level programming language is also a reason for preferring a similar low level execution model. In particular .. it seems to me mutation defeats high performance garbage collection of the very kind Haskell already uses. The issue isn't making the collector faster .. but being able to run it asynchronously.

We all surely conceive of the functional model as constructive: you make new objects as you need them, but never change old ones. This is basically how 'mathematics' works. You don't worry about objects you don't need any more. Since we don't have infinite memory on real computers, we use a garbage collector to recycle the unreachable store, and we all 'conceive' of that as a transparent background process.

Clearly, we need not just an asynchronous reaper -- we also need to be able to execute multiple reapers in parallel, otherwise the system will not scale to a large number of CPUs. But the state of the art is two stages behind that requirement: Haskell still has to stop the world to do a major collection. My suggestion is that the cause of this is precisely that the independent per-thread young heaps can be aged, so that the aged objects, now in the shared major heap, are mutable. To prevent the mutable parts changing during a collection, all the threads have to be stopped. Do I understand correctly that this is what actually happens at the moment?

To fix the problem, we have to prevent mutable store ever getting into the major heap. That can be done by throwing out the tail-rec, array, and other optimisations that 'cheat' the collector by manually reusing store and bypassing collection. That works for a single thread only because you can't collect and spew objects at the same time, which ensures synchronisation.

So I'm bringing into question whether these nice 'optimisations' are actually worthwhile. They actually seem to *degrade* performance, not improve it, when we're running on a large number of CPUs. If you have 20,000 CPUs, stopping the world will happen so often that all the CPUs will be idle 99.99% of the time :) Removing the optimisations would slow down each CPU, but remove any need to stop the world for long periods. The world WILL still have to pause to serialise aging the young heaps. It would be even better if the collection space could somehow be partitioned so you could dedicate say 1% of the CPUs to collection, that is, support multiple parallel background collectors: that is mandatory for scalability.

I note in passing that because the problem is with *aging* mutable store .. the problem also goes away if it is not aged. In particular, if all mutable objects are forcibly retained in the young heap, there's no problem. For tail-rec optimisation this is probably viable, and the implementation is possibly quite simple, at least in principle: treat the machine stack as an 'additional' generation, with the special property that it only permits FILO allocation in the usual way. This means (a) no copying into the major heap and (b) no compaction (which is unnecessary).
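Just to make concrete what I mean (my own sketch, nothing GHC-specific): the canonical case is a strict tail-recursive loop whose only state is an accumulator and a counter. A compiler that turns this into a loop just reuses a couple of words for acc and n on each iteration -- exactly the couple of mutable variables I refer to below -- and if those words live on the machine stack they never need to be aged into the shared major heap at all:

{-# LANGUAGE BangPatterns #-}

sumTo :: Int -> Int
sumTo = go 0
  where
    -- The only state threaded through the tail call is (acc, n); a loop
    -- implementation can overwrite those two stack slots in place on each
    -- iteration instead of allocating anything in a shared heap.
    go !acc 0 = acc
    go !acc n = go (acc + n) (n - 1)

main :: IO ()
main = print (sumTo 1000000)
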
If the system supported this, there should be no problem using the stack for a couple of mutable variables needed for a tail-rec optimisation. [Arrays might be a different story, don't know]

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net