
On Sat, 2006-08-12 at 16:58 +0400, Bulat Ziganshin wrote:
Hello skaller,
hi .. :)
The problem will occur if the 'stack' is aged: in that case the sweep can miss the mutation and reap a live object.
well, i don't know how updatable vars on the stack interoperate with GC. let someone else answer this question.
I know very little about Haskell, let alone GHC internals (I'm mainly using Ocaml at the moment, to implement my own programming language).

Haskell seems to be on the right track in many ways: I view the desirable upgrade path for programmers as roughly

    C --> C++ --> Felix --> ML --> Haskell

which represents a gradual move towards a more functional and declarative style. It is not just that this style is 'easier to reason about' for the human programmer, and thus provides better ways of ensuring correctness, but also that it is 'easier to reason about' for machine algorithms, and thus provides for better optimisation and much faster code.

The move to multi-processing on desktop computers seems to accelerate the need for better languages than C++ (don't even mention that J*** language .. :)
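To give one standard example of what 'easier to reason about for machine algorithms' buys you (my illustration, essentially the map fusion example from GHC's RULES documentation): because the functions involved cannot have side effects, the compiler is free to apply rewrites like this without changing the meaning of the program:

module MapFusion where

{-# RULES
"map/map"  forall f g xs.  map f (map g xs) = map (f . g) xs
  #-}

-- Two list traversals collapse into one (with -O, when the rule fires);
-- this is only sound because neither function can have side effects.
twice :: [Int] -> [Int]
twice xs = map (+1) (map (*2) xs)
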
but afaik GHC 6.5 supports multi-cpu execution of multiple Haskell threads (which makes it a better language for modern multi-core cpus), tail-call optimization and 2-generation GC (actually, it supports any number of generations). also, GHC supports mutable arrays.

you can search the GHC bug tickets for the one about optimizing GC with mutable vars: in ghc64 _each_ GC, including minor ones, needs to scan _every_ updatable reference (IORef/STRef) and _each_ element of _every_ updatable array (IOArray/STArray), and this significantly slows down many programs, including GHC itself. in ghc65 a better scheme is now used
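to make the cost concrete (just a sketch of mine, not the actual ghc code): any long-lived mutable cell can be overwritten to point at a freshly allocated object, i.e. an old-to-young pointer. a minor gc is only correct if it either rescans every such cell (the ghc64 behaviour) or records the writes somehow, e.g. via a write barrier / remembered set (presumably roughly what the new scheme does):

import Data.IORef

main :: IO ()
main = do
  -- 'cell' lives long enough to be promoted to an older generation ...
  cell <- newIORef (0 :: Int)
  let loop :: Int -> IO ()
      loop 0 = return ()
      loop n = do
        -- ... but keeps being overwritten with freshly allocated values,
        -- creating old-to-young pointers the minor GC must not miss
        writeIORef cell n
        loop (n - 1)
  loop 1000000
  readIORef cell >>= print   -- prints 1, the last value written
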
Cool. I guess what I'm saying is: there seems to be a contradiction between multi-processing and mutation, or perhaps more specifically shared state. The tail-rec and array optimisation stuff may map functional code onto a procedural model and obtain good performance on a uni-processor .. but perhaps it does so at the cost of bad performance on a multi-processor?

It may turn out that the underlying reason for preferring a lazy purely functional high level programming language is also a reason for preferring a similar low level execution model. In particular .. it seems to me mutation defeats high performance garbage collection of the very kind Haskell already uses. The issue isn't making the collector faster .. but being able to run it asynchronously.

We all surely conceive of the functional model as constructive: you make new objects as you need them, but never change old ones. This is basically how 'mathematics' works. You don't worry about objects you don't need any more. Since we don't have infinite memory on real computers, we use a garbage collector to recycle the unreachable store, and we all 'conceive' of that as a transparent background process.

Clearly, we need not just an asynchronous reaper -- we also need to be able to execute multiple reapers in parallel, otherwise the system will not scale to a large number of CPUs. But the state of the art is two stages behind that requirement: Haskell still has to stop the world to do a major collection. My suggestion is that the cause of this is precisely that the independent per-thread young heaps can be aged, so that the aged objects, now in the shared major heap, are mutable. To prevent the mutable parts changing during a collection, all the threads have to be stopped. Do I understand correctly that this is what actually happens at the moment?

To fix the problem, we have to prevent mutable store ever getting into the major heap. That can be done by throwing out the tail-rec, array, and other optimisations that 'cheat' the collector by manually reusing store and bypassing collection. That works for a single thread only because you can't collect and spew objects at the same time, which ensures synchronisation.

So I'm bringing into question whether these nice 'optimisations' are actually worthwhile. They actually seem to *degrade* performance, not improve it, when we're running on a large number of CPUs. If you have 20,000 CPUs, stopping the world will happen so often that all the CPUs will be idle 99.99% of the time :) Removing the optimisations would slow down each CPU, but remove any need to stop the world for long periods. The world WILL still have to pause to serialise aging the young heaps. It would be even better if the collection space could somehow be partitioned so you could dedicate say 1% of the CPUs to collection, that is, support multiple parallel background collectors: that is mandatory for scalability.

I note in passing that because the problem is with *aging* mutable store .. the problem also goes away if it is not aged. In particular, if all mutable objects are forcibly retained in the young heap, there's no problem. For tail-rec optimisation this is probably viable, and the implementation is possibly quite simple, at least in principle: treat the machine stack as an 'additional' generation, with the special property that it only permits FILO allocation in the usual way. This means (a) no copying into the major heap and (b) no compaction (which is unnecessary).
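Just to make concrete what I mean (my own sketch, nothing GHC-specific): the canonical case is a strict tail-recursive loop whose only state is an accumulator and a counter. A compiler that turns this into a loop just reuses a couple of words for acc and n on each iteration -- exactly the couple of mutable variables I refer to below -- and if those words live on the machine stack they never need to be aged into the shared major heap at all:

{-# LANGUAGE BangPatterns #-}

sumTo :: Int -> Int
sumTo = go 0
  where
    -- The only state threaded through the tail call is (acc, n); a loop
    -- implementation can overwrite those two stack slots in place on each
    -- iteration instead of allocating anything in a shared heap.
    go !acc 0 = acc
    go !acc n = go (acc + n) (n - 1)

main :: IO ()
main = print (sumTo 1000000)
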
If the system supported this, there should be no problem using the stack for a couple of mutable variables needed for a tail-rec optimisation. [Arrays might be a different story, don't know]

--
John Skaller <skaller at users dot sf dot net>
Felix, successor to C++: http://felix.sf.net