Re: [Haskell-cafe] Basic questions about concurrency in Haskell

Bringing the cafe back in.
If I remember correctly, tuning the GC is one of the things they worked on for the next release (in relation to parallelism). Here's a link to the paper:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/multic...
You can allocate dynamic memory on the C heap using the
Foreign.Marshal.Alloc stuff. The Storable class allows you to specify
alignment.
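Something like this is the basic pattern (a minimal, untested sketch; malloc takes the block size from the Storable instance):

import Foreign.Marshal.Alloc (malloc, free)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

main :: IO ()
main = do
  -- malloc picks the block size from the Storable instance for Int
  p <- malloc :: IO (Ptr Int)
  poke p 42
  n <- peek p
  print n
  free p  -- memory on the C heap is not garbage collected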
However, it's not clear you will need it in all cases. Most of the time, IME, all you really need is for "the business end" of objects A and B to end up on different cache lines; padding them with the size of a cache line will accomplish that even if each object starts in the middle of a cache line. In other words, as long as you ensure that there's at least CACHE_LINE bytes between the last bit of data in A and the first bit of data in B, they won't exhibit false sharing.
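E.g. something like this (untested sketch; Counter is a made-up example type, and 64-byte lines are an assumption, check your target CPU):

import Foreign.Marshal.Alloc (free)
import Foreign.Marshal.Array (mallocArray)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable

-- A hypothetical per-thread counter, padded so that consecutive
-- counters in an array can never share a cache line.
newtype Counter = Counter Int

cacheLine :: Int
cacheLine = 64  -- assumption: 64-byte cache lines, typical for x86

instance Storable Counter where
  -- Round the size up to a whole cache line: even if the first
  -- counter starts mid-line, the next one begins a full line later.
  sizeOf _    = cacheLine
  alignment _ = alignment (0 :: Int)
  peek p      = fmap Counter (peek (castPtr p))
  poke p (Counter n) = poke (castPtr p) n

main :: IO ()
main = do
  arr <- mallocArray 2 :: IO (Ptr Counter)  -- two counters, two lines
  pokeElemOff arr 0 (Counter 1)
  pokeElemOff arr 1 (Counter 2)
  Counter a <- peekElemOff arr 0
  Counter b <- peekElemOff arr 1
  print (a, b)
  free arr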
I'm not sure if any of the "standard" mutable references etc. respect this. Maybe someone can clarify? If you modify a thread-local IORef, will that cause another thread to stall because it has its own thread-local IORef on the same cache line?
On Fri, Aug 7, 2009 at 12:42 PM, Thomas Witzel wrote:
I have not yet been able to get the dev version compiled, but I'm looking into it (basically it compiles to 99% and then stops because of a missing or defective .mk file). Anyhow, I played more with the release version, and one thing I noticed is that the GC time increases almost linearly with the number of cores. Is there a document somewhere that explains the overall architecture of the GHC RTS, and how the GC in particular behaves in concurrent situations?
As for your comment regarding padding the data structures: I'll of course also need to control the alignment in such a case. Is there such a thing as explicit dynamic memory allocation in Haskell?
Thanks, Thomas
On Wed, Aug 5, 2009 at 3:16 PM, Sebastian Sylvan wrote:
GHC doesn't have per-thread allocation, so it's probably a bit tricky to get that working. Plus, for parallelism it's not clear that a piece of data is necessarily "owned" by one thread, since it could be produced by one spark and consumed by another; those two independent sparks may not necessarily occupy the same thread, which means that any *other* data accessed by the first thread could thrash the cache. So really you'd need per-spark allocation areas, which would probably make sparks very heavyweight. In other words, I think there's plenty of research that needs to be done w.r.t. scheduling things in time and space so as to avoid false sharing.
You could, of course, always chunk your work manually, and make sure that each "chunk" works on a big block that won't share cache lines with anything else (e.g. by padding the data structures); see the sketch at the end of this message.
Also, while GHC does a fair bit of mutation on its own internal data (thunks etc.), most of the "user data" is read-only, which should help. I.e. once a cache line has been filled up, there won't be any synchronisation needed on that data again.
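Something along these lines is what I mean by manual chunking (an untested sketch; the chunk size is a made-up number you'd want to tune):

import Control.Parallel (par, pseq)
import Data.List (foldl')

-- Split the input into coarse chunks so each spark does a
-- meaningful amount of work on its own block of data.
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (h, t) = splitAt n xs in h : chunks n t

-- Sum each chunk in parallel, then combine the per-chunk results.
parSum :: Int -> [Int] -> Int
parSum chunkSize xs = foldl' (+) 0 (sparkAll partials)
  where
    partials = map (foldl' (+) 0) (chunks chunkSize xs)
    -- Spark each chunk sum and force the spine, so all sparks are
    -- created before the main thread starts consuming the results.
    sparkAll []     = []
    sparkAll (y:ys) = y `par` (rest `pseq` (y : rest))
      where rest = sparkAll ys

main :: IO ()
main = print (parSum 4096 [1 .. 1000000])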
On Wed, Aug 5, 2009 at 8:04 PM, Thomas Witzel wrote:
I'll try that. I'd like to stick with it. As for the memory: although it's probably quite a bit of work, it should be doable to have code generated where the threads have their own, non-overlapping memory pages, so that the CPUs don't go into a cache-thrashing death-match. I'll spend some more time with Haskell and then go from there.
On Wed, Aug 5, 2009 at 3:01 PM, Sebastian Sylvan wrote:
On Wed, Aug 5, 2009 at 6:59 PM, Thomas Witzel <witzel.thomas@gmail.com> wrote:
2. I started with the very simple nfib example given in the manual for Control.Parallel (Section 7.18). On my systems, using multiple cores makes the code actually slower than just using a single core. While the manual cautions that this could be the case for certain algorithms, I'm wondering whether this is the desired behaviour for this example. I'm using GHC 6.10.4 right now.
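For reference, the nfib example looks essentially like this (from memory; compile with -threaded and run with e.g. +RTS -N2):

import Control.Parallel (par, pseq)

-- Spark the first recursive call in parallel, evaluate the second
-- one locally, then combine the two results.
nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = par n1 (pseq n2 (n1 + n2 + 1))
  where
    n1 = nfib (n - 1)
    n2 = nfib (n - 2)

main :: IO ()
main = print (nfib 30)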
IIRC the development version of GHC has had some major work done to optimize concurrency, so it may be worth trying that. In particular, I believe it executes sparks in batches to reduce the overhead (which hopefully fixes your issue).
-- Sebastian Sylvan