Re: [Haskell-cafe] Basic questions about concurrency in Haskell

Bringing the cafe back in.
If I remember correctly, tuning the GC is one of the things they worked on for the next release (in relation to parallelism). Here's a link to the paper:
http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/multic...
You can allocate dynamic memory on the C heap using the
Foreign.Marshal.Alloc stuff. The Storable class allows you to specify
alignment.
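Something like this is the basic pattern (a minimal, untested sketch; malloc takes the block size from the Storable instance):

import Foreign.Marshal.Alloc (malloc, free)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

main :: IO ()
main = do
  -- malloc picks the block size from the Storable instance for Int
  p <- malloc :: IO (Ptr Int)
  poke p 42
  n <- peek p
  print n
  free p  -- memory on the C heap is not garbage collected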
However, it's not clear you will need it in all cases. Most of the time, IME, all you really need is for "the business end" of objects A and B to end up on different cache lines; padding them with the size of a cache line will accomplish that even if each object starts in the middle of a cache line. In other words, as long as you ensure that there's at least CACHE_LINE bytes between the last bit of data in A and the first bit of data in B, they won't exhibit false sharing.
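E.g. something like this (untested sketch; Counter is a made-up example type, and 64-byte lines are an assumption, check your target CPU):

import Foreign.Marshal.Alloc (free)
import Foreign.Marshal.Array (mallocArray)
import Foreign.Ptr (Ptr, castPtr)
import Foreign.Storable

-- A hypothetical per-thread counter, padded so that consecutive
-- counters in an array can never share a cache line.
newtype Counter = Counter Int

cacheLine :: Int
cacheLine = 64  -- assumption: 64-byte cache lines, typical for x86

instance Storable Counter where
  -- Round the size up to a whole cache line: even if the first
  -- counter starts mid-line, the next one begins a full line later.
  sizeOf _    = cacheLine
  alignment _ = alignment (0 :: Int)
  peek p      = fmap Counter (peek (castPtr p))
  poke p (Counter n) = poke (castPtr p) n

main :: IO ()
main = do
  arr <- mallocArray 2 :: IO (Ptr Counter)  -- two counters, two lines
  pokeElemOff arr 0 (Counter 1)
  pokeElemOff arr 1 (Counter 2)
  Counter a <- peekElemOff arr 0
  Counter b <- peekElemOff arr 1
  print (a, b)
  free arr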
I'm not sure if any of the "standard" mutable references etc. respect this. Maybe someone can clarify? If you modify a thread-local IORef, will that cause another thread to stall because it has its own thread-local IORef on the same cache line?
On Fri, Aug 7, 2009 at 12:42 PM, Thomas Witzel wrote:
I have not yet been able to get the dev version compiled, but I'm looking into it (basically it compiles to 99% and then stops because of a missing or defective .mk file). Anyhow, I played more with the release version, and one thing I noticed is that the GC time increases almost linearly with the number of cores. Is there a document somewhere that explains the overall architecture of the GHC RTS, and how the GC in particular behaves in concurrent situations?
As for your comment regarding padding the data structures: I'll of course also need to control the alignment in such a case. Is there such a thing as explicit dynamic memory allocation in Haskell?
Thanks, Thomas
On Wed, Aug 5, 2009 at 3:16 PM, Sebastian Sylvan wrote:
GHC doesn't have per-thread allocation, so it's probably a bit tricky to get that working. Plus, for parallelism it's not clear that a piece of data is necessarily "owned" by one thread, since it could be produced by one spark and consumed by another; those two independent sparks may not necessarily occupy the same thread, which means that any *other* data accessed by the first thread could thrash the cache. So really you'd need per-spark allocation areas, which would probably make sparks very heavyweight. In other words, I think there's plenty of research that needs to be done w.r.t. scheduling things in time and space so as to avoid false sharing.
You could, of course, always chunk your work manually, and make sure that each "chunk" works on a big block that won't share cache lines with anything else (e.g. by padding the data structures); see the sketch at the end of this message.
Also, while GHC does a fair bit of mutation on its own internal data (thunks etc.), most of the "user data" is read-only, which should help. I.e. once a cache line has been filled up, there won't be any synchronisation needed on that data again.
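Something along these lines is what I mean by manual chunking (an untested sketch; the chunk size is a made-up number you'd want to tune):

import Control.Parallel (par, pseq)
import Data.List (foldl')

-- Split the input into coarse chunks so each spark does a
-- meaningful amount of work on its own block of data.
chunks :: Int -> [a] -> [[a]]
chunks _ [] = []
chunks n xs = let (h, t) = splitAt n xs in h : chunks n t

-- Sum each chunk in parallel, then combine the per-chunk results.
parSum :: Int -> [Int] -> Int
parSum chunkSize xs = foldl' (+) 0 (sparkAll partials)
  where
    partials = map (foldl' (+) 0) (chunks chunkSize xs)
    -- Spark each chunk sum and force the spine, so all sparks are
    -- created before the main thread starts consuming the results.
    sparkAll []     = []
    sparkAll (y:ys) = y `par` (rest `pseq` (y : rest))
      where rest = sparkAll ys

main :: IO ()
main = print (parSum 4096 [1 .. 1000000])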
On Wed, Aug 5, 2009 at 8:04 PM, Thomas Witzel wrote:
I'll try that. I'd like to stick with it. As for the memory: although it's probably quite a bit of work, it should be doable to have code generated where the threads have their own, non-overlapping memory pages, so that the CPUs don't go into a cache-thrashing death-match. I'll spend some more time with Haskell and then go from there.
On Wed, Aug 5, 2009 at 3:01 PM, Sebastian Sylvan wrote:
On Wed, Aug 5, 2009 at 6:59 PM, Thomas Witzel <witzel.thomas@gmail.com> wrote:
2. I started with the very simple nfib example given in the manual for Control.Parallel (Section 7.18). On my systems, using multiple cores makes the code actually slower than just using a single core. While the manual cautions that this could be the case for certain algorithms, I'm wondering whether this is the desired behaviour for this example. I'm using GHC 6.10.4 right now.
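For reference, the nfib example looks essentially like this (from memory; compile with -threaded and run with e.g. +RTS -N2):

import Control.Parallel (par, pseq)

-- Spark the first recursive call in parallel, evaluate the second
-- one locally, then combine the two results.
nfib :: Int -> Int
nfib n
  | n <= 1    = 1
  | otherwise = par n1 (pseq n2 (n1 + n2 + 1))
  where
    n1 = nfib (n - 1)
    n2 = nfib (n - 2)

main :: IO ()
main = print (nfib 30)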
IIRC the development version of GHC has had some major work done to optimize concurrency, so it may be worth trying that. In particular, I believe it executes sparks in batches to reduce the overhead (which hopefully fixes your issue).
-- Sebastian Sylvan