Bringing the cafe back in.

If I remember correctly, tuning the GC is one of the things they worked on for the next release (in relation to parallelism).
Here's a link to the paper: http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/multicore-ghc.pdf

You can allocate dynamic memory on the C heap using Foreign.Marshal.Alloc, and the Storable class lets you specify alignment.
However, it's not clear you'll need that in all cases. Most of the time, IME, all you really need is for "the business end" of objects A and B to end up on different cache lines; padding each of them out to the size of a cache line will accomplish that even if each object starts in the middle of a cache line. In other words, as long as you ensure that there are at least CACHE_LINE bytes between the last bit of data in A and the first bit of data in B, they won't exhibit false sharing.
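Here's a rough sketch of what I mean, done with plain mallocBytes rather than aligned allocation. The 64-byte line size and the helper names (cacheLine, mallocPadded, slotPtr) are my own assumptions, not anything standard:

```haskell
import Foreign.Marshal.Alloc (mallocBytes, free)
import Foreign.Ptr (Ptr, plusPtr)
import Foreign.Storable (pokeByteOff, peekByteOff)
import Data.Int (Int64)

-- Assumed cache-line size; 64 bytes on most current x86 CPUs.
cacheLine :: Int
cacheLine = 64

-- Allocate n slots on the C heap, each padded out to one full cache
-- line, so no two slots can ever share a line (even if the block
-- itself starts mid-line, slots are a full line apart).
mallocPadded :: Int -> IO (Ptr ())
mallocPadded n = mallocBytes (n * cacheLine)

-- Address of slot i: at least cacheLine bytes from any other slot.
slotPtr :: Ptr () -> Int -> Ptr ()
slotPtr base i = base `plusPtr` (i * cacheLine)

main :: IO ()
main = do
  base <- mallocPadded 2
  pokeByteOff (slotPtr base 0) 0 (42 :: Int64)
  pokeByteOff (slotPtr base 1) 0 (7  :: Int64)
  a <- peekByteOff (slotPtr base 0) 0 :: IO Int64
  b <- peekByteOff (slotPtr base 1) 0 :: IO Int64
  free base
  print (a, b)  -- prints (42,7)
```

If you do need a specific starting alignment as well, alignedAlloc-style behaviour can be had by over-allocating and rounding the pointer up, but for avoiding false sharing the per-slot padding above is enough.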

I'm not sure if any of the "standard" mutable references etc. respect this. Maybe someone can clarify? If you modify a thread-local IORef, will that cause another thread to stall because it has its own thread-local IORef on the same cache line?
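Until someone does clarify, one way to sidestep the question is to not use adjacent IORefs at all and instead give each thread a counter in a C-heap block, strided one cache line apart, so each thread's slot is guaranteed its own line. A sketch (again assuming a 64-byte line; runCounters is a made-up name):

```haskell
import Control.Concurrent (forkIO, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM, forM_, replicateM_)
import Foreign.Marshal.Alloc (mallocBytes, free)
import Foreign.Ptr (Ptr)
import Foreign.Storable (pokeByteOff, peekByteOff)
import Data.Int (Int64)

-- Assumed cache-line size.
cacheLine :: Int
cacheLine = 64

-- Run nThreads threads, each bumping its own Int64 counter iters
-- times. Counters live cacheLine bytes apart, so no two threads
-- ever write to the same line; each slot is touched by exactly one
-- thread, so the non-atomic peek/poke pair is safe here.
runCounters :: Int -> Int -> IO Int64
runCounters nThreads iters = do
  counters <- mallocBytes (nThreads * cacheLine) :: IO (Ptr Int64)
  forM_ [0 .. nThreads - 1] $ \i ->
    pokeByteOff counters (i * cacheLine) (0 :: Int64)
  dones <- forM [0 .. nThreads - 1] $ \i -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      let off = i * cacheLine
      replicateM_ iters $ do
        v <- peekByteOff counters off :: IO Int64
        pokeByteOff counters off (v + 1)
      putMVar done ()
    return done
  mapM_ takeMVar dones
  total <- fmap sum $ forM [0 .. nThreads - 1] $ \i ->
             (peekByteOff counters (i * cacheLine) :: IO Int64)
  free counters
  return total

main :: IO ()
main = runCounters 4 100000 >>= print  -- prints 400000
```

Compile with -threaded and run with +RTS -N to actually get the threads on separate cores; the result is the same either way, but whether it's *faster* than packed counters is exactly the false-sharing question.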

On Fri, Aug 7, 2009 at 12:42 PM, Thomas Witzel <witzel.thomas@gmail.com> wrote:
I have not yet been able to get the dev version compiled, but I'm
looking into it (it basically compiles to 99% and then stops because
of a missing or defective .mk file). Anyhow, I played more with the
release version, and one thing I noticed is that GC time increases
almost linearly with the number of cores. Is there a document
somewhere that explains the overall architecture of the GHC RTS and
how the GC in particular acts in concurrent situations?

As for your comment regarding padding the data structures, I'll of
course also need to control the alignment in such a case. Is there
such a thing as explicit dynamic memory allocation in Haskell?

Thanks, Thomas

On Wed, Aug 5, 2009 at 3:16 PM, Sebastian
Sylvan<sebastian.sylvan@gmail.com> wrote:
> GHC doesn't have per-thread allocation, so it's probably a bit tricky to get
> that working. Plus, for parallelism it's not clear that a piece of data is
> necessarily "owned" by one thread, since it could be produced by one spark
> and consumed by another; those two independent sparks may not necessarily
> occupy the same thread, which means that any *other* data accessed by the
> first thread could thrash the cache. So really you'd need per-spark
> allocation areas, which would probably make sparks very heavyweight.
> In other words, I think there's plenty of research that needs to be done
> w.r.t. scheduling things in time and space so as to avoid false sharing. You
> could, of course, always chunk your work manually, and make sure that each
> "chunk" works on a big block that won't share cache lines with anything else
> (e.g. by padding the data structures).
> Also, while GHC does a fair bit of mutation on its own internal data (thunks
> etc.), most of the "user data" is read-only, which should help. I.e. once a
> cache line has been filled in, there won't be any synchronisation needed on
> that data again.
>
> On Wed, Aug 5, 2009 at 8:04 PM, Thomas Witzel <witzel.thomas@gmail.com>
> wrote:
>>
>> I'll try that. I'd like to stick with it. As for the memory, although
>> its probably quite a bit of work, it should be doable to have code
>> generated where the threads have their own, non-overlapping, memory
>> pages, so that the CPUs don't go into a cache-thrashing death-match.
>> I'll spend some more time with Haskell and then go from there.
>>
>> On Wed, Aug 5, 2009 at 3:01 PM, Sebastian
>> Sylvan<sebastian.sylvan@gmail.com> wrote:
>> >
>> >
>> > On Wed, Aug 5, 2009 at 6:59 PM, Thomas Witzel <witzel.thomas@gmail.com>
>> > wrote:
>> >>
>> >> 2. I started with the very simple nfib example given in the manual for
>> >> Control.Parallel (Section 7.18). On my systems using multiple cores
>> >> makes the code actually slower than just using a single core. While
>> >> the manual cautions that this could be the case for certain
>> >> algorithms, I'm wondering whether this is the desired behaviour for
>> >> this example.
>> >>
>> >> I'm using ghc 6.10.4 right now.
>> >
>> > IIRC the development version of GHC has had some major work done to
>> > optimize concurrency, so it may be worth trying that. In particular I
>> > believe it executes sparks in batches, to reduce the overhead (which
>> > hopefully fixes your issue).
>> >
>> > --
>> > Sebastian Sylvan
>> >
>
>
>
> --
> Sebastian Sylvan
>



--
Sebastian Sylvan