Google Summer of Code - Lock-free data structures

Hi everyone,

I would like to do the GSoC project outlined in http://hackage.haskell.org/trac/summer-of-code/ticket/1608

One of Haskell's great strengths is its support for all kinds of concurrent and parallel programming, so I think that the Haskell ecosystem would benefit from having a wider range of efficient concurrent data structures. Also, more selfishly, I remember reading the article in CACM last summer and thinking that the whole idea of updating data structures directly using atomic compare-and-swap was really cool, so I would love to have an excuse to play around with it.

Some (initial) ideas for lock-free data structures worth implementing:
1. A lock-free priority queue, e.g. using the approach based on skip-lists explained in [2]
2. Stacks, see [1] and [3]
3. Hash tables [4]
but if anyone has other suggestions, I would obviously be happy to hear them.

GSoC stretches over 13 weeks. I would estimate that implementing a data structure, writing tests, benchmarks, documentation etc. should not take more than 3 weeks (it is supposed to be full-time work, after all), which means that I could implement 4 of them in the time available and still have some slack.

About me: My name is Florian Hartwig, I'm a fifth-year (Master's) student in Computing Science at the University of Glasgow. I've been using Haskell for a bit more than two years now (both for university courses and my recreational programming), and I'm currently using it for my Master's project, so I won't have to spend any time learning the Haskell basics. I don't have any other plans for the summer that might conflict with the project.

I'd be thankful for any thoughts, questions and suggestions.

Cheers,
Florian

[1] http://cacm.acm.org/magazines/2011/3/105308-data-structures-in-the-multicore...
[2] http://www.sciencedirect.com/science/article/pii/S0743731504002333
[3] http://dl.acm.org/citation.cfm?id=1007944
[4] http://dl.acm.org/citation.cfm?id=564881
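
To give a flavour of the CAS-style updates mentioned above, here is a small illustrative sketch of a Treiber-style lock-free stack push. The compare-and-swap is modelled with atomicModifyIORef and an Eq check so the snippet is self-contained; a real implementation would use a hardware CAS instruction that compares by pointer identity instead.

  import Data.IORef

  -- a toy model of compare-and-swap, just to show the usage pattern
  casModel :: Eq a => IORef a -> a -> a -> IO Bool
  casModel ref expected new = atomicModifyIORef ref $ \cur ->
    if cur == expected then (new, True) else (cur, False)

  -- Treiber-style stack push: read the head, try to swing it to a new cons
  -- cell, and retry if another thread won the race in between
  push :: Eq a => IORef [a] -> a -> IO ()
  push ref x = do
    old <- readIORef ref
    ok  <- casModel ref old (x : old)
    if ok then return () else push ref x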

On Mar 18, 2012 6:39 PM, "Florian Hartwig"
GSoC stretches over 13 weeks. I would estimate that implementing a data structure, writing tests, benchmarks, documentation etc. should not take more than 3 weeks (it is supposed to be full-time work, after all), which means that I could implement 4 of them in the time available and still have some slack.
Don't underestimate the time required for performance tuning, and be careful to leave yourself learning time, unless you have already extensively used ThreadScope, read GHC Core, and worked with low-level strictness, unpacking, possibly even rewrite rules. I suspect that the measurable performance benefit from lockless data structures might be tricky to tease out of the noise created by unintentional strictness or unboxing issues. And we'd be much happier with one or two really production quality implementations than even six or seven at a student project level. -- Chris Smith
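
(To make the "strictness, unpacking" point concrete with a generic example, not from any particular library: a node with a strict, unpacked payload keeps the hot path free of thunks and extra indirections.)

  import Data.IORef (IORef)

  data Node = Node
    { value :: {-# UNPACK #-} !Int    -- stored unboxed in the constructor
    , next  :: !(IORef Node)          -- strict link, no thunk build-up
    }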

On 19 March 2012 00:59, Chris Smith wrote:
On Mar 18, 2012 6:39 PM, "Florian Hartwig" wrote:
GSoC stretches over 13 weeks. I would estimate that implementing a data structure, writing tests, benchmarks, documentation etc. should not take more than 3 weeks (it is supposed to be full-time work, after all), which means that I could implement 4 of them in the time available and still have some slack.
Don't underestimate the time required for performance tuning, and be careful to leave yourself learning time, unless you have already extensively used ThreadScope, read GHC Core, and worked with low-level strictness, unpacking, possibly even rewrite rules. I suspect that the measurable performance benefit from lockless data structures might be tricky to tease out of the noise created by unintentional strictness or unboxing issues. And we'd be much happier with one or two really production quality implementations than even six or seven at a student project level.
-- Chris Smith
Thank you, Hofstadter's law definitely rears its head in many of my projects. I do have some experience with ThreadScope and strictness issues, but I agree that I'm probably underestimating the time I need to learn. I also agree that my focus would be on quality rather than quantity. I quite like the modularity of this project, because it minimises the chance of having a lot of half-finished but useless code at the end of summer.

A lock-free concurrent queue alone would be worth a summer project IMO.

G

On Mon, Mar 19, 2012 at 3:25 AM, Florian Hartwig <florian.j.hartwig@gmail.com> wrote:
On 19 March 2012 00:59, Chris Smith wrote:
On Mar 18, 2012 6:39 PM, "Florian Hartwig" wrote:
GSoC stretches over 13 weeks. I would estimate that implementing a data structure, writing tests, benchmarks, documentation etc. should not take more than 3 weeks (it is supposed to be full-time work, after all), which means that I could implement 4 of them in the time available and still have some slack.
Don't underestimate the time required for performance tuning, and be careful to leave yourself learning time, unless you have already extensively used ThreadScope, read GHC Core, and worked with low-level strictness, unpacking, possibly even rewrite rules. I suspect that the measurable performance benefit from lockless data structures might be tricky to tease out of the noise created by unintentional strictness or unboxing issues. And we'd be much happier with one or two really production quality implementations than even six or seven at a student project level.
-- Chris Smith
Thank you, Hofstadter's law definitely rears its head in many of my projects. I do have some experience with ThreadScope and strictness issues, but I agree that I'm probably underestimating the time I need to learn. I also agree that my focus would be on quality rather than quantity. I quite like the modularity of this project, because it minimises the chance of having a lot of half-finished but useless code at the end of summer.
--
Gregory Collins

On 19 March 2012 09:56, Gregory Collins
A lock-free concurrent queue alone would be worth a summer project IMO.
G
Ryan Newton is already doing that (https://github.com/rrnewton/haskell-lockfree-queue).

The Michael-Scott lock-free queues in that repository pass tests and should work. Additional stress testing and feedback would be appreciated. There are some other queues in the literature that might be worth implementing, but I think other data structures are higher priority.

As Adam Foltzer mentioned in the trac ticket, a really good structure would be the concurrent bags from this paper: http://www.cse.chalmers.se/~tsigas/papers/Lock%20Free%20Bag%20SPAA11.pdf

We separately did a C implementation of those and they performed really well in our bake-off of producer/consumer data structures (vs. TBB queues and many other implementations). By the way, we can share the code for this little bake-off as a performance baseline for the Haskell versions.

I'm less familiar with the literature on concurrent hash tables. TBB's have been a little bit underwhelming in performance, but it's definitely an important structure. Ditto for priority queues.

In any case, I welcome your interest in the topic, Florian!

Cheers,
-Ryan

On Mon, Mar 19, 2012 at 7:33 AM, Florian Hartwig <florian.j.hartwig@gmail.com> wrote:
On 19 March 2012 09:56, Gregory Collins wrote:
A lock-free concurrent queue alone would be worth a summer project IMO.
G
Ryan Newton is already doing that (https://github.com/rrnewton/haskell-lockfree-queue).

On 19 March 2012 11:46, Ryan Newton
As Adam Foltzer mentioned in the trac ticket a really good structure would be the concurrent bags from this paper:
http://www.cse.chalmers.se/~tsigas/papers/Lock%20Free%20Bag%20SPAA11.pdf
We separately did a C implementation of those and they performed really well in our bake-off of producer consumer data structures (vs. TBB queues, and many other implementations).
I've just read the paper and they look both useful and interesting to implement. Adam mentioned that GHC would need to be extended first, and I can't really judge how much work that would be. Can anyone chime in on how feasible that is?
I'm less familiar with the literature on concurrent hash tables. TBB's have been a little bit underwhelming in performance. But it's definitely an important structure. Ditto for priority queues.
The data structures I mentioned were just my initial ideas, I'm open to any other suggestions. In my (limited) experience with parallel programming, queues and priority queues tend to come up quite a bit, but I'd be happy to get input from anyone with more experience regarding what would be useful/important.

On Tue, Mar 20, 2012 at 2:53 AM, Florian Hartwig < florian.j.hartwig@gmail.com> wrote:
On 19 March 2012 11:46, Ryan Newton wrote:
As Adam Foltzer mentioned in the trac ticket a really good structure would be the concurrent bags from this paper:
http://www.cse.chalmers.se/~tsigas/papers/Lock%20Free%20Bag%20SPAA11.pdf
Looks like they rely on thread-local storage, which would have to be worked around in Haskell somehow.
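
One possible workaround (a sketch of my own, not from the paper): give each capability its own sub-bag and pick the slot by asking which capability the calling thread is currently running on. Threads can migrate between capabilities, so the structure still has to tolerate occasional cross-slot access, but contention should stay low.

  import Control.Concurrent (myThreadId, threadCapability)

  -- pick a slot for the calling thread based on the capability (HEC) it is
  -- currently running on; use it to index a getNumCapabilities-sized array
  -- of per-capability sub-bags
  mySlot :: IO Int
  mySlot = do
    tid      <- myThreadId
    (cap, _) <- threadCapability tid
    return cap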
I've just read the paper and they look both useful and interesting to implement. Adam mentioned that GHC would need to be extended first, and I can't really judge how much work that would be. Can anyone chime in on how feasible that is?
GHC already has a CAS primitive on MutVar#, it just needs to be extended to MutableArray# and MutableByteArray# (at all of the bit-widths the CAS instruction would support, e.g. see readWordXxArray# in http://www.haskell.org/ghc/docs/latest/html/libraries/ghc-prim-0.2.0.0/GHC-P...). The implementation should be almost identical to casMutVar#.

In particular I think you need:

casMutArray# :: MutableArray# s a -> Int# -> a -> a -> State# s -> (# State# s, Int#, a #)

casWord16MutByteArray :: MutableByteArray# s -> Int# -> Word# -> Word# -> State# s -> (# State# s, Int#, Word# #)

and equivalents for Word32, Word64, Int16, Int32, Int64.
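
For reference, here is a rough sketch (not the actual GHC patch) of how the existing casMutVar# primop can already be wrapped at the IORef level; the new array primops would get analogous boxed wrappers. The 0#-means-success flag convention is my reading of the primop and should be double-checked against PrimOps.cmm.

  {-# LANGUAGE MagicHash, UnboxedTuples #-}
  import GHC.Exts (casMutVar#)
  import GHC.IO (IO(IO))
  import GHC.IORef (IORef(IORef))
  import GHC.STRef (STRef(STRef))

  -- compare-and-swap on an IORef: returns whether the swap happened and the
  -- value returned by the primop (the new value on success, the clashing
  -- value on failure)
  casIORef :: IORef a -> a -> a -> IO (Bool, a)
  casIORef (IORef (STRef var#)) old new = IO $ \s ->
    case casMutVar# var# old new s of
      (# s', flag, val #) -> case flag of
        0# -> (# s', (True,  val) #)
        _  -> (# s', (False, val) #)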
G
--
Gregory Collins

Hi again, I just submitted my proposal on the GSoC website. You can find it here: http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/florianha... I would be very grateful if someone could read over it and tell me if it makes sense and if/how it could be improved. Cheers, Florian

GHC already has a CAS primitive on MutVar#, it just needs to be extended to MutableArray# and MutableByteArray# (at all of the bit-widths the CAS instruction would support, e.g. see readWordXxArray# in http://www.haskell.org/ghc/docs/latest/html/libraries/ghc-prim-0.2.0.0/GHC-P...). The implementation should be almost identical to casMutVar#. In particular I think you need:

casMutArray# :: MutableArray# s a -> Int# -> a -> a -> State# s -> (# State# s, Int#, a #)

casWord16MutByteArray :: MutableByteArray# s -> Int# -> Word# -> Word# -> State# s -> (# State# s, Int#, Word# #)
FYI, I started working on adding these. I'm hoping to have it working in GHC HEAD for any students who need to use it.

To my knowledge the only two patches required to implement casMutVar# were these two (plus the preexisting cas() definition in SMP.h):
https://github.com/ghc/ghc/commit/521b792553bacbdb0eec138b150ab0626ea6f36b
https://github.com/ghc/ghc/commit/606f6e1cfcb2e79abaadcc5ed643817d2a4585d8
The latter is a bugfix to the former.

Florian, your proposal looks good to me (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/florianha...). You touched on the major things we need to know.

I just read in your proposal that you started looking into the casMutArray# issue as well. How far have you gotten with that? Do you want to work on this together a bit? I've got an implementation of a casArray# primop that passes a basic test, but I'm not sure if the GC write barrier is correct:
https://github.com/rrnewton/ghc/commit/18ed460be111b47a759486677960093d71eef...

The ByteArray versions will be more annoying, requiring more variations, but they are also less essential, because the user can always use ForeignPtr and bits-atomic in this case, and I believe for our concurrent data structures we want to store arbitrary pointers (hence casArray#).

Cheers,
-Ryan

On 29 March 2012 05:57, Ryan Newton
I just read in your proposal that you started looking into the casMutArray# issue as well. How far have you gotten with that? Do you want to work on this together a bit?
I've got an implementation of a casArray# primop that passes a basic test, but I'm not sure if the GC write barrier is correct:
https://github.com/rrnewton/ghc/commit/18ed460be111b47a759486677960093d71eef...
The write barrier is what I got stuck on as well. I don't know much about Haskell's GC. I'm going to read up on it, but my Master's project is due in 3 weeks, so it's difficult to find the time right now. I'd be happy to work with you on this, but it'll probably have to wait until then.

On 29 March 2012 at 12:01 PM, Heinrich Apfelmus <apfelmus at quantentunnel.de> wrote:
Since I don't know much about concurrency, I have a simple question: what is the difference between atomic compare-and-swap and software transactional memory? Both are lock-free?
Well, terminology-wise it would probably be more accurate to say that you can implement lock-free algorithms using both (i.e. it's not really CAS and STM that are lock-free, but the algorithms implemented using them). But maybe this is a pedantic distinction.

As to the difference between the two: CAS is just a machine instruction that takes an old "expected" value, a new value and an address, and updates the memory pointed to by the address iff it contains the old/expected value. All this happens atomically, so you avoid the potential conflicts between threads concurrently reading from and writing to the same address.

STM is much higher-level. It's an abstraction that allows the programmer to treat any sequence of reads and writes as a single atomic operation. This is, as the name says, implemented in software.

So while the two are related, CAS is a machine primitive that works for a single operation and on a single word, while STM is a software abstraction that isolates sequences of operations on multiple memory locations from each other.
Is it possible to implement every data structure based on CAS in terms of STM? What are the drawbacks? The other way round?
I think so. Atomically reading and writing a single memory location (which CAS does) is just a very simple transaction. But using a CAS instruction should be more efficient, since STM has to maintain a transaction log and commit transactions, which creates some overhead.

I hope this makes some sense.

Cheers,
Florian
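
To make the contrast concrete, a small self-contained example using the real STM API (Control.Concurrent.STM from the stm package): several reads and writes on two TVars commit as one atomic unit, which is exactly what a single-word CAS cannot express by itself.

  import Control.Concurrent.STM

  -- move n from one counter to another; both updates become visible
  -- atomically, or the whole transaction is retried
  transfer :: TVar Int -> TVar Int -> Int -> IO ()
  transfer from to n = atomically $ do
    x <- readTVar from
    writeTVar from (x - n)
    y <- readTVar to
    writeTVar to (y + n)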

Florian Hartwig wrote:
Heinrich Apfelmus wrote:
So while the two are related, CAS is a machine primitive that works for a single operation and on a single word while STM is a software abstraction that isolates sequences of operations on multiple memory locations from each other.
Is it possible to implement every data structure based on CAS in terms of STM? What are the drawbacks? The other way round?
I think so. Atomically reading and writing a single memory location (which CAS does) is just a very simple transaction. But using a CAS instruction should be more efficient, since STM has to maintain a transaction log and commit transactions, which creates some overhead.
Ah, I see. In that case, it may be worthwhile to implement the CAS instruction in terms of STM as well and measure the performance difference this makes for the final data structure. After all, STM is a lot more compositional than CAS, so I'd like to know whether the loss of expressiveness is worth the gain in performance. Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com
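
For what it's worth, single-word CAS really is only a few lines of STM, so the STM baseline is cheap to build; a minimal sketch (note it compares with Eq, whereas a hardware CAS compares by pointer identity, as Ryan points out below):

  import Control.Concurrent.STM

  -- compare-and-swap on a TVar, expressed as a tiny transaction
  casTVar :: Eq a => TVar a -> a -> a -> STM Bool
  casTVar tv expected new = do
    cur <- readTVar tv
    if cur == expected
      then writeTVar tv new >> return True
      else return False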

I think so. Atomically reading and writing a single memory location (which CAS does) is just a very simple transaction. But using a CAS instruction should be more efficient, since STM has to maintain a transaction log and commit transactions, which creates some overhead.
Ah, I see. In that case, it may be worthwhile to implement the CAS instruction in terms of STM as well and measure the performance difference this makes for the final data structure. After all, STM is a lot more compositional than CAS, so I'd like to know whether the loss of expressiveness is worth the gain in performance.
There's one annoying hitch with doing apples-to-apples comparisons here. The problem is that CAS falls short of being a hardware-accelerated version of a simple transaction (read TVar, (==) against expected value, conditionally update TVar), because CAS is based on pointer equality rather than true equality. (eq? rather than equal? for any Schemers out there.)

For this reason our "Fake" version of CAS -- for older GHCs and for performance comparison -- has to use reallyUnsafePtrEquality#: http://hackage.haskell.org/packages/archive/IORefCAS/0.2/doc/html/Data-CAS-I...

But we do provide a "CASable" type class in that package which is precisely for comparing the performance of:
- Hardware CAS
- Fake CAS -- atomicModifyIORef + ptrEquality
- Foreign CAS -- gcc intrinsic + function call

Cheers,
-Ryan
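
To show the shape of that "Fake CAS" idea (a sketch only, not the IORefCAS code itself; reallyUnsafePtrEquality# may report false negatives, which is acceptable because a CAS is always allowed to fail spuriously and be retried):

  {-# LANGUAGE MagicHash #-}
  import Data.IORef (IORef, atomicModifyIORef)
  import GHC.Exts (reallyUnsafePtrEquality#)

  -- compare by pointer identity inside an atomicModifyIORef; returns whether
  -- the swap happened and the value left in the IORef
  casIORefFake :: IORef a -> a -> a -> IO (Bool, a)
  casIORefFake ref expected new =
    atomicModifyIORef ref $ \cur ->
      case reallyUnsafePtrEquality# cur expected of
        1# -> (new, (True, new))     -- same object: install the new value
        _  -> (cur, (False, cur))    -- lost the race (or false negative): retry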

Hi,

Somewhat related to this... Next month we have a paper coming up at EuroSys about a middle-ground between using STM and programming directly with CAS: http://research.microsoft.com/en-us/um/people/tharris/papers/2012-eurosys.pd...

This was done in the context of shared memory data structures in C/C++, rather than Haskell. It might be interesting to see how the results carry over to Haskell, e.g. adding short forms of specialized transactions that interact correctly with normal STM-Haskell transactions. In the paper we have some examples of using short specialized transactions for the fast paths in data structures, while keeping the full STM available as a fall-back for expressing the cases that cannot be implemented using short transactions.

--Tim

-----Original Message-----
From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Heinrich Apfelmus
Sent: 29 March 2012 13:30
To: haskell-cafe@haskell.org
Subject: Re: [Haskell-cafe] Google Summer of Code - Lock-free data structures

Florian Hartwig wrote:
Heinrich Apfelmus wrote:
So while the two are related, CAS is a machine primitive that works for a single operation and on a single word while STM is a software abstraction that isolates sequences of operations on multiple memory locations from each other.
Is it possible to implement every data structure based on CAS in terms of STM? What are the drawbacks? The other way round?
I think so. Atomically reading and writing a single memory location (which CAS does) is just a very simple transaction. But using a CAS instruction should be more efficient, since STM has to maintain a transaction log and commit transactions, which creates some overhead.
Ah, I see. In that case, it may be worthwhile to implement the CAS instruction in terms of STM as well and measure the performance difference this makes for the final data structure. After all, STM is a lot more compositional than CAS, so I'd like to know whether the loss of expressiveness is worth the gain in performance.

Best regards,
Heinrich Apfelmus

-- http://apfelmus.nfshost.com

perhaps it is too late to suggest things for GSOC -- but stephen tetley on a different thread pointed at aaron turon's work, in which there's a very interesting new concurrency framework he calls "reagents" which seems to give the best of all worlds: it is declarative and compositional like STM, but gives performance akin to hand-coded lock-free data structures. he seems to have straddled the duality of isolation vs message-passing nicely, and can subsume things like actors and the join calculus.

http://www.ccs.neu.edu/home/turon/reagents.pdf

he has a BSD licensed library in scala at https://github.com/aturon/ChemistrySet

if someone doesn't want to pick this up for GSOC i might have a hand at implementing it myself.

b

On Mar 29, 2012, at 6:46 AM, Tim Harris (RESEARCH) wrote:
Hi,
Somewhat related to this...
Next month we have a paper coming up at EuroSys about a middle-ground between using STM and programming directly with CAS:
http://research.microsoft.com/en-us/um/people/tharris/papers/2012-eurosys.pd...
This was done in the context of shared memory data structures in C/C++, rather than Haskell.
It might be interesting to see how the results carry over to Haskell, e.g. adding short forms of specialized transactions that interact correctly with normal STM-Haskell transactions.
In the paper we have some examples of using short specialized transactions for the fast paths in data structures, while keeping the full STM available as a fall-back for expressing the cases that cannot be implemented using short transactions.
--Tim
-----Original Message----- From: haskell-cafe-bounces@haskell.org [mailto:haskell-cafe-bounces@haskell.org] On Behalf Of Heinrich Apfelmus Sent: 29 March 2012 13:30 To: haskell-cafe@haskell.org Subject: Re: [Haskell-cafe] Google Summer of Code - Lock-free data structures
Florian Hartwig wrote:
Heinrich Apfelmus wrote:
So while the two are related, CAS is a machine primitive that works for a single operation and on a single word while STM is a software abstraction that isolates sequences of operations on multiple memory locations from each other.
Is it possible to implement every data structure based on CAS in terms of STM? What are the drawbacks? The other way round?
I think so. Atomically reading and writing a single memory location (which CAS does) is just a very simple transaction. But using a CAS instruction should be more efficient, since STM has to maintain a transaction log and commit transactions, which creates some overhead.
Ah, I see. In that case, it may be worthwhile to implement the CAS instruction in terms of STM as well and measure the performance difference this makes for the final data structure. After all, STM is a lot more compositional than CAS, so I'd like to know whether the loss of expressiveness is worth the gain in performance.
Best regards, Heinrich Apfelmus
-- http://apfelmus.nfshost.com

Ben
perhaps it is too late to suggest things for GSOC --
but stephen tetley on a different thread pointed at aaron turon's work, which there's a very interesting new concurrency framework he calls "reagents" which seems to give the best of all worlds : it is declarative and compositional like STM, but gives performance akin to hand-coded lock-free data structures. he seems to have straddled the duality of isolation vs message-passing nicely, and can subsume things like actors and the join calculus.
http://www.ccs.neu.edu/home/turon/reagents.pdf
he has a BSD licensed library in scala at
https://github.com/aturon/ChemistrySet
if someone doesn't want to pick this up for GSOC i might have a hand at implementing it myself.
Keep us in the loop if you do. I have a very nice application that has been needing a nicer approach to concurrency than IORefs but really can't afford STM.

Cheers,

- Ben

+1 -- the reagents model is interesting and it would be good to see a Haskell implementation.
On Thu, Apr 5, 2012 at 3:05 PM, Ben Gamari
Ben writes:
perhaps it is too late to suggest things for GSOC --
but stephen tetley on a different thread pointed at aaron turon's work, which there's a very interesting new concurrency framework he calls "reagents" which seems to give the best of all worlds : it is declarative and compositional like STM, but gives performance akin to hand-coded lock-free data structures. he seems to have straddled the duality of isolation vs message-passing nicely, and can subsume things like actors and the join calculus.
http://www.ccs.neu.edu/home/turon/reagents.pdf
he has a BSD licensed library in scala at
https://github.com/aturon/ChemistrySet
if someone doesn't want to pick this up for GSOC i might have a hand at implementing it myself.
Keep us in the loop if you do. I have a very nice application that has been needing a nicer approach to concurrency than IORefs but really can't afford STM.
Cheers,
- Ben

Ben wrote:
perhaps it is too late to suggest things for GSOC --
but stephen tetley on a different thread pointed at aaron turon's work, which there's a very interesting new concurrency framework he calls "reagents" which seems to give the best of all worlds : it is declarative and compositional like STM, but gives performance akin to hand-coded lock-free data structures. he seems to have straddled the duality of isolation vs message-passing nicely, and can subsume things like actors and the join calculus.
http://www.ccs.neu.edu/home/turon/reagents.pdf
he has a BSD licensed library in scala at
https://github.com/aturon/ChemistrySet
if someone doesn't want to pick this up for GSOC i might have a hand at implementing it myself.
That looks great! While I didn't take the time to understand the concurrency model in detail, the overall idea is to use arrows that can be run atomically

runAtomically :: Reagent a b -> (a -> IO b)

This is very similar to STM: combining computations within the monad/arrow is atomic while combining computations "outside" the monad/arrow can interleave them.

runAtomically (f . g)               -- atomic
runAtomically f . runAtomically g   -- interleaving

Actually, it turns out that the Reagent arrow is also a monad, but the author seems to claim that the "static" arrow style enables certain optimizations. I haven't checked his model in detail to see whether this is really the case and how exactly it differs from STM, but we know that situations like this happen for parser combinators. Maybe it's enough to recast reagents as an applicative functor?

To summarize: the way I understand it is that it's apparently possible to improve the STM monad by turning it into an arrow. (I refer to "STM" in a very liberal sense here: whether memory is transactional or not is unimportant, the only thing that matters is a computation that composes atomically.)

Best regards,
Heinrich Apfelmus

-- http://apfelmus.nfshost.com

heinrich and all --

thanks for the illuminating comments, as usual. i've had a little bit of time to play around with this and here's what i've concluded (hopefully i'm not mistaken.)

1 - while composability makes STM a great silver bullet, there are other composable lower level paradigms. aaron has identified a few fundamental algorithms that appear to be composable, and used a lot in concurrent algorithms / data structures: k-CAS, exponential backoff, helping and elimination.

2 - k-CAS (and more generally k-RMW) is composable. i'm not exactly sure if i'd call k-CAS monadic but it is at least applicative (i'm not sure what the exact definition of k-CAS is. a bunch of 1-CASs done atomically seems applicative; allowing them to interact seems monadic. certainly k-RMW is monadic.) while it is possible to implement STM on top of k-CAS and vice versa, k-CAS can get you closer to the metal, and if you can special case 1-CAS to hardware you will win on a lot of benchmarks. a lot of concurrent algorithms only need 1-CAS.

3 - backoff, elimination and helping help scalability a lot, so you want support for them. elimination and helping require communication between transactions, whereas STM is all about isolation, so reagents are fundamentally different in this regard.

reagents as implemented in the paper are not entirely monadic (by accident, i think the author intended them to be.) as far as i can see he uses an applicative form of k-CAS, and the reagents on top of it are applicative: his "computed" combinator (monadic bind) does not allow post-composition (it has no continuation.) there is no reason why it could not be built on top of a monadic k-RMW and be fully monadic. however he is able to recognize and special-case 1-CAS, which gives great performance of course.

however, this does bring up a general question: why are applicative functors (often) faster than monads? malcolm wallace mentioned this is true for polyparse, and heinrich mentioned this is true more generally. is there a yoga by which one can write monadic functors which have a specialized, faster applicative implementation?

right now i'm reading up on k-CAS / k-RMW implementations, and i think i'm going to start writing that monad before moving on to elimination / helping. i'm finding two things difficult:

- it is hard to represent a unified transaction log because of heterogeneous types (a small illustrative sketch of this point follows the quoted message below), and
- allowing a fully monadic interface makes it hard for me to special case 1-CAS (or applicative k-CAS.)

there are workarounds for the first issue (separate logs for each reference, or a global clock as in Transactional Locking II.) for the second, right now i'm wondering if i'm going to have to write a compiler for a little DSL; i'd like to be able to exploit applicative performance gains generally, and special case 1-CAS.

best, ben

On Apr 6, 2012, at 5:38 AM, Heinrich Apfelmus wrote:
Ben wrote:
perhaps it is too late to suggest things for GSOC -- but stephen tetley on a different thread pointed at aaron turon's work, which there's a very interesting new concurrency framework he calls "reagents" which seems to give the best of all worlds : it is declarative and compositional like STM, but gives performance akin to hand-coded lock-free data structures. he seems to have straddled the duality of isolation vs message-passing nicely, and can subsume things like actors and the join calculus. http://www.ccs.neu.edu/home/turon/reagents.pdf he has a BSD licensed library in scala at https://github.com/aturon/ChemistrySet if someone doesn't want to pick this up for GSOC i might have a hand at implementing it myself.
That looks great! While I didn't take the time to understand the concurrency model in detail, the overall idea is to use arrows that can be run atomically
runAtomically :: Reagent a b -> (a -> IO b)
This is very similar to STM: combining computations within the monad/arrow is atomic while combining computations "outside" the monad/arrow can interleave them.
runAtomically (f . g) -- atomic runAtomically f . runAtomically g -- interleaving
Actually, it turns out that the Reagent arrow is also a monad, but the author seems to claim that the "static" arrow style enables certain optimizations. I haven't checked his model in detail to see whether this is really the case and how exactly it differs from STM, but we know that situations like this happen for parser combinators. Maybe it's enough to recast reagents as an applicative functor?
To summarize: the way I understand it is that it's apparently possible to improve the STM monad by turning it into an arrow. (I refer to "STM" in a very liberal sense here: whether memory is transactional or not is unimportant, the only thing that matters is a computation that composes atomically.)
Best regards, Heinrich Apfelmus
-- http://apfelmus.nfshost.com
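
as a concrete illustration of the heterogeneous-log difficulty mentioned above: one standard way to keep differently-typed CAS entries in a single log is to hide the element type behind an existential (a minimal sketch of my own, not from any existing implementation):

  {-# LANGUAGE ExistentialQuantification #-}
  import Data.IORef (IORef)

  -- one log entry: the reference plus the expected and new values; the
  -- element type is hidden, so entries of different types can share one list
  data CASEntry = forall a. CASEntry (IORef a) a a

  type TxLog = [CASEntry]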

Think of the differences (and similarities) of Applicative Functors and Monads and the extra context that monads carry around.

--
Regards,
KC

i'm not sure what your email is pointing at. if it is unclear, i understand the difference between applicative and monadic. i suppose the easy answer to why applicative can be faster than monadic is that you can give a more specialized instance declaration. i was just wondering if there was a way to make a monad recognize when it is being used applicatively, but that is probably hard in general. b On Apr 20, 2012, at 2:54 PM, KC wrote:
Think of the differences (and similarities) of Applicative Functors and Monads and the extra context that monads carry around.
-- -- Regards, KC

Sorry, I thought you or someone was asking why Applicative Functors are faster in general than Monads.

Functional programming is structured function calling to achieve a result where the functions can be evaluated in an unspecified order; I thought Applicative Functors had the same unspecified evaluation order, whereas Monads could carry some sequencing of computations, which has the extra overhead of continuation passing.

Do I have that correct?
On Fri, Apr 20, 2012 at 4:05 PM, Ben
i'm not sure what your email is pointing at. if it is unclear, i understand the difference between applicative and monadic. i suppose the easy answer to why applicative can be faster than monadic is that you can give a more specialized instance declaration. i was just wondering if there was a way to make a monad recognize when it is being used applicatively, but that is probably hard in general.
b
On Apr 20, 2012, at 2:54 PM, KC wrote:
Think of the differences (and similarities) of Applicative Functors and Monads and the extra context that monads carry around.
-- -- Regards, KC
-- -- Regards, KC

the sequencing matters for applicative functors. from McBride and Paterson [1]:

"The idea is that 'pure' embeds pure computations into the pure fragment of an effectful world -- the resulting computations may thus be shunted around freely, as long as the order of the genuinely effectful computations is preserved."

it is interesting to note that sequencing only matters a little for k-CAS: you just have to read before you write, but you can do the reads and writes in any order (as long as it is ultimately atomic.)

b

[1] McBride C, Paterson R. "Applicative programming with effects." Journal of Functional Programming 18:1 (2008), pages 1-13.

On Apr 20, 2012, at 4:41 PM, KC wrote:
Sorry, I thought you or someone was asking why are Applicative Functors faster in general than Monads.
Functional programming is structured function calling to achieve a result where the functions can be evaluated in an unspecified order; I thought Applicative Functors had the same unspecified evaluation order; whereas, Monads could carry some sequencing of computations which has the extra overhead of continuation passing.
Do I have that correct?
On Fri, Apr 20, 2012 at 4:05 PM, Ben wrote:
i'm not sure what your email is pointing at. if it is unclear, i understand the difference between applicative and monadic. i suppose the easy answer to why applicative can be faster than monadic is that you can give a more specialized instance declaration. i was just wondering if there was a way to make a monad recognize when it is being used applicatively, but that is probably hard in general. b
On Apr 20, 2012, at 2:54 PM, KC wrote:
Think of the differences (and similarities) of Applicative Functors and Monads and the extra context that monads carry around.
-- -- Regards, KC
-- -- Regards, KC

Ben wrote:
however, this does bring up a general question : why are applicative functors (often) faster than monads? malcolm wallace mentioned this is true for polyparse, and heinrich mentioned this is true more generally. is there a yoga by which one can write monadic functors which have a specialized, faster applicative implementation?
I'm not knowledgeable enough on the multicore stuff, but I can comment on the monad vs applicative issue.

Applicative functors are not per se faster than monads, it's just that the former can encode static analysis while the latter can't. As you can see from the type of "bind"

(>>=) :: m a -> (a -> m b) -> m b

the structure of the computation in the second argument, i.e. its various side effects, can depend on a value of type a, which is only available at "run-time". In contrast, the type of "apply"

(<*>) :: m (a -> b) -> m a -> m b

makes it clear that the side effects are the same, no matter what the value of a will be at run-time. In other words, the structure of the computation is known statically.

For parsers, interesting analyses are

* Does a parser accept the empty set?
* What are the first characters that a parser can accept?

The answers can be obtained easily enough from an applicative functor, for instance

acceptsEmpty (pure x)  = True
acceptsEmpty (f <*> g) = acceptsEmpty f && acceptsEmpty g

But the corresponding analysis for monadic parsers is either harder or hopelessly inefficient because we don't know the structure of the parser until we run it on some input.

See also this answer on StackOverflow: http://stackoverflow.com/a/7863380/403805

Best regards,
Heinrich Apfelmus

-- http://apfelmus.nfshost.com
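
As a tiny self-contained illustration (mine, not polyparse): a parser type that carries its "accepts empty" flag as plain static data, so the analysis above never needs to run the parser. A Monad instance could not keep this field accurate, because the structure to the right of (>>=) depends on a run-time value.

  -- the static flag is computed from the structure alone
  data P a = P { acceptsEmpty :: Bool
               , runP         :: String -> Maybe (a, String) }

  instance Functor P where
    fmap f (P e run) = P e (\s -> fmap (\(x, rest) -> (f x, rest)) (run s))

  instance Applicative P where
    pure x = P True (\s -> Just (x, s))
    P ef runF <*> P ex runX = P (ef && ex) $ \s -> do
      (f, s')  <- runF s
      (x, s'') <- runX s'
      Just (f x, s'')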

On Thu, Mar 29, 2012 at 6:57 AM, Ryan Newton
The ByteArray versions will be more annoying, requiring more variations, but they are also less essential, because the user can always use ForeignPtr and bits-atomic in this case, and I believe for our concurrent data structures we want to store arbitrary pointers (hence casArray#).
This is true, although using bits-atomic does a function call (i.e. the calls are not inlined), which would be pretty bad for performance.
G
--
Gregory Collins

On Thu, Mar 29, 2012 at 9:01 AM, Gregory Collins
On Thu, Mar 29, 2012 at 6:57 AM, Ryan Newton wrote:
The ByteArray versions will be more annoying, requiring more variations, but they are also less essential, because the user can always use ForeignPtr and bits-atomic in this case, and I believe for our concurrent data structures we want to store arbitrary pointers (hence casArray#).
This is true, although using bits-atomic does a function call (i.e the calls are not inlined), which would be pretty bad for performance.
Yes, absolutely... I'd like to add the byte array versions. Actually, those don't have GC write barriers so they should be much easier to get right!

Florian Hartwig wrote:
Hi everyone, I would like to do the GSoC project outlined in http://hackage.haskell.org/trac/summer-of-code/ticket/1608
One of Haskell's great strengths is its support for all kinds of concurrent and parallel programming, so I think that the Haskell ecosystem would benefit from having a wider range of efficient concurrent data structures. Also, more selfishly, I remember reading the article in CACM last summer and thinking that the whole idea of updating data structures directly using atomic compare-and-swap was really cool, so I would love to have an excuse to play around with it.
Some (initial) ideas for lock-free data structures worth implementing: 1. A lock-free priority queue, e.g. using the approach based on skip-lists explained in [2] 2. Stacks, see [1] and [3] 3. Hash tables [4] but if anyone has other suggestions, I would obviously be happy to hear them.
Since I don't know much about concurrency, I have a simple question: what is the difference between atomic compare-and-swap and software transactional memory? Both are lock-free? Is it possible to implement every data structure based on CAS in terms of STM? What are the drawbacks? The other way round? Best regards, Heinrich Apfelmus -- http://apfelmus.nfshost.com

Though of course you can implement CAS in terms of STM, CAS is much more low-level and will probably be many times (though not asymptotically) faster.
On Thu, Mar 29, 2012 at 12:01 PM, Heinrich Apfelmus
Florian Hartwig wrote:
Hi everyone, I would like to do the GSoC project outlined in http://hackage.haskell.org/trac/summer-of-code/ticket/1608
One of Haskell's great strengths is its support for all kinds of concurrent and parallel programming, so I think that the Haskell ecosystem would benefit from having a wider range of efficient concurrent data structures. Also, more selfishly, I remember reading the article in CACM last summer and thinking that the whole idea of updating data structures directly using atomic compare-and-swap was really cool, so I would love to have an excuse to play around with it.
Some (initial) ideas for lock-free data structures worth implementing: 1. A lock-free priority queue, e.g. using the approach based on skip-lists explained in [2] 2. Stacks, see [1] and [3] 3. Hash tables [4] but if anyone has other suggestions, I would obviously be happy to hear them.
Since I don't know much about concurrency, I have a simple question: what is the difference between atomic compare-and-swap and software transactional memory? Both are lock-free? Is it possible to implement every data structure based on CAS in terms of STM? What are the drawbacks? The other way round?
Best regards, Heinrich Apfelmus
-- http://apfelmus.nfshost.com
--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/
participants (10)
- Ben
- Ben Gamari
- Chris Smith
- Eugene Kirpichov
- Florian Hartwig
- Gregory Collins
- Heinrich Apfelmus
- KC
- Ryan Newton
- Tim Harris (RESEARCH)