Best ways to achieve throughput for a large M:N ratio of STM threads, with hot TVar updates?

Hello Cafe,

I'm working on an in-memory database. In Client/Server mode I let each connected client submit remote procedure calls that run in a dedicated lightweight thread, modifying TVars in RAM per its business needs. When many clients are connected concurrently and try to insert new data, and they trigger an update of a global index (some TVar), throughput drops drastically. I reduced the shared state to a single Int counter in a TVar and got the same symptom. Parallelism feels okay when there is no hot TVar conflicting, or when M is not much greater than N.

As an empirical test workload, I have a `+RTS -N10` server process. It handles 10 concurrent clients okay, getting ~5x single-thread throughput; but with 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, and throughput seems even worse than a single thread. More clients can even drive it into thrashing without much progress.

After reading [1] I can understand that pure STM doesn't scale well, and I see it suggests [7] as attractive and planned future work in that direction. But I can't find libraries or frameworks addressing the large-M-over-small-N scenario; [1] experimented with a designated N of parallelism, and [7] is rather theoretical for my empirical needs.

Can you direct me to an available library implementing the methodology proposed in [7], or to other ways of tackling this problem? I think the most difficult requirement is that a transaction should commit with global indices (possibly with unique constraints) updated atomically, and roll back on any constraint violation, i.e. transactions have to cover global state such as indices. The other problems seem more trivial than this one.

Specifically, [7] states:
It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.
I wonder whether any STM-based library has simplified those techniques so they can be composed right away? I don't really want to implement those mechanisms myself, rebuilding many wheels from scratch.

Best regards,
Compl

[1] Comparing the performance of concurrent linked-list implementations in Haskell. https://simonmar.github.io/bib/papers/concurrent-data.pdf

[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP '08, pages 207-216. ACM Press, 2008. https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
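For concreteness, here is a minimal sketch (not from the original post) of the hot-TVar pattern being described: M client threads all funnel their commits through one shared counter TVar, roughly the shape of the reduced test case above. All names and sizes are illustrative.

```
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM, forM_, replicateM_)

-- Illustrative reduction of the reported symptom: every transaction
-- writes the same hot TVar, so all commits conflict with each other and
-- throughput degrades as the number of client threads grows past the
-- capability count given by +RTS -N.
main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)     -- the single hot TVar
  let clients = 20                    -- M lightweight client threads
      insertsPerClient = 100000
  dones <- forM [1 .. clients] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      replicateM_ insertsPerClient $
        atomically $ modifyTVar' counter (+ 1)
      putMVar done ()
    pure done
  forM_ dones takeMVar
  readTVarIO counter >>= print
```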

While I can't contribute any Haskell knowledge, I know that many threads updating the same variable is the worst thing you can do: not only do you create a single bottleneck, but if your threads run on multiple cores you get CPU pipeline stalls, L1 cache line flushes, and/or complicated cache-coherency protocols executed between cores. That's not cheap: each of these mechanisms can take hundreds of CPU cycles, on a CPU that can execute multiple instructions per cycle. Incrementing a global counter is a really hard problem in multithreading...

I believe this is the reason why databases typically implement a SEQUENCE mechanism, and these sequences are usually implemented as "whenever a transaction asks for a sequence number, reserve a block of 1,000 numbers for it, so it can retrieve 999 additional numbers without the synchronization overhead".

This is also why real databases use transactions: they do not just isolate processes from each other's updates, they also allow the DB to let the transaction work on a snapshot and do all the synchronization once, during COMMIT. And, as you just discovered, it's one of the major optimization areas in database engines :-)

TL;DR for the bad news: I suspect your problem is just unavoidable.

However, I see a workaround: delayed index updates. Keep each index twice: last-known and next-future. last-known is what was created during the last index update. You need an append-only list of records that had an index field update, and all searches that use the index will also have to do a linear search in that list. next-future is built in the background: it takes last-known and the updates from the append-only list and generates a new index. Once next-future is finished, replace last-known with it.

You still need a global lock while replacing indexes, but you don't have to lock the index for every single update, only once per rebuild. You'll have to twiddle with parameters such as "at what point do I start a new index build", and you'll have to make sure that your linear list isn't yet another bottleneck. (There are lock-free data structures to achieve such a thing, but they are complicated; or you can tell application programmers to collect as many updates as possible in a transaction so the number of synchronization points is smaller. However, too-large transactions can overflow the CPU caches if the collected update data becomes too large, so there's a whole lot of tweaking, studying real performance data, and hopefully finding the right set of diagnostic information to collect so the DB can automatically choose the right point to do its updates, etc. pp.)

TL;DR for the good news: You can coalesce N updates into one and divide the CPU-core coordination overhead by a factor of N. You'll increase the bus pressure, so there's tons of fine tuning you can do (or avoid) after getting the first 90% of the speedup. (I'm drawing purely speculative numbers out of my hat here.)

Liability: You will want to add transactions and (likely) optimistic locking, if you don't have that already: transaction boundaries are the natural point for coalescing updates.

Regards, Jo
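To make the block-reserving SEQUENCE idea concrete in Haskell terms, here is a minimal sketch (my own illustration, not from the thread). `SeqBlock`, `nextId`, and the block size are made-up names and parameters; each worker thread is assumed to own its own `SeqBlock`.

```
import Control.Concurrent.STM
import Data.IORef

-- Hypothetical block-allocating sequence: the shared TVar is touched
-- only once per 1000 ids; the rest are served from worker-local IORefs.
data SeqBlock = SeqBlock
  { sbNext :: IORef Int  -- next id to hand out in the current block
  , sbEnd  :: IORef Int  -- first id past the current block
  }

blockSize :: Int
blockSize = 1000

newSeqBlock :: IO SeqBlock
newSeqBlock = SeqBlock <$> newIORef 0 <*> newIORef 0

-- One worker-local source of ids, backed by a shared global counter.
nextId :: TVar Int -> SeqBlock -> IO Int
nextId global blk = do
  n   <- readIORef (sbNext blk)
  end <- readIORef (sbEnd blk)
  if n < end
    then do writeIORef (sbNext blk) (n + 1); return n
    else do
      -- Block exhausted: reserve the next 1000 ids in one STM commit.
      start <- atomically $ do
        s <- readTVar global
        writeTVar global (s + blockSize)
        return s
      writeIORef (sbNext blk) (start + 1)
      writeIORef (sbEnd blk)  (start + blockSize)
      return start
```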

Hi Jo,

I think you are totally right about the situation, and I just want to make it clear that I have already chosen STM and lightweight threads as GHC implements them: STM for transactions and optimistic locking, lightweight threads for sophisticated, smart scheduling. The global counter is only used to reveal the technical traits of my situation; it is of course not a requirement of my business needs.

I'm not in a hurry for thorough performance optimization at the current stage (the PoC prototype isn't finished yet), as long as performance is reasonable, but the thrashing behavior really frightened me and I have to treat it as a serious concern for the time being. Fortunately it doesn't feel as scary as it first appeared, after taking others' suggestions. I'll experiment more with this information, which is new to me, and see what comes out.

Thanks with regards,
Compl

On 24.07.20 at 17:48, Compl Yue via Haskell-Cafe wrote:
> The global counter is only used to reveal the technical traits of my situation; it is of course not a requirement of my business needs.
Given the other discussion here, I'm not sure whether it's really relevant to your situation, but that stats counter could indeed be causing lock contention. Which means your numbers may be skewed and you may be drawing wrong conclusions - which is actually commonplace in benchmarking.

Two things you could do (a sketch of the second follows below):

1) Leave the global counter out and see whether the running times vary. There's still a chance that while the overall running time is the same, the code is now hitting a different bottleneck; or maybe the counter isn't the bottleneck yet but would become one once you have done the other optimizations. So that experiment is cheap, but gives you no more than a preliminary result.

2) Let each thread collect its own statistics, and coalesce them into the global counter only once in a while. (Vary the "once in a while" determination and see whether it changes anything.)

Just my 2c from the sideline.

Regards, Jo
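A minimal sketch of option 2), assuming one `LocalStats` per worker thread whose pending count is flushed into the shared TVar every `flushEvery` events; all names here are illustrative, not from the thread.

```
import Control.Concurrent.STM
import Control.Monad (when)
import Data.IORef

-- Illustrative per-thread statistics counter: increments hit a
-- thread-local IORef, and only every `flushEvery` increments touch
-- the shared TVar with a single commit.
data LocalStats = LocalStats
  { lsPending :: IORef Int  -- increments not yet published
  , lsGlobal  :: TVar Int   -- the shared, contended counter
  }

flushEvery :: Int
flushEvery = 1000

newLocalStats :: TVar Int -> IO LocalStats
newLocalStats global = LocalStats <$> newIORef 0 <*> pure global

bump :: LocalStats -> IO ()
bump ls = do
  modifyIORef' (lsPending ls) (+ 1)
  n <- readIORef (lsPending ls)
  when (n >= flushEvery) $ do
    writeIORef (lsPending ls) 0
    atomically $ modifyTVar' (lsGlobal ls) (+ n)   -- one commit per batch
```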

Hi Jo,

Thanks anyway, and FYI: the global counter originally served as a source of unique entity ids; I later replaced it with UUIDs from the uuid package, and it hasn't been a problem since.

Regards,
Compl
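For reference, generating ids that way needs no shared state at all. A minimal sketch using the uuid package's `Data.UUID.V4.nextRandom` (the surrounding name `newEntityId` is illustrative):

```
import qualified Data.UUID as UUID
import Data.UUID.V4 (nextRandom)

-- Illustrative entity-id generation: random (v4) UUIDs need no shared
-- counter, so concurrent inserts never contend on an id source.
newEntityId :: IO String
newEntityId = UUID.toString <$> nextRandom
```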

It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends."

You give up fairness when you use STM instead of MVars or equivalent structures. That means a long-running transaction might get stampeded by many small ones invalidating it over and over; the long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime, size, and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another, and optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.

The next step is sometimes figuring out whether you really need a data structure within a single STM container, or whether you can break your STM container boundaries up into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.

e.g. https://hackage.haskell.org/package/stm-containers https://hackage.haskell.org/package/ttrie

It also sounds a bit like your question bumps into Amdahl's Law a bit.

If all else fails, stop using STM and find something more tuned to your problem space.

Hope this helps,
Chris Allen

--
Chris Allen
Currently working on http://haskellbook.com
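To illustrate the container-boundary advice, here is a minimal sketch (mine, not from the thread) of a global unique index kept in an stm-containers Map, where inserts of unrelated keys touch disjoint internal TVars instead of one big `TVar (HashMap k v)`. `registerPath` is a made-up name, and the argument order assumes the stm-containers 1.x API.

```
import Control.Concurrent.STM
import Data.Text (Text)
import qualified StmContainers.Map as StmMap  -- from the stm-containers package

-- Illustrative global unique index: the trie's nodes carry their own
-- TVars, so inserts of unrelated keys don't conflict the way they do
-- with a single `TVar (HashMap k v)`.
type PathIndex = StmMap.Map Text Int

-- Made-up registration transaction: insert a path only if it is not
-- already present, reporting a uniqueness violation to the caller.
registerPath :: PathIndex -> Text -> Int -> STM Bool
registerPath idx path ref = do
  existing <- StmMap.lookup path idx
  case existing of
    Just _  -> return False                -- uniqueness violation
    Nothing -> do
      StmMap.insert ref path idx           -- value, then key (1.x API assumed)
      return True
```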

Thanks Chris,

I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance, provided I limit the server's parallelism to the hardware capabilities. That may be because I'm still prototyping the API and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, but only at low contention (surely there are plenty of CPU cycles to be optimized away in later steps).

I model my data after the graph model, so most data, and even most indices, are localized to nodes and edges and can be manipulated without conflict. That's why I assumed I had a low-contention use case from the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can implement a more scalable global index data structure; thanks for the hint.

So an evident solution comes to my mind now: run the server with a pool of tx-processing threads matching the number of CPU cores, and queue client RPC requests to be executed by some thread from the pool (a sketch follows below). But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and I see GHC as on par in doing this; I don't want to give up this enjoyable machinery.

But looking at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when there are M lightweight threads scheduled upon a small N of hardware capabilities: when a lightweight thread yields control during an stm transaction commit, the TVars it locked will stay locked until it's scheduled again (and again) until it can finish the commit. This way, descheduled threads could keep live threads from progressing. I haven't gone into more detail there, but I wonder whether GHC's RTS could be improved to keep an stm-committing thread from being descheduled - though seemingly that may impose more starvation potential - or whether stm could be improved so its TVar locks are preemptible when the owner trec/thread is descheduled? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.

Best regards,
Compl
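A minimal sketch of the tx-processing pool idea (not the actual server code): client connection handlers enqueue their transactions, and one worker per capability drains the queue, so at most N transactions run at once. `TxPool` and `submitTx` are made-up names.

```
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Illustrative tx-processing pool: client RPC requests are queued as
-- STM actions and executed by N worker threads, one per capability.
newtype TxPool = TxPool (TQueue (STM ()))

newTxPool :: IO TxPool
newTxPool = do
  q <- newTQueueIO
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically $ readTQueue q   -- block until a request arrives
    atomically job                     -- run the client's transaction
  return (TxPool q)

-- Called from each client's connection handler.
submitTx :: TxPool -> STM () -> IO ()
submitTx (TxPool q) job = atomically $ writeTQueue q job
```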

Hi Compl,

Having a pool of transaction-processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.

The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.

To get good performance from STM you must pay attention to which TVars are involved in a commit. All STM systems work under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do (a sketch of the contrast follows below). I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low-TVar-count case; right now it is optimized for good cache performance with a handful of TVars.

There is another way to play with performance: moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster, but it may be delaying work that then moves into another transaction. Forcing values at the right time can make a big difference.

Ryan
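A minimal sketch of the contrast described above, using a toy fixed-size bucket array as a stand-in for node-level TVars (illustrative only, not from the thread): writing through the coarse structure conflicts with every other access, while writing through a per-bucket TVar conflicts only with transactions touching the same bucket.

```
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM
import Data.Hashable (Hashable, hash)
import qualified Data.Vector as V

-- Coarse: one TVar guards the whole map, so any write conflicts with
-- every concurrent reader and writer of the map.
type Coarse k v = TVar (HM.HashMap k v)

insertCoarse :: (Eq k, Hashable k) => Coarse k v -> k -> v -> STM ()
insertCoarse tv k v = modifyTVar' tv (HM.insert k v)

-- Finer: a fixed array of per-bucket TVars (a toy stand-in for the
-- node-level TVars inside stm-containers/ttrie); writes to different
-- buckets commit without conflicting.
type Finer k v = V.Vector (TVar (HM.HashMap k v))

insertFiner :: (Eq k, Hashable k) => Finer k v -> k -> v -> STM ()
insertFiner buckets k v =
  modifyTVar' (buckets V.! (hash k `mod` V.length buckets)) (HM.insert k v)
```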

Thanks very much for the insightful information, Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:

> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.

So best effort has already been made in GHC, and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as: upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM` descheduled state, so the CPUs can only stay idle since not so many threads are willing to proceed?

Anyway, I see light with better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into a `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like those in Python 3.7+); from there it looks very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention (a sketch of this layout follows below).

Thanks with regards,
Compl
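A minimal sketch of that insertion-order-preserving layout, as I understand it from the description above (the `OrderedDict` name and helper are illustrative, not the actual code):

```
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM
import Data.Hashable (Hashable)
import qualified Data.Vector as V

-- Illustrative insertion-order-preserving dict: the HashMap maps a key
-- to a slot index, and the Vector of per-slot TVars keeps entries in
-- insertion order (Nothing marks a deleted slot).
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))
  }

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD d k v = do
  idx <- readTVar (odIndex d)
  case HM.lookup k idx of
    Just i  -> do                        -- existing key: update in place
      slots <- readTVar (odSlots d)
      writeTVar (slots V.! i) (Just v)
    Nothing -> do                        -- new key: append a slot
      slot  <- newTVar (Just v)
      slots <- readTVar (odSlots d)
      writeTVar (odSlots d) (V.snoc slots slot)
      writeTVar (odIndex d) (HM.insert k (V.length slots) idx)
```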

> Then to explain the low CPU utilization (~10%), am I right to understand it as: upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM` descheduled state, so the CPUs can only stay idle since not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`; then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if their write set overlaps.

I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.

Ryan
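To illustrate the `retry` behavior described above (suspension only happens through a committed `retry`, not through lock spinning), here is a small self-contained sketch of my own: the main thread blocks on `retry` until the forked writer's commit touches the TVar in its read set.

```
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

-- Small illustration of suspension via `retry`: the reader commits a
-- transaction that executed `retry`, is parked on the TVar's wakeup
-- list, and is woken when the writer's commit updates that TVar.
main :: IO ()
main = do
  flag <- newTVarIO False
  _ <- forkIO $ do
    threadDelay 1000000                  -- 1s: let the reader block first
    atomically $ writeTVar flag True     -- this commit wakes the reader
  atomically $ do
    v <- readTVar flag
    if v then return () else retry       -- blocks (no busy spinning) until woken
  putStrLn "woken by the writer's commit"
```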

I'm not familiar with profiling GHC yet; I may need more time to get proficient with it.

And a bit more detail of my test workload for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files under a data dir within a shared filesystem, then mmapping those data files and filling in the actual array data. So the db server doesn't have much computation to perform, but it puts the data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one metadata record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another, from a queue that never falls empty.

So if `readTVar` spin-waits, I doubt the threads would actually show high CPU utilization, because at any instant all threads except the committing one would be doing just that one thing.

And I have something in my code to track STM retries explicitly, like this:

```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
    (rtc > 0) $ do
      -- trace out the retries so the end users can be aware of them
      tid <- myThreadId
      trace
        ( "🔙\n" <> show callCtx <> "🌀 " <> show tid
            <> " stm retry #" <> show rtc
        )
        $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing ->
      -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```

No such trace msg fires during my test, neither in a single-thread run nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases: e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out - if these two ops are composed into a single tx, then of course it's an infinite retry loop, and a sequence of such msgs gets logged with an ever-increasing rtc #.

So I believe no retry has ever been triggered. What can be going on there?
> Then to explain the low CPU utilization (~10%), am I right to understand that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM`, descheduled state, so the CPUs can only stay idle because not many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps.
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue@icloud.com> wrote: Thanks very much for the insightful information Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:
> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM`, descheduled state, so the CPUs can only stay idle because not many threads are willing to proceed?
Anyway, I see light in using better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, with the `Int` being an array index into a `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like Python 3.7+), so it looks very hopeful to incorporate stm-containers' Map or ttrie to approach contention-free operation.
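A minimal sketch of the two-level layout described above (not Compl's actual code; `OrderedDict`, `insertOD` and `lookupOD` are made-up names, and deletion/resizing are ignored) might look like this:

```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Two-level dict: a spine mapping keys to slot numbers (in insertion order),
-- and one TVar per entry, so writing an existing entry touches only its cell
-- (it still reads the spine).
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)           -- key -> slot number
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))  -- slot number -> entry cell
  }

newOrderedDict :: STM (OrderedDict k v)
newOrderedDict = OrderedDict <$> newTVar HM.empty <*> newTVar V.empty

insertOD :: (Eq k, Hashable k) => k -> v -> OrderedDict k v -> STM ()
insertOD k v d = do
  idx <- readTVar (odIndex d)
  case HM.lookup k idx of
    Just slot -> do                     -- existing key: write only its cell
      slots <- readTVar (odSlots d)
      writeTVar (slots V.! slot) (Just v)
    Nothing -> do                       -- new key: extend both levels
      slots <- readTVar (odSlots d)
      cell  <- newTVar (Just v)
      writeTVar (odSlots d) (V.snoc slots cell)
      writeTVar (odIndex d) (HM.insert k (V.length slots) idx)

lookupOD :: (Eq k, Hashable k) => k -> OrderedDict k v -> STM (Maybe v)
lookupOD k d = do
  idx <- readTVar (odIndex d)
  case HM.lookup k idx of
    Nothing   -> pure Nothing
    Just slot -> readTVar . (V.! slot) =<< readTVar (odSlots d)
```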
Thanks with regards,
Compl
On 2020/7/24 下午10:03, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
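A tiny illustration of that false-conflict point, assuming a simple per-key counter (the names `bumpCoarse` and `bumpFine` are made up, and this is a sketch rather than a recommendation of the exact structure):

```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Coarse-grained: every writer replaces the whole map, so any two writers
-- conflict even when they touch unrelated keys.
type Coarse k = TVar (HM.HashMap k Int)

bumpCoarse :: (Eq k, Hashable k) => Coarse k -> k -> STM ()
bumpCoarse var k = modifyTVar' var (HM.insertWith (+) k 1)

-- Finer-grained: the spine is written only when a key is first added, so
-- writers bumping existing, distinct keys commit without write conflicts
-- (they still read the spine).
type Fine k = TVar (HM.HashMap k (TVar Int))

bumpFine :: (Eq k, Hashable k) => Fine k -> k -> STM ()
bumpFine var k = do
  m <- readTVar var
  case HM.lookup k m of
    Just cell -> modifyTVar' cell (+ 1)        -- no spine write
    Nothing   -> do
      cell <- newTVar 1
      writeTVar var (HM.insert k cell m)       -- spine write only on first use
```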
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
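A small sketch of that last point, assuming the stored value is NFData and with `publish` as a made-up helper name:

```
import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Force the expensive value before entering STM, so the transaction body is
-- just a cheap write and the window for conflicts stays small.
publish :: NFData a => TVar a -> a -> IO ()
publish var x = do
  x' <- evaluate (force x)        -- pay the evaluation cost outside the tx
  atomically (writeTVar var x')   -- short body, quick to commit
```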
Ryan
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote: Thanks Chris, I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance, provided I limit the server's parallelism to within the hardware capabilities. That may be because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, though only at low contention (surely there are plenty of CPU cycles to be optimized out in the next steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges and can be manipulated without conflict; that's why I assumed I had a low-contention use case from the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can implement a more scalable global index data structure; thanks for the hint.
So an evident solution that comes to mind now is to run the server with a pool of tx-processing threads matching the number of CPU cores, with client RPC requests queued to be executed on some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had good results with Go in this regard, and see GHC on par in doing this, so I don't want to give up this enjoyable machinery.
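A rough sketch of such a pool (hypothetical names `startTxPool` and `submitTx`; a real version would hand results back to clients, e.g. via a TMVar per job, and handle exceptions):

```
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- N long-lived workers drain a queue of STM jobs, so at most N transactions
-- are in flight regardless of how many client threads enqueue work.
startTxPool :: Int -> IO (TQueue (STM ()))
startTxPool nWorkers = do
  q <- newTQueueIO
  replicateM_ nWorkers . forkIO . forever $ do
    job <- atomically (readTQueue q)
    atomically job
  pure q

-- Clients enqueue work instead of running the transaction themselves.
submitTx :: TQueue (STM ()) -> STM () -> IO ()
submitTx q job = atomically (writeTQueue q job)
```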
But looking at the stm implementation in GHC, it seems written TVars are exclusively locked during the commit of a tx. I suspect this is the culprit when a large number M of lightweight threads is scheduled onto a small number N of hardware capabilities: when a lightweight thread yields control during an stm transaction commit, the TVars it locked will stay locked until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could keep live threads from progressing. I haven't gone into more detail there, but I wonder if there could be some improvement in the GHC RTS to keep an stm-committing thread from being descheduled (though seemingly that may impose more starvation potential), or if stm could be improved to have its TVar locks preemptible when the owner trec/thread is in a descheduled state? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.
Best regards,
Compl
On 2020/7/24 上午12:57, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.
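As an illustrative sketch of that zoning idea, applied to the hot-counter case from the original post (the name `ShardedCounter` and the shard-choice heuristic are arbitrary, not from anyone's actual code):

```
import Control.Concurrent (myThreadId)
import Control.Concurrent.STM
import Data.Hashable (hash)
import qualified Data.Vector as V

-- One hot counter split into per-zone TVars: most increments land on
-- different zones and commit without conflicting; reads sum all zones.
newtype ShardedCounter = ShardedCounter (V.Vector (TVar Int))

newShardedCounter :: Int -> IO ShardedCounter
newShardedCounter n = ShardedCounter <$> V.replicateM n (newTVarIO 0)

incr :: ShardedCounter -> IO ()
incr (ShardedCounter shards) = do
  tid <- myThreadId                                  -- arbitrary shard choice
  let i = hash (show tid) `mod` V.length shards
  atomically (modifyTVar' (shards V.! i) (+ 1))

total :: ShardedCounter -> STM Int
total (ShardedCounter shards) = sum <$> mapM readTVar (V.toList shards)
```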
e.g. https://hackage.haskell.org/package/stm-containers https://hackage.haskell.org/package/ttrie
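For example, a global unique index over data-file paths might be sketched on top of StmContainers.Map roughly like this (made-up names `PathIndex` and `registerPath`; note that stm-containers' `insert` takes the value before the key, per its documentation):

```
{-# LANGUAGE OverloadedStrings #-}
import Control.Concurrent.STM (STM)
import Data.Text (Text)
import qualified StmContainers.Map as StmMap

-- Hypothetical global unique index: data-file path -> record id. The
-- uniqueness check and the insert live in the same STM transaction, so they
-- commit or roll back together with the rest of the caller's tx.
type PathIndex = StmMap.Map Text Int

registerPath :: PathIndex -> Text -> Int -> STM (Either Text ())
registerPath idx path rid = do
  existing <- StmMap.lookup path idx
  case existing of
    Just _  -> pure (Left ("path already registered: " <> path))
    Nothing -> do
      StmMap.insert rid path idx   -- value first, then key
      pure (Right ())
```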
It also sounds like your question bumps into Amdahl's Law a bit.
All else fails, stop using STM and find something more tuned to your problem space.
Hope this helps, Chris Allen

To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up.

[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)

The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )

[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275

All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.

The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:

https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123

Ryan

Shame on me, I have no experience with `perf` either; I'd better learn these essential tools soon and put them to good use. It's great to learn about how `orElse` actually works - I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does little useful work at excessive cost. I put it there to help me understand the internal workings of stm, and now I've got even better knowledge ;-)

I think a debugger will trap every single abort - isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?

Thanks with best regards, Compl
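One crude user-level approximation, not an official GHC facility and relying on `unsafeIOToSTM`, so only suitable as a debugging hack: IO performed via `unsafeIOToSTM` is not undone when an attempt aborts, so bumping a plain IORef at the start of the transaction body counts attempts, and attempts minus successful commits roughly approximates aborts (`atomicallyCounted` is a made-up name):

```
import Control.Concurrent.STM (STM, atomically)
import Data.IORef (IORef, atomicModifyIORef')
import GHC.Conc (unsafeIOToSTM)

-- Bump a plain IORef on every attempt; unlike TVar writes, this IO effect
-- survives an aborted attempt, so the counter records every (re-)execution
-- of the transaction body.
atomicallyCounted :: IORef Int -> STM a -> IO a
atomicallyCounted attempts tx =
  atomically $ do
    unsafeIOToSTM (atomicModifyIORef' attempts (\n -> (n + 1, ())))
    tx
```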

Dear Cafe,

As Chris Allen suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for stm tx processing under heavy concurrency, while still getting the automatic parallelism as GHC implements them. Then I realized that in addition to the hash map (used to implement dicts and scopes), I also need a TreeMap-replacement data structure to implement the db index. I've been focusing on the uniqueness-constraint aspect, but it is still an index and needs to provide a range-scan api for db clients, so a hash map is not sufficient for the index.

I see Ryan shared code benchmarking an RBTree with stm in mind: https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre... but I can't find a conclusion or interpretation of that benchmark suite.

And here's a followup question: are there any STM contention-optimized data structures that keep keys ordered, with a sub-range traversal api? (Production-ready libraries are of course most desirable.)

Thanks with regards,

Compl
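There may not be a production-ready, contention-optimized ordered STM map to point to here, but a baseline sketch in the spirit of Ryan's earlier advice would keep the ordered spine in one TVar and push the values behind their own TVars, so updates to existing entries stay off the spine (`OrdIndex`, `insertKey` and `rangeScan` are made-up names):

```
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Ordered spine in one TVar, values behind their own TVars: range scans and
-- new-key inserts touch the spine, updates to existing keys touch one cell.
type OrdIndex k v = TVar (M.Map k (TVar v))

insertKey :: Ord k => OrdIndex k v -> k -> v -> STM ()
insertKey idx k v = do
  m <- readTVar idx
  case M.lookup k m of
    Just cell -> writeTVar cell v                -- existing key: cell only
    Nothing   -> do
      cell <- newTVar v
      writeTVar idx (M.insert k cell m)          -- new key: spine write

-- Inclusive range scan [lo, hi]; a real implementation would use M.split to
-- avoid walking the whole map.
rangeScan :: Ord k => OrdIndex k v -> k -> k -> STM [(k, v)]
rangeScan idx lo hi = do
  m <- readTVar idx
  let slice = M.filterWithKey (\k _ -> k >= lo && k <= hi) m
  mapM (\(k, cell) -> (,) k <$> readTVar cell) (M.toAscList slice)
```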
Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use.
It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-)
I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 上午2:02, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
Ryan
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue
mailto:compl.yue@icloud.com> wrote: I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it.
And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty.
So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing.
And I have something in my code to track STM retry like this:
```
-- blocking wait not expected, track stm retries explicitly trackSTM:: Int-> IO(Either() a) trackSTM !rtc = do when -- todo increase the threshold of reporting? (rtc > 0) $ do -- trace out the retries so the end users can be aware of them tid <- myThreadId trace ( "🔙\n" <> show callCtx <> "🌀 " <> show tid <> " stm retry #" <> show rtc ) $ return () atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case Nothing -> -- stm failed, do a tracked retry trackSTM (rtc + 1) Just ... -> ...
```
No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #.
So I believe no retry has ever been triggered.
What can going on there?
On 2020/7/24 下午11:46, Ryan Yates wrote:
> Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps.
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue
mailto:compl.yue@icloud.com> wrote: Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler:
> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention.
Thanks with regards,
Compl
On 2020/7/24 下午10:03, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
Ryan
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe
mailto:haskell-cafe@haskell.org> wrote: Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me.
So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery.
But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well.
Best regards,
Compl
On 2020/7/24 上午12:57, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.
e.g. https://hackage.haskell.org/package/stm-containers https://hackage.haskell.org/package/ttrie
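As a rough sketch of that zoning idea (the shard count and hashing scheme here are arbitrary assumptions, not a recommendation):
```
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Hypothetical sketch: split one hot `TVar (HashMap k v)` into a fixed
-- vector of shard TVars, so writers touching different shards commit
-- without conflicting with each other.
newtype Sharded k v = Sharded (V.Vector (TVar (HM.HashMap k v)))

newSharded :: Int -> IO (Sharded k v)
newSharded n = Sharded <$> V.replicateM n (newTVarIO HM.empty)

shardFor :: Hashable k => Sharded k v -> k -> TVar (HM.HashMap k v)
shardFor (Sharded shards) k = shards V.! (hash k `mod` V.length shards)

insertSharded :: (Eq k, Hashable k) => Sharded k v -> k -> v -> STM ()
insertSharded s k v = modifyTVar' (shardFor s k) (HM.insert k v)
```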
It also sounds a bit like your question bumps into Amdahl's Law.
All else fails, stop using STM and find something more tuned to your problem space.
Hope this helps, Chris Allen

Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19) https://dl.acm.org/doi/10.1145/3293883.3295711
Or my thesis: https://urresearch.rochester.edu/institutionalPublicationPublicView.action?i...
The PPoPP benchmarks are on a branch (or the releases tab on github): https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benc...
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to the hash map (used to implement dicts and scopes), I also need to find a TreeMap-like replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index and needs to provide a range scan api for db clients, so a hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre...
But I can't find conclusions or an interpretation of that benchmark suite. And here's a followup question:
Are there some STM contention-optimized data structures that keep keys ordered, with a sub-range traversal api?
(of course production ready libraries most desirable)
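In case nothing ready-made exists, one workaround I can imagine (purely a sketch with made-up names, not a production design) is to partition an ordered Map into a fixed set of TVar buckets by a coarse, order-preserving key mapping, so inserts into different ranges rarely conflict and a range scan only reads the overlapping buckets:
```
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Hypothetical sketch: a coarsely partitioned ordered index. `bucketOf`
-- maps a key to one of a fixed set of buckets and must be monotone in the
-- key order; inserts into different buckets never conflict, and a range
-- scan only reads the buckets overlapping the query interval (so it still
-- conflicts with writers in those buckets, but not with the rest).
data RangedIndex k v = RangedIndex
  { bucketOf :: k -> Int
  , buckets  :: M.Map Int (TVar (M.Map k v))
  }

insertRI :: Ord k => RangedIndex k v -> k -> v -> STM ()
insertRI idx k v =
  case M.lookup (bucketOf idx k) (buckets idx) of
    Just tv -> modifyTVar' tv (M.insert k v)
    Nothing -> error "bucketOf produced an unknown bucket"  -- sketch only

rangeRI :: Ord k => RangedIndex k v -> k -> k -> STM [(k, v)]
rangeRI idx lo hi = do
  let tvs = [ tv | (b, tv) <- M.toAscList (buckets idx)
                 , b >= bucketOf idx lo, b <= bucketOf idx hi ]
  ms <- mapM readTVar tvs
  pure [ kv | m <- ms, kv@(k, _) <- M.toAscList m, k >= lo, k <= hi ]
```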
Thanks with regards,
Compl
On 2020/7/25 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, I have no experience with `perf` either; I'll learn these essential tools soon and put them to good use.
It's great to learn how `orElse` actually works. I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does nothing much useful at excessive cost. I put it there to help me understand the internal workings of stm, and now I have even better knowledge ;-)
I think a debugger will trap every single abort, isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 2:02 AM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
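As a crude user-level approximation (not an official GHC facility, and the helper here is made up), you can count how often a transaction body starts versus how often it commits by bumping an ordinary counter with `unsafeIOToSTM`; the gap approximates aborts/re-executions:
```
import Control.Concurrent.STM
import Data.IORef
import GHC.Conc (unsafeIOToSTM)

-- Hypothetical sketch: count transaction-body executions vs. commits. The
-- attempt counter is bumped non-transactionally on purpose, so increments
-- from aborted attempts survive; attempts minus commits approximates the
-- number of aborts/re-executions. Only reasonable because the side effect
-- is a harmless counter bump.
atomicallyCounted :: IORef Int -> IORef Int -> STM a -> IO a
atomicallyCounted attempts commits act = do
  r <- atomically $ do
    unsafeIOToSTM (atomicModifyIORef' attempts (\n -> (n + 1, ())))
    act
  atomicModifyIORef' commits (\n -> (n + 1, ()))
  return r
```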
Ryan
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue
wrote:
I'm not familiar with profiling GHC yet; I may need more time to get proficient with it.
And a bit more detail of my test workload for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files under a data dir within a shared filesystem, then mmap those data files and fill in the actual array data. So the db server doesn't have much computation to perform, but puts the data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one metadata record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another, from a queue that never falls empty.
So if `readTVar` spins while waiting, I'd expect the threads to actually show high CPU utilization, because at any instant of time, all threads except the committing one would be doing just that.
And I have something in my code to track STM retry like this:
```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
    (rtc > 0) $ do
      -- trace out the retries so the end users can be aware of them
      tid <- myThreadId
      trace
        ( "🔙\n" <> show callCtx <> "🌀 " <> show tid
            <> " stm retry #" <> show rtc
        )
        $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing -> -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```
No such trace msg fires during my test, neither in a single-thread run nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases: e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out; if these 2 ops are composed into a single tx, it is of course an infinite retry loop, and a sequence of such msgs gets logged with an ever increasing rtc #.
So I believe no retry has ever been triggered.
What can be going on there?
On 2020/7/24 11:46 PM, Ryan Yates wrote:
Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps.
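For reference, the standard shape of a transaction that does suspend looks like this (a generic pattern, not code from this thread):
```
import Control.Concurrent.STM

-- Standard pattern: block until the slot is filled. When `retry` runs, the
-- transaction validates its read set, the thread is parked on the wakeup
-- lists of the TVars it read, and a later commit that writes `slot` wakes
-- it to run the whole transaction again.
takeSlot :: TVar (Maybe a) -> STM a
takeSlot slot = do
  mv <- readTVar slot
  case mv of
    Nothing -> retry
    Just v  -> do
      writeTVar slot Nothing
      return v
```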
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue
wrote:
Thanks very much for the insightful information Ryan! I'm glad my suspicion was wrong about the Haskell scheduler:
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HashMap k v)` to be `TVar (HashMap k Int)` where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of the insertion order preservation semantics of dict entries (like that in Python 3.7+); then it's very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.
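To make that layout concrete, what I described is roughly this shape (a sketch with hypothetical names):
```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Sketch (hypothetical names) of the described layout: the HashMap maps a
-- key to a slot index and the values live in per-slot TVars, so rewriting
-- an existing entry's value touches only its own slot. Inserting new keys
-- still writes the two outer TVars, which is where stm-containers / ttrie
-- would come in.
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))
  }

writeExisting :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM Bool
writeExisting d k v = do
  im <- readTVar (odIndex d)
  case HM.lookup k im of
    Nothing -> return False
    Just i  -> do
      slots <- readTVar (odSlots d)
      writeTVar (slots V.! i) (Just v)
      return True
```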
Thanks with regards,
Compl
On 2020/7/24 10:03 PM, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled, it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
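To illustrate the difference (a toy sketch with hypothetical names, not code from any library):
```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Coarse-grained: one TVar holds the whole map, so any two updates conflict
-- at commit even when they touch different keys.
type Coarse k v = TVar (HM.HashMap k v)

updateCoarse :: (Eq k, Hashable k) => Coarse k v -> k -> v -> STM ()
updateCoarse tv k v = modifyTVar' tv (HM.insert k v)

-- Finer-grained: the outer TVar is only written when keys are added or
-- removed; updating an existing key's value touches just that key's TVar,
-- at the cost of more TVars for the STM machinery to track.
type Fine k v = TVar (HM.HashMap k (TVar v))

updateFine :: (Eq k, Hashable k) => Fine k v -> k -> v -> STM ()
updateFine tv k v = do
  m <- readTVar tv
  case HM.lookup k m of
    Just slot -> writeTVar slot v
    Nothing   -> do
      slot <- newTVar v
      writeTVar tv (HM.insert k slot m)
```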
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
Ryan

Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (added a range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , and with them I've got quite improved scalability on concurrency.
But unfortunately then I hit another wall, at single-thread scalability over working memory size. I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures; they need to mutate separate pointers concurrently to avoid contention anyway, but such a pointer-intensive heap seems to impose extraordinary pressure on GHC's garbage collector, so that GC dominates CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python clients to insert new records concurrently. In the first stage each Python process happily takes ~90% CPU filling (through local mmap) the arrays allocated from the server, and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (merely inserting metadata records into unique indices). Then the client processes' CPU utilization drops drastically once the Haskell server process' private memory reaches around 2gb, i.e. GC starts engaging: the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst for a tiny while with some log output showing progress. And I disabled parallel GC lately; enabling parallel GC only makes it worse.
If I comment out the code updating the indices (those creating many TVars), the overall throughput only drops slowly as more data are inserted, and the parallelism feels steady even after the server process' private memory takes several GBs.
I didn't expect this, but it appears to me that GHC's GC is really not good at handling a massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too, with no obviously different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, without much difference either.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memo... in searching about the symptoms, and don't feel it likely that I can convert my DB managed data into immutable types to fit into Compact Regions, not quite something a live in-mem database instance can do.
So it seems there are good reasons no successful DBMSs, at least in-memory ones, have been written in Haskell.
Best regards, Compl

Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I spot any potential issues. There was some recent work on a concurrent B-tree that may be interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl
Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency.
But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse.
If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs.
I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memo... in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do.
So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell.
Best regards, Compl
On 2020-07-25, at 22:07, Ryan Yates
wrote: Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19) https://dl.acm.org/doi/10.1145/3293883.3295711
Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?i...
The PPoPP benchmarks are on a branch (or the releases tab on github):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benc...
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe < haskell-cafe@haskell.org> wrote:
Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre...
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre...
But can't find conclusion or interpretation of that benchmark suite. And here's a followup question:
Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ?
(of course production ready libraries most desirable)
Thanks with regards,
Compl
On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote:
Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use.
It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-)
I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 上午2:02, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
Ryan
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue
wrote: I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it.
And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty.
So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing.
And I have something in my code to track STM retry like this:
``` -- blocking wait not expected, track stm retries explicitly trackSTM :: Int -> IO (Either () a) trackSTM !rtc = do when -- todo increase the threshold of reporting? (rtc > 0) $ do -- trace out the retries so the end users can be aware of them tid <- myThreadId trace ( "🔙\n" <> show callCtx <> "🌀 " <> show tid <> " stm retry #" <> show rtc ) $ return () atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case Nothing -> -- stm failed, do a tracked retry trackSTM (rtc + 1) Just ... -> ...
```
No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #.
So I believe no retry has ever been triggered.
What can going on there?
On 2020/7/24 下午11:46, Ryan Yates wrote:
Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps.
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote:
Thanks very much for the insightful information Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.
So best effort has already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%): am I right to understand it as, upon reading a TVar locked by another committing tx, a lightweight thread puts itself into the `waiting STM`, descheduled state, so the CPUs can only stay idle because not many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like those in Python 3.7+); so it looks quite hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.
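A rough sketch of that layout (illustrative types only, with no deletion or ordered iteration shown):
```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HashMap
import qualified Data.Vector as V

data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HashMap.HashMap k Int)      -- key -> slot number
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))  -- slots kept in insertion order
  }

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD (OrderedDict idxVar slotsVar) k v = do
  idx <- readTVar idxVar
  case HashMap.lookup k idx of
    Just slot -> do                       -- overwrite keeps the original position
      slots <- readTVar slotsVar
      writeTVar (slots V.! slot) (Just v)
    Nothing -> do                         -- append a fresh slot for a new key
      cell  <- newTVar (Just v)
      slots <- readTVar slotsVar
      writeTVar slotsVar (V.snoc slots cell)
      writeTVar idxVar (HashMap.insert k (V.length slots) idx)
```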
Thanks with regards,
Compl
On 2020/7/24 at 10:03 PM, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
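A minimal sketch of the two granularities (illustrative only; stm-containers and ttrie take the fine-grained idea much further):
```
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HashMap

-- coarse-grained: one TVar guards the whole map, so every write to the map
-- conflicts with every other transaction that touches it
type CoarseMap k v = TVar (HashMap.HashMap k v)

-- finer-grained: one TVar per value; updating an existing key only touches
-- that key's TVar, at the cost of more TVars for STM to track at commit
type FinerMap k v = TVar (HashMap.HashMap k (TVar v))
```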
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
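For example, a hedged sketch of forcing a value before entering the transaction, assuming the value type has an NFData instance (names are illustrative):
```
import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HashMap

-- fully evaluate the payload outside `atomically`, so the transaction body
-- (and therefore the window in which conflicts can arise) stays short
insertEvaluated :: (Eq k, Hashable k, NFData v)
                => TVar (HashMap.HashMap k v) -> k -> v -> IO ()
insertEvaluated tv k v = do
  v' <- evaluate (force v)
  atomically $ modifyTVar' tv (HashMap.insert k v')
```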
Ryan
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < haskell-cafe@haskell.org> wrote:
Thanks Chris, I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance, provided I limit the server's parallelism to the hardware capabilities. That may be because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, though only at low contention (surely there are plenty of CPU cycles to be optimized out in the next steps). I model my data after the graph model, so most data, even most indices, are localized to nodes and edges, and those can be manipulated without conflict; that's why I assumed I had a low-contention use case from the very beginning, until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can implement a more scalable global index data structure, thanks for the hint.
So an evident solution that comes to my mind now is to run the server with a pool of tx processing threads matching the number of CPU cores; client RPC requests then get queued to be executed by some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and see GHC on par in doing this, so I don't want to give up this enjoyable machinery.
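A minimal sketch of such a pool, assuming jobs are delivered on a TQueue (all names here are illustrative):
```
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- one worker per capability, so at most N transactions compete for hot TVars
runTxPool :: TQueue (STM ()) -> IO ()
runTxPool jobs = do
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue jobs)  -- blocks (via retry) while the queue is empty
    atomically job
```
RPC handler threads would then submit work with `atomically . writeTQueue jobs` instead of running the transaction themselves.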
But looking at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when a large M of lightweight threads is scheduled upon a small N of hardware capabilities: when a lightweight thread yields control during an stm transaction commit, the TVars it locked will stay locked until it's scheduled again (and again) until it can finish the commit. This way, descheduled threads can hold live threads back from progressing. I haven't gone into more detail there, but I wonder if there can be some improvement in the GHC RTS to keep an stm committing thread from being descheduled, though seemingly that may impose more starvation potential; or stm could be improved to have its TVar locks preemptable when the owning trec/thread is in a descheduled state. Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.
Best regards,
Compl
On 2020/7/24 at 12:57 AM, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.
e.g. https://hackage.haskell.org/package/stm-containers https://hackage.haskell.org/package/ttrie
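As a sketch of that zoning idea applied to a hot unique index, sharding by key hash so unrelated inserts commit against different TVars (shard count and helper names are illustrative):
```
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HashMap
import qualified Data.Vector as V

-- a fixed number of shards; a key always lands in the same shard, so the
-- uniqueness check stays local to one TVar
type ShardedIndex k v = V.Vector (TVar (HashMap.HashMap k v))

newShardedIndex :: Int -> IO (ShardedIndex k v)
newShardedIndex n = V.replicateM n (newTVarIO HashMap.empty)

-- returns False (so the caller can abort the enclosing tx) on a uniqueness violation
insertUnique :: (Eq k, Hashable k) => ShardedIndex k v -> k -> v -> STM Bool
insertUnique shards k v = do
  let tv = shards V.! (hash k `mod` V.length shards)
  m <- readTVar tv
  if HashMap.member k m
    then pure False
    else do writeTVar tv (HashMap.insert k v m); pure True
```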
It also sounds a bit like your question bumps into Amdahl's Law a bit.
All else fails, stop using STM and find something more tuned to your problem space.
Hope this helps, Chris Allen
Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something.
My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Simon
Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant-factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that, given the same size problem, performance loss shifted from synchronization to GC.
Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them, of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList`, and one tricky aspect of an STM-based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment:
```
-- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./
-- For performance reasons, this function uses 'unsafePerformIO' to access the
-- random number generator. (It would be possible to store the random number
-- generator in a 'TVar' and thus be able to access it safely from within the
-- STM monad. This, however, might cause high contention among threads.)
chooseLevel :: TSkipList k a -> Int
```
This level is chosen on insertion to determine the height of the node. When writing my own STM skip list I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability, giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen`, which involves allocation and synchronization.
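A sketch of that per-capability generator idea (using splitmix here purely for illustration; the actual implementation used an unboxed array of PCG states):
```
import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Data.Bits (countTrailingZeros)
import Data.IORef
import qualified Data.Vector as V
import System.Random.SplitMix (SMGen, mkSMGen, nextWord64)

newtype LevelGens = LevelGens (V.Vector (IORef SMGen))

-- one generator per capability: no shared generator to contend on
newLevelGens :: IO LevelGens
newLevelGens = do
  n <- getNumCapabilities
  LevelGens <$> V.generateM n (newIORef . mkSMGen . fromIntegral)

chooseLevelIO :: LevelGens -> IO Int
chooseLevelIO (LevelGens gens) = do
  (cap, _) <- threadCapability =<< myThreadId
  w <- atomicModifyIORef' (gens V.! cap) $ \g ->
         let (r, g') = nextWord64 g in (g', r)
  pure (1 + min 15 (countTrailingZeros w))  -- geometric level in [1,16], illustrative
```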
Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering; rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date).
Ryan
On 29 July 2020 at 20:41, Ryan Yates wrote (Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?):
Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl wrote:
Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (with a range scan api added against its internal data structure) from http://hackage.haskell.org/package/tskiplist , and with them I've got much improved scalability on concurrency.
But unfortunately I then hit another wall, on single-thread scalability over working memory size. I suspect it's because massively more TVars (which are pointers per se) are introduced by those "contention-free" data structures; they need to mutate separate pointers concurrently to avoid contention anyway, but such a pointer-intensive heap seems to impose extraordinary pressure on GHC's garbage collector, such that GC comes to dominate CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while; then I spin off 3 Python clients to insert new records concurrently. In the first stage each Python process happily takes ~90% CPU filling (through local mmap) the arrays allocated from the server, and logs of success scroll by quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (merely inserting meta data records into unique indices). Then the client processes' CPU utilization drops drastically once the Haskell server process' private memory reaches around 2GB, i.e. GC starts engaging: the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, occasionally bursting for a tiny while with some log output showing progress. And I disabled parallel GC lately; enabling parallel GC only makes it worse.
If I comment out the code updating the indices (the code creating many TVars), the overall throughput only drops slowly as more data is inserted, and the parallelism feels steady even after the server process' private memory takes several GBs.
I didn't expect this, but it appears to me that GHC's GC is really not good at handling a massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too, with no obviously different behavior compared to 8.8.3; and I also tried tweaking GC-related RTS options a bit, including increasing -G up to 10, with not much difference either.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html while searching about the symptoms, and I don't feel it likely that I can convert my DB-managed data into immutable types so as to fit into Compact Regions; that's not something a live in-memory database instance can readily do.
So it seems there are good reasons that no successful DBMS, at least no in-memory one, has been written in Haskell.
Best regards,
Compl
On 2020-07-25, at 22:07, Ryan Yates wrote:
Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711
Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931
The PPoPP benchmarks are on a branch (or the releases tab on github):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe < haskell-cafe@haskell.org> wrote:
Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for stm tx processing under heavy concurrency, yet still with automatic parallelism as GHC implements them. Then I realized that in addition to the hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index and needs to provide a range scan api for db clients, so a hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree
But I can't find a conclusion or interpretation of that benchmark suite. And here's a follow-up question:
Are there some STM contention-optimized data structures that keep keys ordered, with a sub-range traversal api?
(of course production ready libraries most desirable)
Thanks with regards,
Compl
On 2020/7/25 at 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, I have no experience with `perf` either; I'll learn these essential tools soon to put them to good use.
It's great to learn how `orElse` actually works; I was confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does nothing very useful at excessive cost. I put it there to help me understand the internal workings of stm, and now I have even better knowledge ;-)
I think a debugger will trap every single abort; isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 at 2:02 AM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling-based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance, and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
Thanks Ryan, and I'm honored to get Simon's attention. I did have some worry about the tskiplist package, as its github repository seems withdrawn; I emailed the maintainer Peter Robinson lately but have gotten no response so far. What particularly worries me is that the 1st sentence of the Readme changed from 1.0.0 to 1.0.1 (which is current) as:
- This package provides an implementation of a skip list in STM.
+ This package provides a proof-of-concept implementation of a skip list in STM
This has to mean something, but I can't figure out what yet. Dear Peter Robinson, I hope you can see this message and get into the loop of the discussion.
Despite that, I don't think the overhead of TVar itself is the most serious issue in my situation, as before GC engagement there are just as many TVars being allocated and updated without business progress getting stuck. And now I realize that what's pressuring GC in my situation is not only the large number of pointers (TVars); at the same time they form many circular structures, which might be a nightmare for a GC. As I model my data after the graph model, in my test workload there are many FeatureSet instances, each being an entity/node object; then there are many Feature instances per FeatureSet object, each Feature instance being a unary relationship/edge object with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to. Circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set.
I'm still curious why the new non-moving GC in 8.10.1 still doesn't give obvious business progress in my situation. I tested it on my Mac yesterday, where I don't know how to see how CPU time is distributed over the threads within a process; I'll further test it on some Linux boxes to try to understand it better.
Best regards,
Compl
Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC.
Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment:
-- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./
-- For performance reasons, this function uses 'unsafePerformIO' to access the
-- random number generator. (It would be possible to store the random number
-- generator in a 'TVar' and thus be able to access it safely from within the
-- STM monad. This, however, might cause high contention among threads.)
chooseLevel :: TSkipList k a -> Int
This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization.
Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date).
Ryan
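A minimal sketch of the per-capability generator approach Ryan describes, assuming splitmix in place of PCG and boxed IORefs instead of an unboxed array; the names are illustrative, not code from Ryan's implementation or from TSkipList:
```haskell
import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Data.Bits (countTrailingZeros)
import Data.IORef
import qualified Data.Vector as V
import System.Random.SplitMix (SMGen, mkSMGen, nextWord64)

-- One generator per capability, so concurrent inserts don't contend on a
-- single shared RNG state.
newtype RNGPool = RNGPool (V.Vector (IORef SMGen))

newRNGPool :: IO RNGPool
newRNGPool = do
  n <- getNumCapabilities
  RNGPool <$> V.generateM n (\i -> newIORef (mkSMGen (fromIntegral i + 42)))

-- Geometric level (p = 1/2) for a new skip-list node, capped at maxLevel.
chooseLevel :: RNGPool -> Int -> IO Int
chooseLevel (RNGPool gens) maxLevel = do
  (cap, _) <- threadCapability =<< myThreadId
  let ref = gens V.! (cap `mod` V.length gens)
  w <- atomicModifyIORef' ref (\g -> let (x, g') = nextWord64 g in (g', x))
  pure (min maxLevel (1 + countTrailingZeros w))
```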
On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones <simonpj@microsoft.com> wrote:

Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something.
My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Simon
From: Haskell-Cafe <haskell-cafe-bounces@haskell.org> On Behalf Of Ryan Yates
Sent: 29 July 2020 20:41
To: YueCompl <compl.yue@icloud.com>
Cc: Haskell Cafe <haskell-cafe@haskell.org>
Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl <compl.yue@icloud.com> wrote:

Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (with a range scan api added against its internal data structure) from http://hackage.haskell.org/package/tskiplist , and with them I've got much improved scalability with respect to concurrency.
But unfortunately I then hit another wall, at single-thread scalability over working memory size. I suspect it's because massively more TVars (which are pointers per se) are introduced by those "contention-free" data structures; they have to mutate separate pointers concurrently to avoid contention anyway, but such a pointer-intensive heap seems to impose extraordinary pressure on GHC's garbage collector, so that GC dominates CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while; then I spin off 3 Python clients to insert new records concurrently. In the first stage each Python process happily takes ~90% CPU filling (through local mmap) the arrays allocated from the server, and logs of success scroll quickly, while the server process uses only 30~40% CPU to serve those 3 clients (merely inserting meta data records into unique indices). Then the client processes' CPU utilization drops drastically once the Haskell server process' private memory reaches around 2GB, i.e. GC starts engaging: the server process's CPU utilization quickly approaches ~300%, while all the client processes drop to 0% for most of the time, occasionally bursting for a tiny while with some log output showing progress. And I disabled parallel GC lately; enabling parallel GC only makes it worse.
If I comment out the code updating the indices (the code creating many TVars), overall throughput only drops slowly as more data are inserted, and the parallelism feels steady even after the server process' private memory grows to several GBs.
I didn't expect this, but it appears to me that GHC's GC is really not good at handling a massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) under heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too; no obviously different behavior compared to 8.8.3. I also tried tweaking GC-related RTS options a bit, including increasing -G up to 10, without much difference either.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html while searching about the symptoms, but I don't feel it likely I can convert my DB-managed data into immutable types so as to fit into Compact Regions; that's not something a live in-memory database instance can easily do.
So it seems there are good reasons no successful DBMSes, at least in-memory ones, have been written in Haskell.
Best regards,
Compl
On 2020-07-25, at 22:07, Ryan Yates <fryguybob@gmail.com> wrote:

Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711
Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931
The PPoPP benchmarks are on a branch (or the releases tab on github):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:

Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for STM tx processing under heavy concurrency, yet still with the automatic parallelism GHC provides. Then I realized that in addition to a hash map (used to implement dicts and scopes), I also need a TreeMap-like replacement data structure to implement the db index. I've been focusing on the uniqueness-constraint aspect, but it's still an index and needs to provide a range-scan api for db clients, so a hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree
But I can't find a conclusion or interpretation of that benchmark suite. And here's a followup question:
Are there some STM contention-optimized data structures that keep keys ordered, with a sub-range traversal api?
(of course production ready libraries most desirable)
Thanks with regards,
Compl
On 2020/7/25 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, for I have no experience with `perf` either; I'll learn these essential tools soon to put them to good use.
It's great to learn how `orElse` actually works; I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does little useful work at excessive cost. I put it there to help me understand the internal workings of STM, and now I have even better knowledge ;-)
I think a debugger will trap every single abort; isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 2:02 AM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance,
changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
Ryan
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue@icloud.com> wrote:

I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it.
And a bit more detail of my test workload for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files under a data dir within a shared filesystem, then they mmap those data files and fill in the actual array data. So the db server doesn't have much computation to perform, but it puts the data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another, from a queue that never falls empty.
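As a hedged illustration of that naive global index (the names here are mine, not the actual code): every registration reads and writes one shared TVar, so all clients serialize on it.
```haskell
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM

-- A single coarse TVar holding the whole unique index of registered paths.
type PathIndex = TVar (HM.HashMap FilePath ())

-- Insert with a uniqueness check; the caller aborts the larger transaction
-- on False. Every successful insert writes the one hot TVar, so concurrent
-- inserts conflict with each other and with every reader.
registerPath :: PathIndex -> FilePath -> STM Bool
registerPath idxV p = do
  idx <- readTVar idxV
  if HM.member p idx
    then pure False
    else do
      writeTVar idxV (HM.insert p () idx)
      pure True
```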
So if `readTVar` does spin while waiting, I doubt the threads should actually drive high CPU utilization, because at any instant of time, all threads except the committing one will be doing just that one thing.
And I have something in my code to track STM retry like this:
```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
    (rtc > 0) $ do
      -- trace out the retries so the end users can be aware of them
      tid <- myThreadId
      trace
        ( "🔙\n"
            <> show callCtx
            <> "🌀"
            <> show tid
            <> " stm retry #"
            <> show rtc
        )
        $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing -> -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```
No such trace msg fires during my test, neither in single-thread runs nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases, e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out: if these 2 ops are composed into a single tx, then of course it's an infinite retry loop, and a sequence of such msgs gets logged with ever increasing rtc #.
So I believe no retry has ever been triggered.
What can be going on there?
On 2020/7/24 11:46 PM, Ryan Yates wrote:
> Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps.
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
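A tiny hedged example of the suspend-on-`retry` behavior described above (my own illustration, not code from the thread): a plain `retry` lets the RTS park the transaction on the TVars it has read and wake it only when one of them is written, unlike the always-rerun `orElse` loop shown earlier.
```haskell
import Control.Concurrent.STM

-- Blocks without burning CPU until the TVar holds a value, then takes it.
takeWhenReady :: TVar (Maybe a) -> IO a
takeWhenReady tv = atomically $ do
  mv <- readTVar tv
  case mv of
    Nothing -> retry                      -- parked; woken when tv is written
    Just v  -> v <$ writeTVar tv Nothing
```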
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue@icloud.com> wrote:

Thanks very much for the insightful information Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:
> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like those in Python 3.7+); then it's very hopeful to incorporate stm-containers' Map or ttrie to approach contention-free operation.
Thanks with regards,
Compl
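A hedged sketch of the insertion-order-preserving dict shape Compl describes above (the names are mine, not the actual code). Note the index TVar here is still a single hot spot, which is where stm-containers' Map or ttrie would come in:
```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)           -- key -> slot number
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))  -- slots in insertion order
  }

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD (OrderedDict idxV slotsV) k v = do
  idx <- readTVar idxV
  case HM.lookup k idx of
    Just i -> do
      -- existing key: only its per-entry TVar is written
      slots <- readTVar slotsV
      writeTVar (slots V.! i) (Just v)
    Nothing -> do
      -- new key: append a slot and record its position
      slots <- readTVar slotsV
      slot  <- newTVar (Just v)
      writeTVar slotsV (V.snoc slots slot)
      writeTVar idxV (HM.insert k (V.length slots) idx)
```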
On 2020/7/24 10:03 PM, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
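A hedged toy illustration of "pushing the `TVar` into the nodes" (my own example, not Ryan's benchmark code): with one TVar per bucket, only transactions touching the same bucket conflict, at the cost of more TVars for STM to track.
```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Coarse-grained: every write conflicts with every other access.
type CoarseMap k v = TVar (HM.HashMap k v)

-- Finer-grained: one TVar per bucket; only same-bucket transactions conflict.
newtype BucketMap k v = BucketMap (V.Vector (TVar (HM.HashMap k v)))

newBucketMap :: Int -> IO (BucketMap k v)
newBucketMap n = BucketMap <$> V.replicateM n (newTVarIO HM.empty)

insertBM :: (Eq k, Hashable k) => BucketMap k v -> k -> v -> STM ()
insertBM (BucketMap buckets) k v = do
  let tv = buckets V.! (hash k `mod` V.length buckets)
  modifyTVar' tv (HM.insert k v)
```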
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
Ryan
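And a small hedged example of the last point above, moving work out of the transaction body by forcing a value before entering STM (my own illustration):
```haskell
import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Pay the evaluation cost outside the transaction, so the commit window
-- only sees a cheap cons onto the list held in the TVar.
insertEvaluated :: NFData v => TVar [v] -> v -> IO ()
insertEvaluated tv v = do
  v' <- evaluate (force v)
  atomically $ modifyTVar' tv (v' :)
```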
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:

Thanks Chris, I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance, provided I limit the server's parallelism to within hardware capabilities. That may be because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, but only at low contention (surely there are plenty of CPU cycles to be optimized out in next steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges and can be manipulated without conflict; that's why I assumed I had a low-contention use case from the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I'm hopeful that stm-containers/ttrie can be used to implement a more scalable global index data structure; thanks for the hint.
So an evident solution that comes to my mind now is to run the server with a pool of tx processing threads matching the number of CPU cores, with client RPC requests queued to be executed on some thread from the pool (a minimal sketch follows). But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and see GHC on par in doing this; I don't want to give up this enjoyable machinery.
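A minimal hedged sketch of that pool idea (names are illustrative): cap the number of threads running STM work at the number of capabilities, and queue everything else.
```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

newtype TxPool = TxPool (TQueue (IO ()))

-- Fork one worker per capability; each worker drains jobs from the queue,
-- so at most N transactions are attempted at any moment.
newTxPool :: IO TxPool
newTxPool = do
  q <- newTQueueIO
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue q)
    job
  pure (TxPool q)

-- RPC handlers enqueue their transactional work instead of running it directly.
submitTx :: TxPool -> IO () -> IO ()
submitTx (TxPool q) job = atomically (writeTQueue q job)
```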
But looking at the STM implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when a large number M of lightweight threads are scheduled on a small number N of hardware capabilities: when a lightweight thread yields control during an STM transaction commit, the TVars it locked stay locked until it's scheduled again (and again) and can finish the commit. This way, descheduled threads can hold live threads back from progressing. I haven't gone into more detail there, but I wonder if there could be some improvement in the GHC RTS to keep an STM-committing thread from being descheduled, though seemingly that may impose more starvation potential; or STM could be improved to have its TVar locks preemptible when the owning trec/thread is in a descheduled state? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.
Best regards,
Compl
On 2020/7/24 12:57 AM, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.
e.g. https://hackage.haskell.org/package/stm-containers
https://hackage.haskell.org/package/ttrie
It also sounds like your question bumps into Amdahl's Law a bit.
If all else fails, stop using STM and find something more tuned to your problem space.
Hope this helps,
Chris Allen
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <haskell-cafe@haskell.org> wrote:

Hello Cafe,
I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N.
As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing.
I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction.
But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs.
Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem?
I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this.
Specifically, [7] states:
> It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.
I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch.
Best regards,
Compl
[1] Comparing the performance of concurrent linked-list implementations in Haskell
https://simonmar.github.io/bib/papers/concurrent-data.pdf
[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008.
https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
--
Chris Allen
Currently working on http://haskellbook.com

On 30.07.20 at 07:31, Compl Yue via Haskell-Cafe wrote:
And now I realize what is pressuring GC in my situation is not only the large number of pointers (TVars), but also that they form many circular structures, which might be a nightmare for a GC.
Cycles are relevant only for reference-counting collectors. As far as I understand http://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime_contr..., GHC offers only tracing collectors, and cycles are irrelevant there.
I'm still curious why the new non-moving GC in 8.10.1 still doesn't yield obvious business progress in my situation. I tested it on my Mac yesterday, but there I don't know how to see how CPU time is distributed over threads within a process; I'll test it further on some Linux boxes to try to understand it better.
Hmm... can GHC's memory management fragment? If that's the case, you may be seeing GC trying to find free blocks in fragmented memory, and having to re-run the GC cycle to free a block so there's enough contiguous memory. It's a bit of a stretch, but it can happen, and testing that hypothesis would be relatively quick: Run the program with moving GC, observe running time and if it's still slow, check if the GC is actually eating CPU, or if it's merely waiting for other threads to respond to the stop-the-world signal. Regards, Jo

Jo, I have some updates wrt the nonmoving GC in another post to the list just now. And per my understanding, GHC's GC doesn't seek free segments within a heap; it instead copies all live objects to a new heap and then swaps the new heap to be the live one, so I assume memory (address-space) fragmentation doesn't cause as much trouble for a GHC process as it does for other runtimes.
I suspect the difficulty resides in the detection of circular/cyclic structures among the live data within the old heap, especially circles formed through an arbitrary number of levels of pointer indirection. If the GC has to perform some dict lookup to decide whether an object has already been copied to the new heap, that's O(n*log(n)) complexity in the best case, where n is the number of live objects in the heap. To efficiently copy circular structures, one optimization I can imagine is to have a `new ptr` field in every heap object; then when copying another object that points to this one, the `new ptr` field can be read out, and if not nil, the pointer field on the other object's copy in the new heap is assigned that value and it's done; otherwise this object is copied to the new heap first, and its `new ptr` field in the old heap is updated to point to the copy. But I don't know the details of GHC's GC and can't even judge the feasibility of this technique. And even the new nonmoving GC may have similar difficulty in jumping out of a circle when following pointers.
Regards, Compl

Update: the nonmoving GC does make a difference. I think I couldn't observe it before because I set the heap rather large with -H2g, and generation 0 is still collected by the old moving GC, which has difficulty handling the large hazardous heap. After realizing just now that the nonmoving GC only works on the oldest generation, I tested again with `+RTS -H16m -A4m`, with and without `-xn`:
Without -xn (old moving GC in effect), the throughput degrades fast and business progress stops at ~200MB of server RSS.
With -xn (new nonmoving GC in effect), server RSS can burst to ~350MB, then throughput degrades relatively slower until RSS reaches ~1GB, after which business yield barely progresses. But RSS keeps growing, with occasional bursts of business yield, until ~3.3GB, and then it gets totally stuck.
Regards, Compl
On 2020-07-30, at 13:31, Compl Yue via Haskell-Cafe
wrote: Thanks Ryan, and I'm honored to get Simon's attention.
I did have some worry about package tskiplist, that its github repository seems withdrawn, I emailed the maintainer Peter Robinson lately but have gotten no response by far. What particularly worrying me is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as:
- This package provides an implementation of a skip list in STM.
+ This package provides a proof-of-concept implementation of a skip list in STM
This has to mean something but I can't figure out yet.
Dear Peter Robinson, I hope you can see this message and get in the loop of discussion.
Despite that, I don't think overhead of TVar itself the most serious issue in my situation, as before GC engagement, there are as many TVars being allocated and updated without stuck at business progressing. And now I realize what presuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC. As I model my data after graph model, in my test workload, there are many FeatureSet instances each being an entity/node object, then there are many Feature instances per FeatureSet object, each Feature instance being an unary relationship/edge object, with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to, circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set.
I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better.
Best regards,
Compl
On 2020/7/30 上午10:05, Ryan Yates wrote:
Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC.
Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment:
-- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./ <>-- For performance reasons, this function uses 'unsafePerformIO' to access the <>-- random number generator. (It would be possible to store the random number <>-- generator in a 'TVar' and thus be able to access it safely from within the <>-- STM monad. This, however, might cause high contention among threads.) chooseLevel :: TSkipList http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurre... k http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurre... a http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurre... -> Int
This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization.
Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date).
Ryan
On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones
mailto:simonpj@microsoft.com> wrote: Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Simon
From: Haskell-Cafe
mailto:haskell-cafe-bounces@haskell.org> On Behalf Of Ryan Yates Sent: 29 July 2020 20:41 To: YueCompl mailto:compl.yue@icloud.com> Cc: Haskell Cafe mailto:haskell-cafe@haskell.org> Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl
mailto:compl.yue@icloud.com> wrote: Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fhackage.haskell.org%2Fpackage%2Ftskiplist&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838761589&sdata=ZOvJVBqJgdGqx2k%2F49fhZeTYkWAd4GRY%2B8ZxH7cyEkI%3D&reserved=0 , with them I've got quite improved at scalability on concurrency.
But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse.
If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs.
I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memo... https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftech.channable.com%2Fposts%2F2020-04-07-lessons-in-managing-haskell-memory.html&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838761589&sdata=gqSH82%2FOYRaW4fzBDl%2BLDjhbRA%2BDRE6jaj4k1UI2gFE%3D&reserved=0 in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do.
So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell.
Best regards,
Compl
On 2020-07-25, at 22:07, Ryan Yates
mailto:fryguybob@gmail.com> wrote: Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711 https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdl.acm.org%2Fdoi%2F10.1145%2F3293883.3295711&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=h3po1gPutR%2BsiCST1N0RNkM6irnVL0%2BVbYl3Vs8F8Oc%3D&reserved=0
Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?i... https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Furresearch.rochester.edu%2FinstitutionalPublicationPublicView.action%3FinstitutionalItemId%3D34931&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=jBQMX5RRajIj0KbLWQCMt%2BMyMJIEmTpSuEHBWpq5Isg%3D&reserved=0
The PPoPP benchmarks are on a branch (or the releases tab on github):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benc... https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fwip%2Fmutable-fields%2Fbenchmarks%2FPPoPP2019%2Fsrc&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=PinsrrGPgAB9TgxH61xngSItw1DcIRf1Niq39b%2BOe0s%3D&reserved=0
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe
mailto:haskell-cafe@haskell.org> wrote: Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fstm-containers&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838781576&sdata=ZwtAltlFRkny5q7M%2B7Pople6c4WA%2Bs8vZhwewUge7eg%3D&reserved=0 and https://hackage.haskell.org/package/ttrie https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fttrie&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838781576&sdata=zMcZy%2BEzqklkQGjKglCgwg5ZoWyWZIyeRNaCcqtnECs%3D&reserved=0 can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre... https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fmaster%2Fbenchmarks%2FRBTree-Throughput&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838791571&sdata=Nl2eN81Kjaf5qyNKEaxxc0ioMw6w4QoX4b5vAE5RaF8%3D&reserved=0 https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre... https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fmaster%2Fbenchmarks%2FRBTree&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838791571&sdata=%2BLp6HQCyROOlpA2pr8BR8DPls68oY5Y77GKgqbSKmno%3D&reserved=0 But can't find conclusion or interpretation of that benchmark suite. And here's a followup question:
Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ?
(of course production ready libraries most desirable)
Thanks with regards,
Compl
On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote:
Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use.
It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-)
I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 上午2:02, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance,
changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux) https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPerf_(Linux)&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838801566&sdata=v%2Bv2aVaBITriAM26CqN%2Bp35yshLl%2BbY4BWVEIOSlStA%3D&reserved=0
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fghc%2Fghc%2Fblob%2Fmaster%2Frts%2FSTM.c%23L1275&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838801566&sdata=YBLmeg4Xxby%2BJJmO8B5etdA6tDpBYOry7jdjEoRFd%2Fk%3D&reserved=0
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fghc%2Fghc%2Fblob%2Fmaster%2Frts%2FSTM.c%23L1123&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838811560&sdata=jAEm1CpEYQx6ORikerxVHOSlaOmrTzB3m9EVmOwo%2B8w%3D&reserved=0
Ryan
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue
mailto:compl.yue@icloud.com> wrote: I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it.
And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty.
So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing.
And I have something in my code to track STM retry like this:
```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
when -- todo increase the threshold of reporting?
(rtc > 0) $ do
-- trace out the retries so the end users can be aware of them
tid <- myThreadId
trace
( "🔙\n"
<> show callCtx
<> "🌀 "
<> show tid
<> " stm retry #"
<> show rtc
)
$ return ()
atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
Nothing -> -- stm failed, do a tracked retry
trackSTM (rtc + 1)
Just ... -> ...
```
No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #.
So I believe no retry has ever been triggered.
What can be going on there?
On 2020/7/24 at 11:46 PM, Ryan Yates wrote:
Then to explain the low CPU utilization (~10%): am I right to understand that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM`, descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if its write set overlaps.
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue@icloud.com> wrote:
Thanks very much for the insightful information, Ryan! I'm glad my suspicion was wrong about the Haskell scheduler:
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.
So best effort has already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%): am I right to understand that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM`, descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Anyway, I see light in better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of the insertion-order-preserving semantics of dict entries (like that in Python 3.7+), so it's very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.
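For what it's worth, a minimal sketch (my own reconstruction from the description above, with assumed names like `OrderedDict` and `upsert`) of that shape, where updating an existing entry only writes that entry's own slot TVar:

```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Keys map to slot numbers; values live in per-slot TVars, preserving
-- insertion order in the slots vector.
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)            -- key -> slot number
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))   -- slots in insertion order
  }

newOrderedDict :: STM (OrderedDict k v)
newOrderedDict = OrderedDict <$> newTVar HM.empty <*> newTVar V.empty

upsert :: (Eq k, Hashable k) => k -> v -> OrderedDict k v -> STM ()
upsert k v d = do
  idx <- readTVar (odIndex d)
  case HM.lookup k idx of
    Just i -> do                        -- existing key: touch only its slot
      slots <- readTVar (odSlots d)
      writeTVar (slots V.! i) (Just v)
    Nothing -> do                       -- new key: append a slot, record its index
      slot  <- newTVar (Just v)
      slots <- readTVar (odSlots d)
      writeTVar (odSlots d) (V.snoc slots slot)
      writeTVar (odIndex d) (HM.insert k (V.length slots) idx)
```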
Thanks with regards,
Compl
On 2020/7/24 at 10:03 PM, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
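As one small illustration of pushing TVars inward (my own sketch, not from Ryan's message, with assumed names): a hot counter can be striped across several TVars so that writers using different stripe hints rarely touch the same TVar at commit:

```
import Control.Concurrent.STM
import Control.Monad (replicateM)
import qualified Data.Vector as V

-- Concurrent increments with different hints usually hit different TVars,
-- so they do not conflict with each other at commit time.
newtype StripedCounter = StripedCounter (V.Vector (TVar Int))

newStripedCounter :: Int -> IO StripedCounter
newStripedCounter n = StripedCounter . V.fromList <$> replicateM n (newTVarIO 0)

-- Callers pass a stripe hint, e.g. something derived from their thread id.
incr :: StripedCounter -> Int -> STM ()
incr (StripedCounter ts) hint =
  modifyTVar' (ts V.! (hint `mod` V.length ts)) (+ 1)

-- Reading the total still reads every stripe, so keep such reads infrequent.
total :: StripedCounter -> STM Int
total (StripedCounter ts) = sum <$> mapM readTVar (V.toList ts)
```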
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
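A small, hedged illustration of that point (my own sketch, using a hypothetical `insertRecord` over a `TVar (HashMap k v)`): force the value before entering the transaction so the body reduces to a single strict write:

```
import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Evaluate the record fully before the transaction, then keep the transaction
-- body to one strict write, so commit is reached quickly and the TVar never
-- holds a chain of thunks.
insertRecord :: (Eq k, Hashable k, NFData v)
             => TVar (HM.HashMap k v) -> k -> v -> IO ()
insertRecord var k v = do
  v' <- evaluate (force v)                        -- pay the evaluation cost outside the tx
  atomically $ modifyTVar' var (HM.insert k v')   -- the tx body stays short
```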
Ryan
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Thanks Chris, I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism to within hardware capabilities. That may be because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, but only at low contention (surely there are plenty of CPU cycles to be optimized out in the next steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges; those can be manipulated without conflict, which is why I assumed I have a low-contention use case from the very beginning, until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can implement a more scalable global index data structure, thanks for the hint.
So an evident solution that comes to my mind now is to run the server with a pool of tx processing threads matching the number of CPU cores, with client RPC requests queued to be executed by some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and see GHC on par in doing this; I don't want to give up this enjoyable machinery.
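For reference, such a pool might look roughly like the sketch below (my own, with assumed names like `TxPool` and `submit`; it ignores returning results to callers):

```
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- One shared queue of STM jobs, drained by as many worker threads as there
-- are capabilities; client handlers enqueue jobs instead of running the
-- transactions themselves, so at most N transactions are in flight at once.
newtype TxPool = TxPool (TQueue (STM ()))

startTxPool :: IO TxPool
startTxPool = do
  q <- newTQueueIO
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue q)   -- block until a job arrives
    atomically job                     -- run the client's transaction
  return (TxPool q)

submit :: TxPool -> STM () -> IO ()
submit (TxPool q) job = atomically (writeTQueue q job)
```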
But looking at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when there are large-M lightweight threads scheduled upon small-N hardware capabilities: when a lightweight thread yields control during an stm transaction commit, the TVars it has locked stay locked until it is scheduled again (and again) until it can finish the commit. This way, descheduled threads could hold live threads back from progressing. I haven't gone into more detail there, but I wonder if the GHC RTS could be improved to keep an stm committing thread from being descheduled, though seemingly that may impose more starvation potential; or stm could be improved to have its TVar locks preemptable when the owner trec/thread is in a descheduled state? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.
Best regards,
Compl
On 2020/7/24 at 12:57 AM, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.
e.g. https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie
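For illustration, a sketch (mine, not from Chris's message; module name and signatures as in recent stm-containers versions, worth double-checking against the package docs) of composing two such containers in one transaction to enforce a uniqueness constraint:

```
import Control.Concurrent.STM
import qualified StmContainers.Map as StmMap

-- Two separate containers, touched together only when a transaction needs both.
data Db = Db
  { dbRecords :: StmMap.Map Int String   -- record id -> payload
  , dbByPath  :: StmMap.Map String Int   -- unique path -> record id
  }

newDb :: IO Db
newDb = Db <$> StmMap.newIO <*> StmMap.newIO

-- Inserts into both maps atomically, refusing the insert (returning False)
-- when the path is already taken, so the uniqueness constraint holds.
insertUnique :: Db -> Int -> String -> IO Bool
insertUnique db rid path = atomically $ do
  existing <- StmMap.lookup path (dbByPath db)
  case existing of
    Just _  -> return False
    Nothing -> do
      StmMap.insert rid  path (dbByPath db)   -- note stm-containers' value-before-key order
      StmMap.insert path rid  (dbRecords db)
      return True
```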
It also sounds like your question bumps into Amdahl's Law a bit.
All else fails, stop using STM and find something more tuned to your problem space.
Hope this helps,
Chris Allen
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Hello Cafe,
I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N.
As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing.
I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction.
But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs.
Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem?
I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this.
Specifically, [7] states:
It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.
I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch.
Best regards,
Compl
[1] Comparing the performance of concurrent linked-list implementations in Haskell
https://simonmar.github.io/bib/papers/concurrent-data.pdf
[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008.
https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
--
Chris Allen
Currently working on http://haskellbook.com

What I haven't seen anyone mention/ask yet is: are you using the threaded runtime? (Presumably yes.) And are you using a high number of capabilities (like +RTS -N)? Because that will enable parallel GC, which has notoriously poor behaviour with default settings and high numbers of capabilities. I've seen two orders of magnitude speedups in my own code by disabling the parallel GC in the threaded runtime. Cheers, Merijn
On 30 Jul 2020, at 10:00, YueCompl via Haskell-Cafe wrote:
Update: the nonmoving GC does make a difference.
I think I couldn't observe it before because I set the heap rather large with -H2g, and generation 0 is still collected by the old moving GC, which has difficulty handling such a large, hazardous heap. After realizing just now that the nonmoving GC only works on the oldest generation, I tested again with `+RTS -H16m -A4m`, with and without `-xn`:
Without -xn (old moving GC in effect), the throughput degrades fast and business progress stops at ~200MB of server RSS.
With -xn (new nonmoving GC in effect), server RSS can burst to ~350MB, then throughput degrades relatively more slowly until RSS reaches ~1GB, after which business barely progresses. RSS can keep growing, with occasional bursts of business yield, until ~3.3GB, and then it gets totally stuck.
Regards, Compl
On 2020-07-30, at 13:31, Compl Yue via Haskell-Cafe
wrote: Thanks Ryan, and I'm honored to get Simon's attention.
I do have some worry about the tskiplist package: its GitHub repository seems withdrawn, and I emailed the maintainer Peter Robinson lately but have gotten no response so far. What particularly worries me is that the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as:
- This package provides an implementation of a skip list in STM.
+ This package provides a proof-of-concept implementation of a skip list in STM
This has to mean something, but I can't figure out what yet.
Dear Peter Robinson, I hope you can see this message and get in the loop of discussion.
Despite that, I don't think the overhead of TVar itself is the most serious issue in my situation, as before GC engagement there are just as many TVars being allocated and updated without business progress getting stuck. And now I realize what's pressuring GC in my situation is not only the large number of pointers (TVars); at the same time they form many circular structures, which might be a nightmare for a GC. As I model my data after a graph model, in my test workload there are many FeatureSet instances, each being an entity/node object, and many Feature instances per FeatureSet object, each Feature instance being a unary relationship/edge object with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to. Circular structures form because I also maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set.
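Roughly, the shape described above looks like this (a sketch with assumed field names, just to picture the cycles):

```
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Each Feature points back to its FeatureSet, and each FeatureSet keeps a
-- weight-keyed index of its Features, so the two sides reference each other
-- through TVars and form cycles on the heap.
data FeatureSet = FeatureSet
  { fsName     :: String
  , fsFeatures :: TVar (M.Map Double Feature)   -- index keyed by weight
  }

data Feature = Feature
  { ftWeight :: Double
  , ftSet    :: TVar FeatureSet                 -- back reference closing the cycle
  }
```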
I'm still curious why the new non-moving GC in 8.10.1 still doesn't get obvious business progressing in my situation. I tested it on my Mac yesterday, and there I don't know how to see how CPU time is distributed over threads within a process; I'll further test it with some Linux boxes to try to understand it better.
Best regards,
Compl
On 2020/7/30 at 10:05 AM, Ryan Yates wrote:
Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC.
Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment:
-- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./
-- For performance reasons, this function uses 'unsafePerformIO' to access the
-- random number generator. (It would be possible to store the random number
-- generator in a 'TVar' and thus be able to access it safely from within the
-- STM monad. This, however, might cause high contention among threads.)
chooseLevel :: TSkipList k a -> Int
This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization.
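In that spirit, a hedged sketch (mine, not Ryan's code; it uses splitmix generators held in IORefs rather than the unboxed PCG array he describes) of per-capability randomness for choosing levels:

```
import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Data.Bits (countTrailingZeros)
import Data.IORef
import qualified Data.Vector as V
import System.Random.SplitMix (SMGen, mkSMGen, nextWord64)

-- One generator per capability, so concurrent level choices never share state.
newtype LevelGens = LevelGens (V.Vector (IORef SMGen))

newLevelGens :: IO LevelGens
newLevelGens = do
  n <- getNumCapabilities
  LevelGens . V.fromList <$> mapM (newIORef . mkSMGen . fromIntegral) [1 .. n]

-- Geometric level choice using the current capability's generator.
chooseLevel :: LevelGens -> Int -> IO Int
chooseLevel (LevelGens gens) maxLevel = do
  (cap, _) <- threadCapability =<< myThreadId
  let ref = gens V.! (cap `mod` V.length gens)
  w <- atomicModifyIORef' ref (\g -> case nextWord64 g of (x, g') -> (g', x))
  return (min maxLevel (1 + countTrailingZeros w))
```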
Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date).
Ryan
On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones
wrote: Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Simon
From: Ryan Yates (via Haskell-Cafe)
Sent: 29 July 2020 20:41
To: YueCompl
Cc: Haskell Cafe
Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl
wrote: Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (with a range scan api added against its internal data structure) from http://hackage.haskell.org/package/tskiplist ; with them I've got much improved scalability under concurrency.
But unfortunately I then hit another wall, at single-thread scalability over working memory size. I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures; they need to mutate separate pointers concurrently to avoid contention anyway, but such a pointer-intensive heap seems to impose extraordinary pressure on GHC's garbage collector, so that GC dominates CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python clients to insert new records concurrently. In the first stage each Python process happily takes ~90% CPU filling (through local mmap) the arrays allocated from the server, and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (merely inserting meta data records into unique indices). Then the client processes' CPU utilization drops drastically once the Haskell server process' private memory reaches around 2GB, i.e. GC starts engaging: the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, occasionally bursting for a tiny while with some log output showing progress. And I disabled parallel GC lately; enabling parallel GC only makes it worse.
If I comment out the code updating the indices (the code creating many TVars), the overall throughput only drops slowly as more data is inserted, and the parallelism feels steady even after the server process' private memory takes several GBs.
I didn't expect this, but it appears to me that GHC's GC is really not good at handling a massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too; no obviously different behavior compared to 8.8.3. I also tried tweaking GC-related RTS options a bit, including increasing -G up to 10, with not much difference either.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memo... while searching about the symptoms, and I don't feel it likely that I can convert my DB-managed data into immutable types so as to fit into Compact Regions; that's not something a live in-mem database instance can readily do.
So it seems there are good reasons that no successful DBMS, at least no in-memory one, has been written in Haskell.
Best regards,
Compl
On 2020-07-25, at 22:07, Ryan Yates
wrote: Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711
Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?i...
The PPoPP benchmarks are on a branch (or the releases tab on github):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benc...
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe
wrote: Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for stm tx processing under heavy concurrency, yet still with automatic parallelism as GHC implements them. Then I realized that in addition to the hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index and needs to provide a range scan api for db clients, so a hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre...
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre...
But I can't find a conclusion or interpretation of that benchmark suite. And here's a followup question:
Where are some STM contention-optimized data structures that keep keys ordered, with a sub-range traversal api?
(of course production ready libraries most desirable)
Thanks with regards,
Compl
On 2020/7/25 at 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, for I have no experience with `perf` either; I'll learn these essential tools soon to put them to good use.
It's great to learn how `orElse` actually works; I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does little useful work at excessive cost. I put it there to help me understand the internal workings of stm, and now I have even better knowledge ;-)
I think a debugger will trap every single abort; isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 at 2:02 AM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance,
changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
Ryan

Hi Merijn,
Yes, I always use -threaded, even for single-thread tests. I did my tests with `+RTS -N10 -A128m -qg -I0` by default, and tinkered with `-qn5 -qb1 -qg1`, `-G3`, `-G5`, even `-G10` and some slightly tuned combinations, all with no apparent improvement. And yes, I've found the parallel GC terribly affecting the throughput (easily thrashing with a high number of concurrent driving clients, inducing a high portion of kernel CPU utilization with little business progress), so lately I prefer to disable it with `-qg`, or at least limit the number of participating capabilities with `-qn1` ~ `-qn5`.
Btw, it feels like once an RTS option is added by `-with-rtsopts=` at compile time, the same option can not be overridden from the command line. I had thought a command-line `+RTS xx` would always take highest precedence and override other sources (I see the env var GHCRTS documented but haven't used it yet), but it appears the compile-time `-with-rtsopts=` is final. So lately I compile only with `ghc-options: -Wall -threaded -rtsopts` in the `executable` section of my .cabal file, and test various RTS options on the command line for each run.
Thanks with regards, Compl
On 2020-07-30, at 16:32, Merijn Verstraaten
wrote: What I haven't seen anyone mention/ask yet is: are you using the threaded runtime? (Presumably yes.) And are you using a high number of capabilities (like +RTS -N)? Because that will enable parallel GC, which has notoriously poor behaviour with default settings and high numbers of capabilities.
I've seen two orders of magnitude speedups in my own code by disabling the parallel GC in the threaded runtime.
Cheers, Merijn
On 30 Jul 2020, at 10:00, YueCompl via Haskell-Cafe
wrote: Update: nonmoving GC does make differences
I think couldn't observe it because I set the heap -H2g rather large, and generation 0 are still collected by old moving GC which having difficulty in handling the large hazard heap. After I realize just now that nonmoving GC only works against oldest generation, I tested it again with `+RTS -H16m -A4m` with and without `-xn`, then:
Without -xn (old moving GC in effect), the throughput degrades fast and stop business progressing at ~200MB of server RSS
With -xn (new nonmvoing GC in effect), server RSS can burst to ~350MB, then throughput degrades relative slower, until RSS reached ~1GB, after then barely progressing at business yielding. But RSS can keep growing with occasional burst fashioned business yield, until ~3.3GB then it totally stuck.
Regards, Compl
On 2020-07-30, at 13:31, Compl Yue via Haskell-Cafe
wrote: Thanks Ryan, and I'm honored to get Simon's attention.
I did have some worry about package tskiplist, that its github repository seems withdrawn, I emailed the maintainer Peter Robinson lately but have gotten no response by far. What particularly worrying me is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as:
- This package provides an implementation of a skip list in STM.
+ This package provides a proof-of-concept implementation of a skip list in STM
This has to mean something but I can't figure out yet.
Dear Peter Robinson, I hope you can see this message and get in the loop of discussion.
Despite that, I don't think overhead of TVar itself the most serious issue in my situation, as before GC engagement, there are as many TVars being allocated and updated without stuck at business progressing. And now I realize what presuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC. As I model my data after graph model, in my test workload, there are many FeatureSet instances each being an entity/node object, then there are many Feature instances per FeatureSet object, each Feature instance being an unary relationship/edge object, with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to, circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set.
I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better.
Best regards,
Compl
On 2020/7/30 上午10:05, Ryan Yates wrote:
Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC.
Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment:
-- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./ -- For performance reasons, this function uses 'unsafePerformIO' to access the -- random number generator. (It would be possible to store the random number -- generator in a 'TVar' and thus be able to access it safely from within the -- STM monad. This, however, might cause high contention among threads.) chooseLevel :: TSkipList k a -> Int
This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization.
Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date).
Ryan
On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones
wrote: Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Simon
From: Haskell-Cafe
On Behalf Of Ryan Yates Sent: 29 July 2020 20:41 To: YueCompl Cc: Haskell Cafe Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl
wrote: Hi Cafe and Ryan,
I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency.
But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress.
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse.
If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs.
I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.
Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too.
I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...
Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memo... in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do.
So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell.
Best regards,
Compl
On 2020-07-25, at 22:07, Ryan Yates
wrote: Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is:
Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711
Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?i...
The PPoPP benchmarks are on a branch (or the releases tab on github):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benc...
All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.
Ryan
On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe
wrote: Dear Cafe,
As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index.
I see Ryan shared the code benchmarking RBTree with stm in mind:
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTre...
But I can't find conclusions or an interpretation of that benchmark suite. And here's a follow-up question:
Are there any contention-optimized STM data structures that keep keys ordered and provide a sub-range traversal API?
(Of course production-ready libraries are most desirable.)
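(To be concrete, the API shape I'm looking for is roughly the following; the class and its names are hypothetical, not taken from any existing package:)
```
import Prelude hiding (lookup)
import Control.Concurrent.STM (STM)

-- A hypothetical interface for an ordered, contention-optimized STM map.
class OrderedSTMMap m where
  insert    :: Ord k => k -> v -> m k v -> STM ()
  lookup    :: Ord k => k -> m k v -> STM (Maybe v)
  -- range scan over [lo, hi), yielding entries in key order
  rangeScan :: Ord k => k -> k -> m k v -> STM [(k, v)]
```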
Thanks with regards,
Compl
On 2020/7/25 at 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, for I have no experience with `perf` either; I'll learn these essential tools soon and put them to good use.
It's great to learn how `orElse` actually works; I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does little useful work at excessive cost. I put it there to help me understand the internal workings of stm, and now I have even better knowledge ;-)
I think a debugger will trap every single abort; isn't that annoying when many aborts occur? And if I'd like to count the number of aborts, ideally broken down per service endpoint, time period, source module etc., are there some tricks for that?
Thanks with best regards,
Compl
On 2020/7/25 at 2:02 AM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance,
changing the size of heap objects can drastically change cache performance and completely different behavior can show up.
[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )
[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.
The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:
https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
Ryan
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote:
I'm not familiar with profiling GHC yet; I may need more time to get proficient with it.
And a bit more detail on my test workload, for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register a path for their data files under a data dir within a shared filesystem, then mmap'ing those data files and filling in the actual array data. So the db server doesn't have much computation to perform; it just puts the data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one metadata record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another from a queue that never falls empty.
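(For clarity, the contended operation boils down to something like the sketch below; the types are simplified stand-ins, not my real schema:)
```
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

data MetaRecord = MetaRecord  -- stand-in for the real metadata

type PathIndex = TVar (Map.Map FilePath MetaRecord)

-- Register a data file path, enforcing uniqueness. Every client's
-- transaction reads and writes this single TVar, so all inserts conflict.
registerPath :: PathIndex -> FilePath -> MetaRecord -> STM Bool
registerPath idx path meta = do
  m <- readTVar idx
  case Map.lookup path m of
    Just _  -> return False                        -- duplicate, reject
    Nothing -> do
      writeTVar idx (Map.insert path meta m)
      return True
```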
So if `readTVar` does spin-wait, shouldn't the threads actually show high CPU utilization? Because at any instant of time, all threads except the committing one would be doing just that one thing.
And I have something in my code to track STM retry like this:
```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
    (rtc > 0) $ do
      -- trace out the retries so the end users can be aware of them
      tid <- myThreadId
      trace
        ( "🔙\n"
            <> show callCtx
            <> "🌀 "
            <> show tid
            <> " stm retry #"
            <> show rtc
        )
        $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing -> -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```
No such trace msg fires during my test, neither in a single-thread run nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases: e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out; if these 2 ops are composed into a single tx, then of course it's an infinite retry loop, and a sequence of such msgs is logged with an ever-increasing rtc #.
So I believe no retry has ever been triggered.
What can be going on there?
On 2020/7/24 at 11:46 PM, Ryan Yates wrote:
Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps.
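(As a small illustration of that wake-on-write behaviour, here's a sketch with an assumed slot type: the transaction below suspends on `retry` and is only woken when another commit writes the TVar it has read:)
```
import Control.Concurrent.STM

-- Blocks (descheduled, no CPU spent) until some other transaction's
-- commit fills the slot, which wakes this transaction up again.
takeSlot :: TVar (Maybe a) -> STM a
takeSlot slot = do
  mv <- readTVar slot
  case mv of
    Nothing -> retry                  -- suspend; woken by a write to `slot`
    Just x  -> do
      writeTVar slot Nothing
      return x
```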
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.
Ryan
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote:
Thanks very much for the insightful information, Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
So best effort has already been made in GHC and I just need to cooperate better with its design. Then, to explain the low CPU utilization (~10%), am I right to understand it as: upon reading a TVar locked by another committing tx, a lightweight thread will put itself into the `waiting STM`, descheduled state, so the CPUs can only stay idle as not many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into a `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like in Python 3.7+); then it's very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.
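(Roughly this shape, sketched with illustrative names only:)
```
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V
import Data.Hashable (Hashable)

-- Keys map to slot numbers; values live in per-slot TVars appended in
-- insertion order, so iteration can follow slot order (Python 3.7+ style).
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))
  }

lookupOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> STM (Maybe v)
lookupOD d k = do
  ix <- readTVar (odIndex d)
  case HM.lookup k ix of
    Nothing -> return Nothing
    Just i  -> do
      slots <- readTVar (odSlots d)
      readTVar (slots V.! i)
```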
Thanks with regards,
Compl
On 2020/7/24 at 10:03 PM, Ryan Yates wrote:
Hi Compl,
Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.
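(A sketch of that contrast, with illustrative helper names rather than any particular library:)
```
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM
import Data.Hashable (Hashable)

-- Coarse: every update rewrites the one TVar, so any two writers conflict.
bumpCoarse :: (Eq k, Hashable k) => TVar (HM.HashMap k Int) -> k -> STM ()
bumpCoarse var k = modifyTVar' var (HM.insertWith (+) k 1)

-- Finer: updates to existing keys touch only that key's TVar, so writers
-- on different keys don't conflict; only inserting a new key writes the
-- outer TVar (at the cost of more TVars for STM to track per transaction).
bumpFine :: (Eq k, Hashable k) => TVar (HM.HashMap k (TVar Int)) -> k -> STM ()
bumpFine var k = do
  m <- readTVar var
  case HM.lookup k m of
    Just cell -> modifyTVar' cell (+ 1)
    Nothing   -> do
      cell <- newTVar 1
      writeTVar var (HM.insert k cell m)
```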
There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
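(For instance, a sketch of the "force outside, write inside" pattern; `updateWith` and its parameters are illustrative only:)
```
import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Do the expensive pure work and force it *outside* the transaction, so
-- the STM body is only a quick write and the conflict window stays short.
updateWith :: NFData v => TVar v -> (input -> v) -> input -> IO ()
updateWith var f x = do
  newVal <- evaluate (force (f x))   -- work happens here, no TVars involved
  atomically (writeTVar var newVal)  -- short transaction body
```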
Ryan
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe wrote:
Thanks Chris, I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance, provided I limit the server's parallelism to hardware capabilities. That may be because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, but only at low contention (surely there are plenty of CPU cycles to be optimized out in next steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges; those can be manipulated without conflict. That's why I assumed I had a low-contention use case from the very beginning, until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel hopeful about using stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me.
So an evident solution comes to my mind now: run the server with a pool of tx processing threads, matching the number of CPU cores, with client RPC requests queued to be executed by some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and see GHC on par in doing this; I don't want to give up this enjoyable machinery.
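(The pool idea would look roughly like the sketch below; the names and the choice of a TQueue are illustrative only:)
```
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- A fixed pool of transaction-processing threads (e.g. one per capability)
-- fed from a job queue; massive client concurrency then waits in the queue
-- instead of as runnable Haskell threads competing over hot TVars.
startPool :: Int -> IO (IO () -> IO ())
startPool nWorkers = do
  jobs <- newTQueueIO
  replicateM_ nWorkers . forkIO . forever $ do
    job <- atomically (readTQueue jobs)
    job
  return $ \job -> atomically (writeTQueue jobs job)
```
(A connection handler would then call the returned submit action instead of forking one Haskell thread per request.)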
But looking at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when a large M of lightweight threads is scheduled upon a small N of hardware capabilities: when a lightweight thread yields control during an stm transaction commit, the TVars it locked stay so until it's scheduled again (and again) until it can finish the commit. This way, descheduled threads could hold live threads back from progressing. I haven't gone into more details there, but I wonder if there can be some improvement in the GHC RTS to keep an stm-committing thread from being descheduled, though seemingly that may impose more starvation potential; or could stm be improved to have its TVar locks preemptible when the owner trec/thread is in a descheduled state? Neither would be easy, but I'd really love massive lightweight threads doing STM practically well.
Best regards,
Compl
On 2020/7/24 at 12:57 AM, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.
e.g. https://hackage.haskell.org/package/stm-containers
https://hackage.haskell.org/package/ttrie
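(And a toy sketch of the zoning idea above, splitting one hot counter into per-zone TVars; the names are made up:)
```
import Control.Concurrent.STM
import Control.Monad (replicateM)
import Data.Hashable (Hashable, hash)

-- One hot counter split across a fixed number of zone TVars: writers hash
-- to a zone, so most commits touch different TVars. Reading the grand
-- total still visits (and conflicts with) every zone.
newtype Sharded = Sharded [TVar Int]

newSharded :: Int -> IO Sharded
newSharded n = Sharded <$> replicateM n (newTVarIO 0)

bump :: Hashable k => Sharded -> k -> STM ()
bump (Sharded ts) k = modifyTVar' (ts !! (hash k `mod` length ts)) (+ 1)

total :: Sharded -> STM Int
total (Sharded ts) = sum <$> mapM readTVar ts
```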
It also sounds a bit like your question bumps into Amdahl's Law.
All else fails, stop using STM and find something more tuned to your problem space.
Hope this helps,
Chris Allen
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe wrote:
Hello Cafe,
I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N.
As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing.
I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction.
But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs.
Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem?
I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this.
Specifically, [7] states:
It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.
I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch.
Best regards,
Compl
[1] Comparing the performance of concurrent linked-list implementations in Haskell
https://simonmar.github.io/bib/papers/concurrent-data.pdf
[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008.
https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
--
Chris Allen
Currently working on http://haskellbook.com

Simon Peyton Jones via Haskell-Cafe writes:
Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something.
My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Compl, If you want to discuss the issue feel free to get in touch on IRC. I would be happy to help.
It would be great if we had something of a decision tree for performance tuning of Haskell code in the users guide or Wiki. We have so many tools yet there isn't a comprehensive overview of
1. what factors might affect which runtime characteristics of your program
2. which tools can be used to measure which factors
3. how these factors can be improved
Cheers,
- Ben

Hi Ben, Thanks as always for your great support! At the moment I'm working on a minimal working example to reproduce the symptoms; I intend to work out a program that depends only on libraries bundled with GHC, so it can be easily diagnosed without my complex env, but so far no repro yet. I'll come back with some piece of code once it can reproduce something. Thanks in advance. Sincerely, Compl
On 2020-07-31, at 21:36, Ben Gamari wrote:
Simon Peyton Jones via Haskell-Cafe writes:
Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something.
My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet.
Maybe someone with experience of performance debugging might feel able to help Compl?
Compl,
If you want to discuss the issue feel free to get in touch on IRC. I would be happy to help.
It would be great if we had something of a decision tree for performance tuning of Haskell code in the users guide or Wiki. We have so many tools yet there isn't a comprehensive overview of
1. what factors might affect which runtime characteristics of your program 2. which tools can be used to measure which factors 3. how these factors can be improved
Cheers,
- Ben

Hi Devs & Cafe, I'd like to report back my progress on this; I've actually reached a rough conclusion. TL;DR:
For data-intensive workloads, the on-chip cache of an x86_64 CPU is a hardware bottleneck: it's very hard to scale up with added cores, so long as they share that cache by being on a single chip.
For the details: I developed a minimal script interpreter for diagnostic purposes, depending only on libraries bundled with GHC; the source repository is at https://github.com/complyue/txs

I benchmarked it on my machine with a single 6-core Xeon E5 CPU chip, for contention-free read/write performance scaling, and got these numbers (also at https://github.com/complyue/txs/blob/master/results/baseline.csv):

populate
  conc  thread avg tps  scale  eff
  1     1741            1.00   1.00
  2     1285            1.48   0.74
  3     1028            1.77   0.59
  4     843             1.94   0.48
  5     696             2.00   0.40
  6     600             2.07   0.34

scan
  conc  thread avg tps  scale  eff
  1     1565            1.00   1.00
  2     1285            1.64   0.82
  3     1018            1.95   0.65
  4     843             2.15   0.54
  5     696             2.22   0.44
  6     586             2.25   0.37

The script is at https://github.com/complyue/txs/blob/master/scripts/scan.txs and the GHC command line is in https://github.com/complyue/txs/blob/master/metric.bash :

ghc --make -Wall -threaded -rtsopts -prof -o txs -outputdir . -stubdir . -i../src ../src/Main.hs && ( ./txs +RTS -N10 -A32m -H256m -qg -I0 -M5g -T -s <../scripts/"${SCRIPT}".txs )

I intended to use a single Haskell-based process to handle metadata about the many ndarrays being crunched, acting as a centralized graph database. As it turned out, many clients queueing to query/insert metadata against a single database node create more data throughput than a few CPU chips can handle well. We didn't expect this, but apparently we'll have to deploy more machines for such a database instance, with data partitioned and distributed to more nodes for load balancing. (A single machine with many CPU sockets, and thus many NUMA nodes, is not an option for us either.)

Since the flexibility a central graph database would provide is not currently a crucial requirement of our business, we are not interested in developing this database system further. We currently have CPU-intensive workloads handled by a cluster of machines running Python processes (crunching numbers with Numpy and C++ tensors), while some Haskell-based number-crunching software is still under development. It may turn out some day that heavier computation gets bound to the db access, effectively creating CPU-intensive workloads for the database functionality; then we'll have the opportunity to dive deeper into the database implementation. And in case more flexibility is required in the near future, I think I'll tend to implement embedded database instances in those worker processes, in contrast to centralized db servers.

I wonder whether ARM servers will make scaling up data-intensive workloads easier, though that's not a feasible option for us in the near term either.

Thanks to everyone who has been helpful!

Best regards, Compl

I have never done it, but I think you can make GDB count the times a
breakpoint is hit using conditional breakpoints. Someone else may know of
better tools.
participants (9)
- Ben Gamari
- Christopher Allen
- Compl Yue
- Compl Yue
- Joachim Durchholz
- Merijn Verstraaten
- Ryan Yates
- Simon Peyton Jones
- YueCompl