Re: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

24 Jul 2020

      Thanks Chris, I confess I didn't pay enough attention to STM specialized 
container libraries by far, I skimmed through the description of 
stm-containers and ttrie, and feel they would definitely improve my 
code's performance in case I limit the server's parallelism within 
hardware capabilities. That may because I'm still prototyping the api 
and infrastructure for correctness, so even `TVar (HashMap k v)` 
performs okay for me at the moment, only if at low contention (surely 
there're plenty of CPU cycles to be optimized out in next steps). I 
model my data after graph model, so most data, even most indices are 
localized to nodes and edges, those can be manipulated without conflict, 
that's why I assumed I have a low contention use case since the very 
beginning - until I found there are still (though minor) needs for 
global indices to guarantee global uniqueness, I feel faithful with 
stm-containers/ttrie to implement a more scalable global index data 
structure, thanks for hinting me.

So an evident solution comes into my mind now, is to run the server with 
a pool of tx processing threads, matching number of CPU cores, client 
RPC requests then get queued to be executed in some thread from the 
pool. But I'm really fond of the mechanism of M:N scheduler which solves 
massive/dynamic concurrency so elegantly. I had some good result with Go 
in this regard, and see GHC at par in doing this, I don't want to give 
up this enjoyable machinery.

But looked at the stm implementation in GHC, it seems written TVars are 
exclusively locked during commit of a tx, I suspect this is the culprit 
when there're large M lightweight threads scheduled upon a small N 
hardware capabilities, that is when a lightweight thread yield control 
during an stm transaction commit, the TVars it locked will stay so until 
it's scheduled again (and again) till it can finish the commit. This 
way, descheduled threads could hold live threads from progressing. I 
haven't gone into more details there, but wonder if there can be some 
improvement for GHC RTS to keep an stm committing thread from 
descheduled, but seemingly that may impose more starvation potential; or 
stm can be improved to have its TVar locks preemptable when the owner 
trec/thread is in descheduled state? Neither should be easy but I'd 
really love massive lightweight threads doing STM practically well.

Best regards,

Compl

On 2020/7/24 上午12:57, Christopher Allen wrote:
...
It seems like you know how to run practical tests for tuning thread 
count and contention for throughput. Part of the reason you haven't 
gotten a super clear answer is "it depends." You give up fairness when 
you use STM instead of MVars or equivalent structures. That means a 
long running transaction might get stampeded by many small ones 
invalidating it over and over. The long-running transaction might 
never clear if the small transactions keep moving the cheese. I 
mention this because transaction runtime and size and count all affect 
throughput and latency. What might be ideal for one pattern of work 
might not be ideal for another. Optimizing for overall throughput 
might make the contention and fairness problems worse too. I've done 
practical tests to optimize this in the past, both for STM in Haskell 
and for RDBMS workloads.
The next step is sometimes figuring out whether you really need a data 
structure within a single STM container or if perhaps you can break up 
your STM container boundaries into zones or regions that roughly map 
onto update boundaries. That should make the transactions churn less. 
On the outside chance you do need to touch more than one container in 
a transaction, well, they compose.
e.g. https://hackage.haskell.org/package/stm-containers
https://hackage.haskell.org/package/ttrie
It also sounds a bit like your question bumps into Amdahl's Law a bit.
All else fails, stop using STM and find something more tuned to your 
problem space.
Hope this helps,
Chris Allen
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe 
mailto:haskell-cafe@haskell.org> wrote:
Hello Cafe,
I'm working on an in-memory database, in Client/Server mode I just
    let each connected client submit remote procedure call running in
    its dedicated lightweight thread, modifying TVars in RAM per its
    business needs, then in case many clients connected concurrently
    and trying to insert new data, if they are triggering global index
    (some TVar) update, the throughput would drop drastically. I
    reduced the shared state to a simple int counter by TVar, got same
    symptom. While the parallelism feels okay when there's no hot TVar
    conflicting, or M is not much greater than N.
As an empirical test workload, I have a `+RTS -N10` server
    process, it handles 10 concurrent clients okay, got ~5x of single
    thread throughput; but in handling 20 concurrent clients, each of
    the 10 CPUs can only be driven to ~10% utilization, the throughput
    seems even worse than single thread. More clients can even drive
    it thrashing without much  progressing.
 I can understand that pure STM doesn't scale well after reading
    [1], and I see it suggested [7] attractive and planned future work
    toward that direction.
But I can't find certain libraries or frameworks addressing large
    M over small N scenarios, [1] experimented with designated N
    parallelism, and [7] is rather theoretical to my empirical needs.
Can you direct me to some available library implementing the
    methodology proposed in [7] or other ways tackling this problem?
I think the most difficult one is that a transaction should commit
    with global indices (with possibly unique constraints) atomically
    updated, and rollback with any violation of constraints, i.e.
    transactions have to cover global states like indices. Other
    problems seem more trivial than this.
Specifically, [7] states:
> It must be emphasized that all of the mechanisms we deploy
    originate, in one form or another, in the database literature from
    the 70s and 80s. Our contribution is to adapt these techniques to
    software transactional memory, providing more effective solutions
    to important STM problems than prior proposals.
I wonder any STM based library has simplified those techniques to
    be composed right away? I don't really want to implement those
    mechanisms by myself, rebuilding many wheels from scratch.
Best regards,
    Compl
[1] Comparing the performance of concurrent linked-list
    implementations in Haskell
    https://simonmar.github.io/bib/papers/concurrent-data.pdf
[7] M. Herlihy and E. Koskinen. Transactional boosting: a
    methodology for highly-concurrent transactional objects. In Proc.
    of PPoPP ’08, pages 207–216. ACM Press, 2008.
    https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
_______________________________________________
    Haskell-Cafe mailing list
    To (un)subscribe, modify options or view archives go to:
    http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
    Only members subscribed via the mailman list are allowed to post.
-- 
Chris Allen
Currently working on http://haskellbook.com

Re: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

Compl Yue