
While I can't contribute any Haskell knowledge, I do know that many threads updating the same variable is about the worst thing you can do in a multithreaded program. Not only do you create a single bottleneck; if your threads run on multiple cores, you also get CPU pipeline stalls, L1 cache line flushes, and/or complicated cache coherency protocols running between the cores. None of that is cheap: each of these mechanisms can cost hundreds of CPU cycles, on CPUs that can otherwise execute several instructions per cycle. Incrementing a global counter really is a hard problem in multithreading.

I believe this is why databases typically offer a SEQUENCE mechanism, and why those sequences are usually implemented as "whenever a transaction asks for a sequence number, reserve a block of 1,000 numbers for it, so it can hand out 999 more numbers without any synchronization overhead" (there's a rough sketch of this block-reservation trick in the P.S. below). It is also why real databases use transactions: they don't just isolate processes from each other's updates, they also let the DB run each transaction against a snapshot and do all the synchronization once, at COMMIT time. And, as you just discovered, it's one of the major optimization areas in database engines :-)

TL;DR for the bad news: I suspect the contention itself is unavoidable.

However, I see a workaround: delayed index updates. Keep each index twice, as last-known and next-future.

last-known is whatever the previous index build produced. Alongside it you keep an append-only list of records whose indexed fields have changed since that build, and every search that uses the index also has to do a linear scan over that list.

next-future is built in the background: take last-known plus the updates from the append-only list and generate a new index. Once next-future is finished, swap it in as the new last-known. You still need a global lock while swapping the indexes, but you take it once per rebuild rather than once per update. (There's a second sketch of this scheme in the P.S. below.)

You'll have to tune parameters such as "at what point do I start a new index build", and you'll have to make sure the linear list doesn't become the next bottleneck. There are lock-free data structures for that, but they are complicated; alternatively, you can tell application programmers to collect as many updates as possible into one transaction so there are fewer synchronization points. However, overly large transactions can overflow the CPU caches if the collected update data grows too big, so there's a whole lot of tweaking, studying real performance data, and hopefully finding the right set of diagnostics that lets the DB pick the right moment for its index rebuilds automatically, and so on.

TL;DR for the good news: You can coalesce N updates into one and divide the core-coordination overhead by a factor of N. You will increase the bus pressure, so there's plenty of fine tuning you can do (or skip) after getting the first 90% of the speedup. (I'm drawing purely speculative numbers out of my hat here.)

Liability: You will want to add transactions and (probably) optimistic locking if you don't have them already; transaction boundaries are the natural point for coalescing updates.

Regards,
Jo
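
P.S. Since the question is about Haskell after all, here is a rough sketch of the block-reservation idea in Haskell syntax. Treat it as illustrative pseudocode rather than idiomatic code: the names (Sequence, Worker, nextId) and the block size are made up for this example, and the only point is that shared state is touched once per block instead of once per ID.

```haskell
import Data.IORef

-- How many IDs a worker reserves per trip to the shared counter.
blockSize :: Int
blockSize = 1000

-- Shared state: the next unreserved ID.
newtype Sequence = Sequence (IORef Int)

newSequence :: IO Sequence
newSequence = Sequence <$> newIORef 0

-- Per-worker state: the half-open range of IDs this worker has reserved.
data LocalBlock = LocalBlock { nextLocal :: !Int, blockEnd :: !Int }

newtype Worker = Worker (IORef LocalBlock)

-- Start with an empty block so the first call triggers a reservation.
newWorker :: IO Worker
newWorker = Worker <$> newIORef (LocalBlock 0 0)

-- Hand out the next ID. Only one atomic operation on shared state per
-- 'blockSize' calls; each Worker is assumed to be used by a single thread,
-- so its own IORef needs no synchronization.
nextId :: Sequence -> Worker -> IO Int
nextId (Sequence shared) (Worker local) = do
  LocalBlock n end <- readIORef local
  if n < end
    then do
      writeIORef local (LocalBlock (n + 1) end)
      pure n
    else do
      start <- atomicModifyIORef' shared (\s -> (s + blockSize, s))
      writeIORef local (LocalBlock (start + 1) (start + blockSize))
      pure start

main :: IO ()
main = do
  sq <- newSequence
  w  <- newWorker
  ids <- mapM (const (nextId sq w)) [1 .. 5 :: Int]
  print ids  -- [0,1,2,3,4]
```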
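
And here is an equally rough sketch of the last-known / next-future scheme, again with invented names (IndexedStore, recordUpdate, rebuildIndex) and a plain Data.Map standing in for whatever index structure you actually use. Note that the writers here still share one pending list, so it isn't the lock-free version mentioned above; it only shows where the single synchronization point per rebuild sits.

```haskell
-- Sketch of the last-known / next-future index, using STM (stm package)
-- for the swap. A Data.Map stands in for the real index structure.
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

type Key = Int
type Row = String

data IndexedStore = IndexedStore
  { lastKnown :: TVar (M.Map Key Row)  -- index from the last rebuild
  , pending   :: TVar [(Key, Row)]     -- newest-first list of later updates
  }

newStore :: IO IndexedStore
newStore = IndexedStore <$> newTVarIO M.empty <*> newTVarIO []

-- Writers only prepend to the pending list; the index itself is untouched
-- on the hot path. (The pending TVar is still shared state -- see the
-- caveat above about making that part cheaper.)
recordUpdate :: IndexedStore -> Key -> Row -> IO ()
recordUpdate store k v = atomically $ modifyTVar' (pending store) ((k, v) :)

-- Readers check the pending list first (newest entry wins), then the index.
lookupRow :: IndexedStore -> Key -> IO (Maybe Row)
lookupRow store k = atomically $ do
  ps  <- readTVar (pending store)
  idx <- readTVar (lastKnown store)
  pure (maybe (M.lookup k idx) Just (lookup k ps))

-- Background rebuild: snapshot the current state, build next-future outside
-- any lock, then publish it and drop exactly the updates it absorbed in a
-- single transaction -- the one synchronization point per rebuild.
rebuildIndex :: IndexedStore -> IO ()
rebuildIndex store = do
  (idx, ps) <- atomically $
    (,) <$> readTVar (lastKnown store) <*> readTVar (pending store)
  -- ps is newest-first; foldr inserts oldest entries first, so newer
  -- values overwrite older ones for the same key.
  let nextFuture = foldr (\(k, v) m -> M.insert k v m) idx ps
  atomically $ do
    writeTVar (lastKnown store) nextFuture
    -- Keep only the updates that arrived after our snapshot.
    modifyTVar' (pending store) (\cur -> take (length cur - length ps) cur)

main :: IO ()
main = do
  store <- newStore
  recordUpdate store 1 "alice"
  recordUpdate store 1 "alice v2"
  print =<< lookupRow store 1  -- Just "alice v2", served from the pending list
  rebuildIndex store
  print =<< lookupRow store 1  -- Just "alice v2", now served from the index
```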