[sajith@gmail.com: Google Summer of Code: a NUMA wishlist!]

Greetings,
I originally posted this message to haskell-cafe, and was told that
this list might be a more appropriate place to seek feedback. I'm
curious to know what people think of this proposal.
Thanks in advance!
Regards,
Sajith.
----- Forwarded message from Sajith T S

On 26/03/2012 04:25, Sajith T S wrote:
Date: Sun, 25 Mar 2012 22:49:52 -0400
From: Sajith T S
To: The Haskell Cafe
Subject: Google Summer of Code: a NUMA wishlist!

Dear Cafe,
It's last minute-ish to bring this up (in my part of the world it's still March 25), but graduate students are famously a busy and lazy lot. :) I study at Indiana University Bloomington, and I wish to propose^W rush in this proposal and solicit feedback, mentors, etc. while I can.
Since the student application deadline is April 6, I figure we can beat this into the shape of a real proposal by then. This probably also falls on the naive and ambitious side of things, and I might not even know what I'm talking about, but let's see! That's the idea of a proposal, yes?
Broadly, the idea is to improve support for NUMA systems. Specifically:
-- Real physical processor affinity with forkOn [1]. Can we fire up all CPUs if we want to? (Currently, the number passed to forkOn is interpreted modulo the value returned by getNumCapabilities [2].)
You can get real processor affinity with +RTS -qa in combination with forkOn.
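For concreteness, here is a minimal sketch (illustrative only, not code from this thread) of the pattern under discussion: fork one pinned worker per capability with forkOn, then run the program with "+RTS -N -qa" so that the capabilities themselves are bound to OS cores.

import Control.Concurrent
import Control.Monad (forM, forM_)

main :: IO ()
main = do
  n <- getNumCapabilities                  -- set by the -N RTS flag
  dones <- forM [0 .. n - 1] $ \cap -> do
    done <- newEmptyMVar
    _ <- forkOn cap $ do
      -- real work would go here; note that 'cap' is interpreted
      -- modulo getNumCapabilities, as mentioned above
      putMVar done ()
    return done
  forM_ dones takeMVar                     -- wait for all workers

Whether each capability then lands on (and stays on) a distinct physical core is up to the RTS and the OS scheduler, which is exactly where the NUMA questions below come in.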
-- Also kind of associated with the above: when launching processes, we might want to specify a list of CPUs rather than the number of CPUs. Say, a -N [0,1,3] flag rather than a -N 3 flag. This would also let us gawk at real pretty htop [3] output.
I like that idea.
-- From a very recent discussion on parallel-haskell [4], we learn that the RTS's NUMA support could be improved. The hypothesis is that allocating nurseries per Capability might be a better plan than using a global pool. We might borrow/steal ideas from hwloc [5] for this.
I like this idea too (since I suggested it :-).
-- Finally, a logging/monitoring infrastructure to verify assumptions and determine if/how work stays local.
I'm not sure if you're suggesting a *new* logging/monitoring framework
here, but in any case it would make much more sense to extend ghc-events
and ThreadScope rather than building something new. There is ongoing
work to have ThreadScope understand the output of the Linux "perf" tool,
which would give insight into CPU scheduling activity amongst other
things. Talk to Duncan Coutts about how far this is along and the best way for a GSoC project to help.
(I would like to acknowledge my fellow conspirators and leave them unnamed, lest they be embarrassed by my... naivete.)
Thanks, Sajith.
[1] http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Concurren...
[2] http://www.haskell.org/ghc/docs/latest/html/libraries/base/Control-Concurren...
[3] http://htop.sourceforge.net/
[4] http://groups.google.com/group/parallel-haskell/browse_thread/thread/7ec1ebc...
[5] http://www.open-mpi.org/projects/hwloc/

Hi Simon,
Thanks for the reply. It seems that forwarding the message here was a
very good idea!
Simon Marlow wrote:
-- From a very recent discussion on parallel-haskell [4], we learn that the RTS's NUMA support could be improved. The hypothesis is that allocating nurseries per Capability might be a better plan than using a global pool. We might borrow/steal ideas from hwloc [5] for this.
I like this idea too (since I suggested it :-).
I guess you will also be available for eventual pestering about this stuff, then? :)
-- Finally, a logging/monitoring infrastructure to verify assumptions and determine if/how work stays local.
I'm not sure if you're suggesting a *new* logging/monitoring framework here, but in any case it would make much more sense to extend ghc-events and ThreadScope rather than building something new. There is ongoing work to have ThreadScope understand the output of the Linux "perf" tool, which would give insight into CPU scheduling activity amongst other things. Talk to Duncan Coutts
about how far this is along and the best way for a GSoC project to help (usually it works best when the GSoC project is not dependent on, or depended on by, other ongoing projects - reducing synchronisation overhead and latency due to blocking is always good!).
Again, thanks for all this information. Certainly, enhancing the existing machinery makes more sense than building something brand new. There's a ticket for this proposal now, and it would be great to get (more?) feedback there, on this list, or in some other suitable place. Obviously we'd need to rework some of it, especially the "thread pinning" part.
http://hackage.haskell.org/trac/summer-of-code/ticket/1618
Regards, Sajith.
-- "the lyf so short, the craft so long to lerne." -- Chaucer.
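As an aside on the eventlog route suggested above: one lightweight way to verify assumptions without any new infrastructure is to emit user events from the program itself and inspect them in ThreadScope or with the ghc-events tool. A small sketch (the event strings here are made up for illustration):

import Control.Concurrent
import Control.Monad (forM_)
import Debug.Trace (traceEventIO)   -- writes a user event to the eventlog

main :: IO ()
main = do
  n <- getNumCapabilities
  forM_ [0 .. n - 1] $ \cap ->
    forkOn cap $ do
      traceEventIO ("worker start on cap " ++ show cap)
      -- ... do some work here ...
      traceEventIO ("worker done on cap " ++ show cap)
  threadDelay 1000000   -- crude: give the workers a moment to finish

Compile with -threaded -eventlog and run with "+RTS -N -l" to produce an .eventlog file, which ThreadScope (or "ghc-events show") can display alongside the RTS's own scheduler and GC events.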

On 27/03/2012 01:14, Sajith T S wrote:
Hi Simon,
Thanks for the reply. It seems that forwarding the message here was a very good idea!
Simon Marlow wrote:
-- From a very recent discussion on parallel-haskell [4], we learn that the RTS's NUMA support could be improved. The hypothesis is that allocating nurseries per Capability might be a better plan than using a global pool. We might borrow/steal ideas from hwloc [5] for this.
I like this idea too (since I suggested it :-).
I guess you will also be available for eventual pestering about this stuff, then? :)
Sure. Do you have a NUMA machine to test on?
Cheers, Simon

On March 28, 2012 04:41:16 Simon Marlow wrote:
Sure. Do you have a NUMA machine to test on?
My understanding is that non-NUMA machines went away when AMD and Intel moved away from front-side buses (FSBs) and integrated the memory controllers on die. Intel is more recent to this game. I believe AMD's last non-NUMA machines were the Athlon XP series and Intel's the Core 2 series.

An easy way to see what you've got is to see what 'numactl --hardware' says. If the node distance matrix is not uniform, you have NUMA hardware. As an example, on an 8-socket Opteron machine (32 cores) you get:

$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 16140 MB
node 0 free: 3670 MB
node 1 size: 16160 MB
node 1 free: 3472 MB
node 2 size: 16160 MB
node 2 free: 4749 MB
node 3 size: 16160 MB
node 3 free: 4542 MB
node 4 size: 16160 MB
node 4 free: 3110 MB
node 5 size: 16160 MB
node 5 free: 1963 MB
node 6 size: 16160 MB
node 6 free: 1715 MB
node 7 size: 16160 MB
node 7 free: 2862 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10

On our more traditional NUMA machine there are 64 nodes and the numbers range from 10 to 37. But it's an older SGI Itanium solution, so that comes with its own set of problems, and most modern machines already outperform it.

Cheers! -Tyson

Tyson Whitehead wrote:
Intel is more recent to this game. I believe AMD's last non-NUMA machines were the Athlon XP series and Intel's the Core 2 series.
An easy way to see what you've got is to see what 'numactl --hardware' says.
Ah, thanks. I trust this one qualifies?

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 16370 MB
node 0 free: 14185 MB
node 1 cpus: 1 5 9 13 17 21 25 29
node 1 size: 16384 MB
node 1 free: 10071 MB
node 2 cpus: 2 6 10 14 18 22 26 30
node 2 size: 16384 MB
node 2 free: 14525 MB
node 3 cpus: 3 7 11 15 19 23 27 31
node 3 size: 16384 MB
node 3 free: 13598 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

/proc/cpuinfo says there are 32 "Intel(R) Xeon(R) CPU E7- 4830" CPUs. And here's the result from "lstopo":

Machine (64GB)
  NUMANode #0 (phys=0 16GB) + Socket #0 + L3 #0 (24MB)
    L2 #0 (256KB) + L1 #0 (32KB) + Core #0 + PU #0 (phys=0)
    L2 #1 (256KB) + L1 #1 (32KB) + Core #1 + PU #1 (phys=4)
    L2 #2 (256KB) + L1 #2 (32KB) + Core #2 + PU #2 (phys=8)
    L2 #3 (256KB) + L1 #3 (32KB) + Core #3 + PU #3 (phys=12)
    L2 #4 (256KB) + L1 #4 (32KB) + Core #4 + PU #4 (phys=16)
    L2 #5 (256KB) + L1 #5 (32KB) + Core #5 + PU #5 (phys=20)
    L2 #6 (256KB) + L1 #6 (32KB) + Core #6 + PU #6 (phys=24)
    L2 #7 (256KB) + L1 #7 (32KB) + Core #7 + PU #7 (phys=28)
  NUMANode #1 (phys=1 16GB) + Socket #1 + L3 #1 (24MB)
    L2 #8 (256KB) + L1 #8 (32KB) + Core #8 + PU #8 (phys=1)
    L2 #9 (256KB) + L1 #9 (32KB) + Core #9 + PU #9 (phys=5)
    L2 #10 (256KB) + L1 #10 (32KB) + Core #10 + PU #10 (phys=9)
    L2 #11 (256KB) + L1 #11 (32KB) + Core #11 + PU #11 (phys=13)
    L2 #12 (256KB) + L1 #12 (32KB) + Core #12 + PU #12 (phys=17)
    L2 #13 (256KB) + L1 #13 (32KB) + Core #13 + PU #13 (phys=21)
    L2 #14 (256KB) + L1 #14 (32KB) + Core #14 + PU #14 (phys=25)
    L2 #15 (256KB) + L1 #15 (32KB) + Core #15 + PU #15 (phys=29)
  NUMANode #2 (phys=2 16GB) + Socket #2 + L3 #2 (24MB)
    L2 #16 (256KB) + L1 #16 (32KB) + Core #16 + PU #16 (phys=2)
    L2 #17 (256KB) + L1 #17 (32KB) + Core #17 + PU #17 (phys=6)
    L2 #18 (256KB) + L1 #18 (32KB) + Core #18 + PU #18 (phys=10)
    L2 #19 (256KB) + L1 #19 (32KB) + Core #19 + PU #19 (phys=14)
    L2 #20 (256KB) + L1 #20 (32KB) + Core #20 + PU #20 (phys=18)
    L2 #21 (256KB) + L1 #21 (32KB) + Core #21 + PU #21 (phys=22)
    L2 #22 (256KB) + L1 #22 (32KB) + Core #22 + PU #22 (phys=26)
    L2 #23 (256KB) + L1 #23 (32KB) + Core #23 + PU #23 (phys=30)
  NUMANode #3 (phys=3 16GB) + Socket #3 + L3 #3 (24MB)
    L2 #24 (256KB) + L1 #24 (32KB) + Core #24 + PU #24 (phys=3)
    L2 #25 (256KB) + L1 #25 (32KB) + Core #25 + PU #25 (phys=7)
    L2 #26 (256KB) + L1 #26 (32KB) + Core #26 + PU #26 (phys=11)
    L2 #27 (256KB) + L1 #27 (32KB) + Core #27 + PU #27 (phys=15)
    L2 #28 (256KB) + L1 #28 (32KB) + Core #28 + PU #28 (phys=19)
    L2 #29 (256KB) + L1 #29 (32KB) + Core #29 + PU #29 (phys=23)
    L2 #30 (256KB) + L1 #30 (32KB) + Core #30 + PU #30 (phys=27)
    L2 #31 (256KB) + L1 #31 (32KB) + Core #31 + PU #31 (phys=31)

-- "the lyf so short, the craft so long to lerne." -- Chaucer.

On March 28, 2012 12:40:02 Sajith T S wrote:
Tyson Whitehead wrote:
Intel is more recent to this game. I believe AMD's last non-NUMA machines were the Athlon XP series and Intel's the Core 2 series.
An easy way to see what you've got is to see what 'numactl --hardware' says.
Ah, thanks. I trust this one qualifies?
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 4 8 12 16 20 24 28
node 0 size: 16370 MB
node 0 free: 14185 MB
node 1 cpus: 1 5 9 13 17 21 25 29
node 1 size: 16384 MB
node 1 free: 10071 MB
node 2 cpus: 2 6 10 14 18 22 26 30
node 2 size: 16384 MB
node 2 free: 14525 MB
node 3 cpus: 3 7 11 15 19 23 27 31
node 3 size: 16384 MB
node 3 free: 13598 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10
Yup. For sure. Here's an example from a 4-socket 16-core non-NUMA Intel Xeon:

$ numactl --hardware
available: 1 nodes (0)
node 0 size: 129260 MB
node 0 free: 1304 MB
node distances:
node   0
  0:  10

On a NUMA system I believe you should be able to get an idea of the worst-case penalty a program experiences from having all of its memory accesses go across the QuickPath Interconnect (Intel)/HyperTransport (AMD) by forcing it to execute on one socket while using the memory of another:

$ numactl --cpunodebind=0 --membind=1 <program>

There is also some good information under /proc/<PID>/numa_maps. See the man page for details, but basically it tells you how many pages are associated with each node for each part of the program's address space. Note that file-backed pages don't always reside on the node you'd expect, because the system may already have mapped them into memory on another node for an earlier process.

Apologies if you are already familiar with these items.

Cheers! -Tyson

PS: That looks like a pretty sweet box you've got going there. :)

Tyson Whitehead wrote:
Apologies if you are already familiar with these items.
I wasn't, until today. Thanks for all these tips! They're all going to be mightily useful.
PS: That looks like a pretty sweet box you've got going there. :)
It is pretty muscular, yeah. :)
-- "the lyf so short, the craft so long to lerne." -- Chaucer.
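As a follow-up to the numa_maps tip above, a program can also check its own page placement. Here is a small sketch (Linux-only, and the parsing is deliberately simplistic) that tallies resident pages per NUMA node by reading the "N<node>=<pages>" fields of /proc/self/numa_maps:

import Data.List (foldl', isPrefixOf)
import qualified Data.Map as M

-- Sum the "N<node>=<pages>" tokens of numa_maps, keyed by node number.
pagesPerNode :: String -> M.Map Int Int
pagesPerNode = foldl' add M.empty . concatMap words . lines
  where
    add m tok
      | "N" `isPrefixOf` tok
      , (node, '=':count) <- break (== '=') (drop 1 tok)
      , [(n, "")] <- reads node
      , [(c, "")] <- reads count
      = M.insertWith (+) n (c :: Int) m
      | otherwise = m

main :: IO ()
main = do
  m <- fmap pagesPerNode (readFile "/proc/self/numa_maps")
  mapM_ (\(node, pages) ->
           putStrLn ("node " ++ show node ++ ": " ++ show pages ++ " pages"))
        (M.toList m)

Comparing these per-node totals with and without numactl --cpunodebind/--membind is a quick sanity check that pages really end up where you asked.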

On 28/03/2012 16:57, Tyson Whitehead wrote:
On March 28, 2012 04:41:16 Simon Marlow wrote:
Sure. Do you have a NUMA machine to test on?
My understanding is that non-NUMA machines went away when AMD and Intel moved away from front-side buses (FSBs) and integrated the memory controllers on die.
Intel is more recent to this game. I believe AMD's last non-NUMA machines were the Athlon XP series and Intel's the Core 2 series.
An easy way to see what you've got is to see what 'numactl --hardware' says. If the node distance matrix is not uniform, you have NUMA hardware.
As an example, on an 8-socket Opteron machine (32 cores) you get:
$ numactl --hardware
available: 8 nodes (0-7)
node 0 size: 16140 MB
node 0 free: 3670 MB
node 1 size: 16160 MB
node 1 free: 3472 MB
node 2 size: 16160 MB
node 2 free: 4749 MB
node 3 size: 16160 MB
node 3 free: 4542 MB
node 4 size: 16160 MB
node 4 free: 3110 MB
node 5 size: 16160 MB
node 5 free: 1963 MB
node 6 size: 16160 MB
node 6 free: 1715 MB
node 7 size: 16160 MB
node 7 free: 2862 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  20  20  20  20  20  20  20
  1:  20  10  20  20  20  20  20  20
  2:  20  20  10  20  20  20  20  20
  3:  20  20  20  10  20  20  20  20
  4:  20  20  20  20  10  20  20  20
  5:  20  20  20  20  20  10  20  20
  6:  20  20  20  20  20  20  10  20
  7:  20  20  20  20  20  20  20  10
Well, you learn something new every day! On the new 32-core Opteron box we have here:

available: 8 nodes (0-7)
node 0 cpus: 0 4 8 12
node 0 size: 8182 MB
node 0 free: 1994 MB
node 1 cpus: 16 20 24 28
node 1 size: 8192 MB
node 1 free: 2783 MB
node 2 cpus: 3 7 11 15
node 2 size: 8192 MB
node 2 free: 2961 MB
node 3 cpus: 19 23 27 31
node 3 size: 8192 MB
node 3 free: 5359 MB
node 4 cpus: 2 6 10 14
node 4 size: 8192 MB
node 4 free: 3030 MB
node 5 cpus: 18 22 26 30
node 5 size: 8192 MB
node 5 free: 4667 MB
node 6 cpus: 1 5 9 13
node 6 size: 8192 MB
node 6 free: 3240 MB
node 7 cpus: 17 21 25 29
node 7 size: 8192 MB
node 7 free: 4031 MB
node distances:
node   0   1   2   3   4   5   6   7
  0:  10  16  16  22  16  22  16  22
  1:  16  10  16  22  22  16  22  16
  2:  16  16  10  16  16  16  16  22
  3:  22  22  16  10  16  16  22  16
  4:  16  22  16  16  10  16  16  16
  5:  22  16  16  16  16  10  22  22
  6:  16  22  16  22  16  22  10  16
  7:  22  16  22  16  16  22  16  10

The node distances on this box are less uniform than yours.

Cheers, Simon