
#8224: Excessive system time -- new IO manager problem?
-------------------------------------+-------------------------------------
        Reporter:  rrnewton          |             Owner:
            Type:  bug               |            Status:  new
        Priority:  high              |         Milestone:  7.8.1
       Component:  Runtime System    |           Version:  7.7
      Resolution:                    |          Keywords:  IO Manager,
Operating System:  Linux             |                     System Time
 Type of failure:  Runtime           |      Architecture:  x86_64 (amd64)
  performance bug                    |        Difficulty:  Unknown
       Test Case:                    |        Blocked By:
        Blocking:                    |   Related Tickets:
-------------------------------------+-------------------------------------

Comment (by AndreasVoellmy):

 One interesting thing I just noticed is that `taskset` makes a big
 difference. `excessive_system_2_HEAD` is the empty program compiled with
 HEAD, and `excessive_system_2_ONEACTIVELOOP` is the empty program compiled
 with only one IO manager loop being started. For `excessive_system_2_HEAD`,
 using `taskset` to restrict the threads to CPUs 0-9 (all on one NUMA node)
 makes a big difference:

 {{{
 $ time ./excessive_system_2_HEAD +RTS -N32

 real    0m0.046s
 user    0m0.020s
 sys     0m0.188s
 }}}

 {{{
 $ time taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32

 real    0m0.025s
 user    0m0.000s
 sys     0m0.056s
 }}}

 `excessive_system_2_ONEACTIVELOOP` also performs better with taskset, but
 the difference is much smaller: without taskset I see system time of
 30ms-60ms, and with taskset it is about 10-20ms.

 So maybe the explanation is that with 31 extra safe foreign calls (from
 calls to `epoll_wait`) we get about 31 extra OS threads; if we are unlucky
 and the OS schedules these on different NUMA nodes, we pay for cross-node
 cache traffic, and a spin lock then spins longer, accounting for the extra
 system time?
 Here is a run of `excessive_system_2_HEAD` under `perf stat`, first with
 taskset restricting it to the first NUMA node and then with no
 restriction:

 {{{
 $ perf_3.2.0-25 stat taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32

  Performance counter stats for 'taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32':

          25.019318 task-clock                #    0.900 CPUs utilized
                584 context-switches          #    0.023 M/sec
                218 CPU-migrations            #    0.009 M/sec
              1,248 page-faults               #    0.050 M/sec
         36,387,323 cycles                    #    1.454 GHz                  [62.17%]
         26,057,951 stalled-cycles-frontend   #   71.61% frontend cycles idle [98.77%]
         17,556,988 stalled-cycles-backend    #   48.25% backend cycles idle
         22,358,612 instructions              #    0.61  insns per cycle
                                              #    1.17  stalled cycles per insn
          4,300,427 branches                  #  171.884 M/sec
            169,950 branch-misses             #    3.95% of all branches      [42.87%]

        0.027811151 seconds time elapsed
 }}}

 {{{
 $ perf_3.2.0-25 stat taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32

  Performance counter stats for 'taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32':

         157.684408 task-clock                #    3.055 CPUs utilized
                588 context-switches          #    0.004 M/sec
                147 CPU-migrations            #    0.001 M/sec
              1,245 page-faults               #    0.008 M/sec
        267,826,389 cycles                    #    1.698 GHz                  [54.94%]
        227,802,954 stalled-cycles-frontend   #   85.06% frontend cycles idle [66.06%]
        184,677,238 stalled-cycles-backend    #   68.95% backend cycles idle  [86.81%]
         70,842,584 instructions              #    0.26  insns per cycle
                                              #    3.22  stalled cycles per insn [98.87%]
         16,579,295 branches                  #  105.142 M/sec                [96.70%]
            233,921 branch-misses             #    1.41% of all branches      [97.31%]

        0.051618204 seconds time elapsed
 }}}

 It is interesting that without the restriction we execute many more
 instructions and also retire fewer instructions per cycle.

 Also interesting, here are the top few entries from a perf record with
 taskset 0-9:

 {{{
   5.64%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
   5.47%  excessive_syste  [kernel.kallsyms]        [k] resched_task
   4.97%  excessive_syste  excessive_system_2_HEAD  [.] evacuate1
   4.84%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
   4.18%  excessive_syste  libc-2.15.so             [.] _int_malloc
   3.85%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
   3.81%  excessive_syste  [kernel.kallsyms]        [k] dequeue_entity
   3.48%  excessive_syste  [kernel.kallsyms]        [k] find_vma
   3.14%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
   3.12%  excessive_syste  [kernel.kallsyms]        [k] source_load
   3.12%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
   2.62%  excessive_syste  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
   2.33%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap_ctx
   2.13%  excessive_syste  [kernel.kallsyms]        [k] kmem_cache_alloc
   2.07%  excessive_syste  excessive_system_2_HEAD  [.] schedule
 }}}

 and with no restriction:

 {{{
  43.82%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
   6.44%  excessive_syste  [kernel.kallsyms]        [k] default_send_IPI_mask_sequence_phys
   4.68%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
   3.45%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
   3.31%  excessive_syste  libc-2.15.so             [.] new_heap
   2.86%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
   2.40%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap
   1.61%  excessive_syste  [kernel.kallsyms]        [k] __schedule
   1.60%  excessive_syste  excessive_system_2_HEAD  [.] s8Bc_info
   1.56%  excessive_syste  [kernel.kallsyms]        [k] inherit_event.isra.73
   1.50%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
   1.49%  excessive_syste  libc-2.15.so             [.] __clone
   1.45%  excessive_syste  [kernel.kallsyms]        [k] clear_page_c
   1.44%  excessive_syste  [kernel.kallsyms]        [k] try_to_wake_up
 }}}

 We are pretty clearly spending more time in a spin lock when we run
 across all the NUMA nodes.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8224#comment:17>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler