
#8224: Excessive system time -- new IO manager problem?
-------------------------------------+-------------------------------------
        Reporter:  rrnewton          |             Owner:
            Type:  bug               |            Status:  new
        Priority:  high              |         Milestone:  7.8.1
       Component:  Runtime System    |           Version:  7.7
      Resolution:                    |          Keywords:  IO Manager,
Operating System:  Linux             |                     System Time
 Type of failure:  Runtime           |      Architecture:  x86_64 (amd64)
  performance bug                    |        Difficulty:  Unknown
       Test Case:                    |        Blocked By:
        Blocking:                    |   Related Tickets:
-------------------------------------+-------------------------------------

Comment (by AndreasVoellmy):

 One interesting thing I just noticed is that `taskset` makes a big
 difference. `excessive_system_2_HEAD` is the empty program compiled with
 HEAD, and `excessive_system_2_ONEACTIVELOOP` is the empty program compiled
 with only one IO manager loop being started. For `excessive_system_2_HEAD`,
 using `taskset` to restrict the threads to CPUs 0-9 (all on one NUMA node)
 makes a big difference:

 {{{
 $ time ./excessive_system_2_HEAD +RTS -N32

 real    0m0.046s
 user    0m0.020s
 sys     0m0.188s
 }}}

 {{{
 $ time taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32

 real    0m0.025s
 user    0m0.000s
 sys     0m0.056s
 }}}

 `excessive_system_2_ONEACTIVELOOP` also performs better with taskset, but
 the difference is much smaller: without taskset I see system time of
 30ms-60ms, and with taskset it is about 10-20ms.

 So maybe the explanation is that with 31 extra safe foreign calls (from
 calls to `epoll_wait`) we get about 31 extra OS threads; if we are unlucky
 and the OS schedules these on different NUMA nodes, we pay for cross-node
 cache traffic, and a spin lock then spins longer, accounting for the extra
 system time?
 Here is a run of `excessive_system_2_HEAD` under `perf stat`, first with
 taskset restricting it to the first NUMA node and then with no
 restriction:

 {{{
 $ perf_3.2.0-25 stat taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32

  Performance counter stats for 'taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32':

          25.019318 task-clock                #    0.900 CPUs utilized
                584 context-switches          #    0.023 M/sec
                218 CPU-migrations            #    0.009 M/sec
              1,248 page-faults               #    0.050 M/sec
         36,387,323 cycles                    #    1.454 GHz                  [62.17%]
         26,057,951 stalled-cycles-frontend   #   71.61% frontend cycles idle [98.77%]
         17,556,988 stalled-cycles-backend    #   48.25% backend cycles idle
         22,358,612 instructions              #    0.61  insns per cycle
                                              #    1.17  stalled cycles per insn
          4,300,427 branches                  #  171.884 M/sec
            169,950 branch-misses             #    3.95% of all branches      [42.87%]

        0.027811151 seconds time elapsed
 }}}

 {{{
 $ perf_3.2.0-25 stat taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32

  Performance counter stats for 'taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32':

         157.684408 task-clock                #    3.055 CPUs utilized
                588 context-switches          #    0.004 M/sec
                147 CPU-migrations            #    0.001 M/sec
              1,245 page-faults               #    0.008 M/sec
        267,826,389 cycles                    #    1.698 GHz                  [54.94%]
        227,802,954 stalled-cycles-frontend   #   85.06% frontend cycles idle [66.06%]
        184,677,238 stalled-cycles-backend    #   68.95% backend cycles idle  [86.81%]
         70,842,584 instructions              #    0.26  insns per cycle
                                              #    3.22  stalled cycles per insn [98.87%]
         16,579,295 branches                  #  105.142 M/sec                [96.70%]
            233,921 branch-misses             #    1.41% of all branches      [97.31%]

        0.051618204 seconds time elapsed
 }}}

 It is interesting that without the restriction we execute many more
 instructions and also retire fewer instructions per cycle.

 Also interesting, here are the top few entries from a perf record with
 taskset 0-9:

 {{{
   5.64%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
   5.47%  excessive_syste  [kernel.kallsyms]        [k] resched_task
   4.97%  excessive_syste  excessive_system_2_HEAD  [.] evacuate1
   4.84%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
   4.18%  excessive_syste  libc-2.15.so             [.] _int_malloc
   3.85%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
   3.81%  excessive_syste  [kernel.kallsyms]        [k] dequeue_entity
   3.48%  excessive_syste  [kernel.kallsyms]        [k] find_vma
   3.14%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
   3.12%  excessive_syste  [kernel.kallsyms]        [k] source_load
   3.12%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
   2.62%  excessive_syste  [kernel.kallsyms]        [k] update_cfs_rq_blocked_load
   2.33%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap_ctx
   2.13%  excessive_syste  [kernel.kallsyms]        [k] kmem_cache_alloc
   2.07%  excessive_syste  excessive_system_2_HEAD  [.] schedule
 }}}

 and with no restriction:

 {{{
  43.82%  excessive_syste  [kernel.kallsyms]        [k] __ticket_spin_lock
   6.44%  excessive_syste  [kernel.kallsyms]        [k] default_send_IPI_mask_sequence_phys
   4.68%  excessive_syste  [kernel.kallsyms]        [k] native_write_msr_safe
   3.45%  excessive_syste  [kernel.kallsyms]        [k] update_sd_lb_stats
   3.31%  excessive_syste  libc-2.15.so             [.] new_heap
   2.86%  excessive_syste  [kernel.kallsyms]        [k] idle_cpu
   2.40%  excessive_syste  [kernel.kallsyms]        [k] perf_event_mmap
   1.61%  excessive_syste  [kernel.kallsyms]        [k] __schedule
   1.60%  excessive_syste  excessive_system_2_HEAD  [.] s8Bc_info
   1.56%  excessive_syste  [kernel.kallsyms]        [k] inherit_event.isra.73
   1.50%  excessive_syste  [kernel.kallsyms]        [k] __init_waitqueue_head
   1.49%  excessive_syste  libc-2.15.so             [.] __clone
   1.45%  excessive_syste  [kernel.kallsyms]        [k] clear_page_c
   1.44%  excessive_syste  [kernel.kallsyms]        [k] try_to_wake_up
 }}}

 We are pretty clearly spending more time in a spin lock when we run
 across all the NUMA nodes.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8224#comment:17>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler