[GHC] #8224: Excessive system time -- new IO manager problem?

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Keywords: IO Manager, | Operating System: Linux System Time | Type of failure: Runtime Architecture: x86_64 (amd64) | performance bug Difficulty: Unknown | Test Case: Blocked By: | Blocking: Related Tickets: | -------------------------------------+------------------------------------- This is an issue that came to light when testing the patches on #910. You can see some of the numbers there. Basically, recent GHC HEAD builds, running a large number of threads blocked on MVars will result in burning a lot of system time. The attached file provides a mediocre reproducer. With it, you can see that building with HEAD as of Aug 31st and running with `-RTS -N32` will result in around 200ms system time, whereas GHC 7.6.3 spends about 30ms in system time. This shows the disparity, but the result is not that egregious. A more noticeable example is on ticket #910, where when running on 31 threads, there is an 8 minutes of system time for 17 minutes of user time, and yet at one thread that system time drops to under two seconds! {{{ 1 thread: real 1m20.028s user 1m17.921s sys 0m1.768s 31 threads: real 1m27.445s user 17m0.314s sys 8m0.175s }}} It needs to be determined whether this system time is a result of the parallel compilation patches on #910, or whether it is a new problem with the runtime system, and in particular with the parallel IO manager. I am inclined to believe that compiling in parallel involves extra IO (repeatedly reading interface files?), but not eight minutes of it!! -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Changes (by AndreasVoellmy): * cc: andreas.voellmy@… (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:1 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): I also see the 200ms system time for excessive_system.hs when I run it with -N32 (and compiled with recent HEAD). Below is the first few entries after running perf record; it looks like it may be GC related. I don't see any IO manager related entries in the list. {{{ 39.02% excessive_syste excessive_system [.] gcWorkerThread 15.53% excessive_syste [kernel.kallsyms] [k] __ticket_spin_lock 6.10% excessive_syste [kernel.kallsyms] [k] default_send_IPI_mask_sequence_phys 5.88% excessive_syste excessive_system [.] allocBlock_sync 3.95% excessive_syste [kernel.kallsyms] [k] update_sd_lb_stats 2.76% excessive_syste excessive_system [.] steal_todo_block 2.18% excessive_syste [kernel.kallsyms] [k] native_write_msr_safe 1.65% excessive_syste libc-2.15.so [.] 0x7e21f 1.52% excessive_syste excessive_system [.] evacuate 0.75% excessive_syste [kernel.kallsyms] [k] dequeue_task_fair 0.74% excessive_syste [kernel.kallsyms] [k] do_mmap_pgoff 0.67% excessive_syste libc-2.15.so [.] __clone 0.67% excessive_syste [kernel.kallsyms] [k] memset 0.66% excessive_syste [kernel.kallsyms] [k] dequeue_entity 0.66% excessive_syste [kernel.kallsyms] [k] release_pages 0.62% excessive_syste [kernel.kallsyms] [k] find_next_bit 0.58% excessive_syste excessive_system [.] scavenge_until_all_done 0.56% excessive_syste [kernel.kallsyms] [k] __schedule 0.54% excessive_syste [kernel.kallsyms] [k] hrtimer_interrupt }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:2 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by thoughtpolice): I haven't tested this (don't have a 32 core machine,) but if the `gcWorkerThread` is causing problems, what does it look like when you turn off the multicore collector and load balancer? i.e. run with: {{{ $ ./excessive_system +RTS -N -qb -qg }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:3 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): Good idea. I still see about the same amount of system time. The {{{__ticket_spin_lock}}} is still at the top: {{{ Events: 1K cycles # # Overhead Command Shared Object Symbol # ........ ............... .................. ................................... # 34.41% excessive_syste [kernel.kallsyms] [k] __ticket_spin_lock 11.94% excessive_syste [kernel.kallsyms] [k] default_send_IPI_mask_sequence_phys 4.48% excessive_syste [kernel.kallsyms] [k] update_sd_lb_stats 3.39% excessive_syste [kernel.kallsyms] [k] native_write_msr_safe 1.86% excessive_syste [kernel.kallsyms] [k] __slab_free 1.85% excessive_syste [kernel.kallsyms] [k] idle_cpu 1.55% excessive_syste libc-2.15.so [.] 0x7e175 1.39% excessive_syste [kernel.kallsyms] [k] perf_event_alloc 1.36% excessive_syste [kernel.kallsyms] [k] dequeue_task_fair 1.21% :47802 [kernel.kallsyms] [k] page_remove_rmap 1.09% excessive_syste [kernel.kallsyms] [k] mmap_region 0.99% excessive_syste [kernel.kallsyms] [k] __init_waitqueue_head 0.94% excessive_syste [kernel.kallsyms] [k] memset 0.92% excessive_syste [kernel.kallsyms] [k] do_futex 0.90% excessive_syste [kernel.kallsyms] [k] perf_event_mmap_ctx 0.89% excessive_syste [kernel.kallsyms] [k] kmem_cache_alloc_trace 0.87% excessive_syste libpthread-2.15.so [.] __deallocate_stack 0.86% excessive_syste excessive_system [.] waitCondition 0.86% excessive_syste [kernel.kallsyms] [k] __rwsem_do_wake 0.85% excessive_syste [kernel.kallsyms] [k] get_page_from_freelist 0.84% excessive_syste [kernel.kallsyms] [k] find_vma 0.84% excessive_syste excessive_system [.] s6t3_info 0.83% excessive_syste [kernel.kallsyms] [k] sys_read 0.82% excessive_syste [kernel.kallsyms] [k] kfree 0.78% excessive_syste [kernel.kallsyms] [k] __raw_write_unlock_irq.constprop.31 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:4 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): Even with MVars removed and each thread simply waiting (see attached program) it takes 200ms system time with -N32. Here is a perf recording with call stack info; here is most of the first entry: {{{ # Events: 1K cycles # # Overhead Command Shared Object Symbol # ........ ............... .................. ................................... # 51.37% excessive_syste [kernel.kallsyms] [k] __ticket_spin_lock | --- __ticket_spin_lock | |--88.44%-- _raw_spin_lock_irq | | | |--89.17%-- get_signal_to_deliver | | do_signal | | do_notify_resume | | int_signal | | | | | |--94.68%-- __lll_lock_wait_private | | | | | --5.32%-- epoll_wait | | | | | |--67.64%-- 0x7ff06806d533 | | | 0x8758b4830558b48 | | | | | |--18.03%-- 0x7ff06806690b | | | 0x8758b4830558b48 | | | | | --14.33%-- 0x7ff068062b2b | | 0x8758b4830558b48 | | | |--8.08%-- acct_collect | | do_exit | | do_group_exit | | get_signal_to_deliver | | do_signal | | do_notify_resume | | int_signal | | __lll_lock_wait_private | | }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:5 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): Actually, I can simplify the program to just {{{ main = return () }}} and I get 200ms in kernel when run with -N32. I'm not sure how significant this 200ms time is - I guess it is time spent initializing or cleaning up capabilities (possibly including IO manager resources). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:6 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by rrnewton): Thanks for the perf data! Right, the 200ms isn't really bad. Can anyone reproduce the 8 *minutes* of system time I was seeing on the "ghc-parmake-gsoc" branch? I'd like to figure out whether that problem has anything to do with the patches on #910, or is unrelated. Was there a switch from some other kind of lock to a spin lock at some point? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:7 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by simonmar): 200ms spent doing unspecified "initialization" should definitely be investigated, that's a lot. I can't see anything useful in that call stack, it looks like it might be related to signals or epoll() but it's hard to tell anything else. What does the `+RTS -s` look like? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:8 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by simonmar): One reason for extra user/system time that could account for your results in #910 is that if there isn't enough parallelism during GC then you'll have 30 threads spinning looking for work, and doing the occasional sched_yield(). Spinning accounts for the user time, the sched_yield() accounts for the system time. Still, it does look like a lot, so it would be good to look at some !ThreadScope traces. That doesn't account for the 200ms in the empty program though, so there could be something else going on here. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:9 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Changes (by alpmestan): * cc: alpmestan@… (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:10 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Changes (by kazu-yamamoto): * cc: kazu@…, simonmar (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:11 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): I'm not sure about the 8 minutes problem, but the 200ms system time for the empty program does indeed seem to be related to the parallel IO manager changes. I just modified the IO manager to start only one epoll instance and I get about 50 ms system time, whereas with HEAD I get about 200ms still. I will upload the patch to make the IO manager use just one poller in case anyone else wants to try it. Some data... With just one poller, strace -C gives: {{{ % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000046 1 40 getrusage 0.00 0.000000 0 7 read 0.00 0.000000 0 1 write 0.00 0.000000 0 51 42 open 0.00 0.000000 0 9 close 0.00 0.000000 0 8 6 stat 0.00 0.000000 0 8 fstat 0.00 0.000000 0 90 mmap 0.00 0.000000 0 46 mprotect 0.00 0.000000 0 15 munmap 0.00 0.000000 0 16 brk 0.00 0.000000 0 12 rt_sigaction 0.00 0.000000 0 7 rt_sigprocmask 0.00 0.000000 0 1 rt_sigreturn 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 8 8 access 0.00 0.000000 0 1 pipe 0.00 0.000000 0 32 clone 0.00 0.000000 0 1 execve 0.00 0.000000 0 8 fcntl 0.00 0.000000 0 1 getrlimit 0.00 0.000000 0 1 getuid 0.00 0.000000 0 1 getgid 0.00 0.000000 0 1 geteuid 0.00 0.000000 0 1 getegid 0.00 0.000000 0 1 arch_prctl 0.00 0.000000 0 101 2 futex 0.00 0.000000 0 1 epoll_create 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 1 timer_create 0.00 0.000000 0 2 timer_settime 0.00 0.000000 0 1 timer_delete 0.00 0.000000 0 2 epoll_ctl 0.00 0.000000 0 1 set_robust_list 0.00 0.000000 0 1 eventfd2 ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000046 479 58 total }}} On the other hand, with N pollers (i.e. HEAD) strace -C gives: {{{ % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 86.21 0.002356 21 114 2 futex 8.67 0.000237 18 13 rt_sigaction 5.12 0.000140 1 263 fcntl 0.00 0.000000 0 8 read 0.00 0.000000 0 1 write 0.00 0.000000 0 52 42 open 0.00 0.000000 0 10 close 0.00 0.000000 0 8 6 stat 0.00 0.000000 0 9 fstat 0.00 0.000000 0 74 mmap 0.00 0.000000 0 46 mprotect 0.00 0.000000 0 10 munmap 0.00 0.000000 0 19 brk 0.00 0.000000 0 5 rt_sigprocmask 0.00 0.000000 0 1 rt_sigreturn 0.00 0.000000 0 1 ioctl 0.00 0.000000 0 8 8 access 0.00 0.000000 0 33 pipe 0.00 0.000000 0 32 clone 0.00 0.000000 0 1 execve 0.00 0.000000 0 1 getrlimit 0.00 0.000000 0 1 getuid 0.00 0.000000 0 1 getgid 0.00 0.000000 0 1 geteuid 0.00 0.000000 0 1 getegid 0.00 0.000000 0 1 arch_prctl 0.00 0.000000 0 32 epoll_create 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 1 timer_create 0.00 0.000000 0 2 timer_settime 0.00 0.000000 0 1 timer_delete 0.00 0.000000 0 4 clock_gettime 0.00 0.000000 0 64 epoll_ctl 0.00 0.000000 0 1 set_robust_list 0.00 0.000000 0 33 eventfd2 ------ ----------- ----------- --------- --------- ---------------- 100.00 0.002733 853 58 total }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:12 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): Replying to [comment:8 simonmar]:
200ms spent doing unspecified "initialization" should definitely be investigated, that's a lot. I can't see anything useful in that call stack, it looks like it might be related to signals or epoll() but it's hard to tell anything else.
What does the `+RTS -s` look like?
With my "one poller" hack: {{{ 48,624 bytes allocated in the heap 40 bytes copied during GC 38,592 bytes maximum residency (1 sample(s)) 268,608 bytes maximum slop 32 MB total memory in use (15 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0003s 0.0003s Parallel GC work balance: -nan (0 / 0, ideal 32) MUT time (elapsed) GC time (elapsed) Task 0 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 1 (worker) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 2 (bound) : 0.00s ( 0.00s) 0.00s ( 0.00s) Task 3 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 4 (worker) : 0.02s ( 0.01s) 0.00s ( 0.00s) Task 5 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 6 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 7 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 8 (worker) : 0.02s ( 0.01s) 0.00s ( 0.00s) Task 9 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 10 (worker) : 0.02s ( 0.01s) 0.00s ( 0.00s) Task 11 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 12 (worker) : 0.02s ( 0.01s) 0.00s ( 0.00s) Task 13 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 14 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 15 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 16 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 17 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 18 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 19 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 20 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 21 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 22 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 23 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 24 (worker) : 0.00s ( 0.01s) 0.00s ( 0.00s) Task 25 (worker) : 0.00s ( 0.01s) 0.00s ( 0.00s) Task 26 (worker) : 0.00s ( 0.01s) 0.00s ( 0.00s) Task 27 (worker) : 0.00s ( 0.01s) 0.00s ( 0.00s) Task 28 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 29 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 30 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 31 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 32 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) Task 33 (worker) : 0.01s ( 0.01s) 0.00s ( 0.00s) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.00s ( 0.01s elapsed) MUT time 0.00s ( 0.00s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.01s ( 0.00s elapsed) Total time 0.02s ( 0.01s elapsed) Alloc rate 2,431,200 bytes per MUT second Productivity 80.0% of total user, 132.3% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].sync: 0 gen[1].sync: 0 }}} With HEAD: {{{ 543,096 bytes allocated in the heap 64 bytes copied during GC 412,256 bytes maximum residency (1 sample(s)) 279,968 bytes maximum slop 18 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 0 colls, 0 par 0.00s 0.00s 0.0000s 0.0000s Gen 1 1 colls, 0 par 0.00s 0.00s 0.0009s 0.0009s Parallel GC work balance: -nan% (serial 0%, perfect 100%) TASKS: 65 (1 bound, 64 peak workers (64 total), using -N32) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 0.04s ( 0.03s elapsed) MUT time 0.00s ( -0.00s elapsed) GC time 0.00s ( 0.00s elapsed) EXIT time 0.01s ( 0.00s elapsed) Total time 0.05s ( 0.04s elapsed) Alloc rate 0 bytes per MUT second Productivity 17.4% of total user, 24.7% of total elapsed gc_alloc_block_sync: 0 whitehole_spin: 0 gen[0].sync: 0 gen[1].sync: 0 }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:13 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): I did some more experiments, and I now think it has to do with making "safe" foreign calls from all of the 32 threads. I modified the IO manager so that all of the polling threads (and their epoll backends) are created. Only one of the them is used (on cap 0) is used to register files. I then modified the poll loops so that only the one on cap 0 is running the normal polling loop and the others simply return right away. With this setup, the performance is good (about 40-50 ms for -N32). However, if I have all poll loops (except the one normal poll loop for cap 0) simply make a "safe" call to `usleep(10)` then the performance goes back to about 200ms. I'll post a patch here so that others can see. So it looks to be related to the cost of doing safe foreign calls. With the parallel IO manager each of the N pollers will make a safe foreign call when there is no work to do. I'm not sure that this will have much of an impact on most programs, but for the empty program it represents a large part of the work. I'm also not sure this is related to the 8 minute system time mentioned in the original report. rnewton, can you provide a recipe for how to recreate this? Is the 200 ms startup time for 32 cores an issue? At 16 cores it is about 50-60ms and at 8 cores about 10-20ms. One possible resolution (besides doing nothing or making safe foreign calls cheaper) would be to provide an RTS option to use a specific number of IO managers, or at least to differentiate between using 1 or using N. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:14 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by simonmar): The first safe call on each Capability will cause the creation of a new OS thread, which probably accounts for some of the cost. From the strace profile it looks like there is some contention too. It might be useful to do some mutex profiling to see what lock is being contended for - perhaps it is `all_tasks_mutex`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:15 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): I ran with mutrace and did see any obvious explanations. Is it possible that by having more safe foreign calls RTS threads spend more time in some spin locks? If so, which spin locks would those be? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:16 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.8.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Comment (by AndreasVoellmy): One interesting thing I just noticed is that `taskset` makes a big difference. `excessive_system_2_HEAD` is the empty program compiled with HEAD and `excessive_system_2_ONEACTIVELOOP` is the empty program compiled with only one IO manager loop being started. We see that for `excessive_system_2_HEAD` if we use `taskset` to restrict the threads to run on CPUs 0-9 (all on one NUMA node) we get a big difference: {{{ $ time ./excessive_system_2_HEAD +RTS -N32 real 0m0.046s user 0m0.020s sys 0m0.188s }}} {{{ $ time taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32 real 0m0.025s user 0m0.000s sys 0m0.056s }}} `excessive_system_2_ONEACTIVELOOP` also performs better with taskset but the difference is much smaller. Without taskset I see system tim from 30ms-60ms and with taskset it is about 10-20ms. So maybe the explanation is that with 31 extra safe foreign calls (from calls to `epoll_wait`) we get about 31 extra OS threads and if we are unlucky and the OS schedules these on different NUMA nodes we will get cache issues slowing down. And finally a spin lock burns longer accounting for more system time? Here is a run of `excessive_system_2_HEAD` using taskset to restrict to first NUMA node or no restrictions with perf stat: {{{ $ perf_3.2.0-25 stat taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32 Performance counter stats for 'taskset -c 0-9 ./excessive_system_2_HEAD +RTS -N32': 25.019318 task-clock # 0.900 CPUs utilized 584 context-switches # 0.023 M/sec 218 CPU-migrations # 0.009 M/sec 1,248 page-faults # 0.050 M/sec 36,387,323 cycles # 1.454 GHz [62.17%] 26,057,951 stalled-cycles-frontend # 71.61% frontend cycles idle [98.77%] 17,556,988 stalled-cycles-backend # 48.25% backend cycles idle 22,358,612 instructions # 0.61 insns per cycle # 1.17 stalled cycles per insn 4,300,427 branches # 171.884 M/sec 169,950 branch-misses # 3.95% of all branches [42.87%] 0.027811151 seconds time elapsed }}} {{{ $ perf_3.2.0-25 stat taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32 Performance counter stats for 'taskset -c 0-79 ./excessive_system_2_HEAD +RTS -N32': 157.684408 task-clock # 3.055 CPUs utilized 588 context-switches # 0.004 M/sec 147 CPU-migrations # 0.001 M/sec 1,245 page-faults # 0.008 M/sec 267,826,389 cycles # 1.698 GHz [54.94%] 227,802,954 stalled-cycles-frontend # 85.06% frontend cycles idle [66.06%] 184,677,238 stalled-cycles-backend # 68.95% backend cycles idle [86.81%] 70,842,584 instructions # 0.26 insns per cycle # 3.22 stalled cycles per insn [98.87%] 16,579,295 branches # 105.142 M/sec [96.70%] 233,921 branch-misses # 1.41% of all branches [97.31%] 0.051618204 seconds time elapsed }}} It is interesting that without restriction we get many more instructions executed and also fewer instructions per cycle. Also interesting, here are the top few entries perf recording with taskset 0-9 {{{ 5.64% excessive_syste [kernel.kallsyms] [k] update_sd_lb_stats 5.47% excessive_syste [kernel.kallsyms] [k] resched_task 4.97% excessive_syste excessive_system_2_HEAD [.] evacuate1 4.84% excessive_syste [kernel.kallsyms] [k] __ticket_spin_lock 4.18% excessive_syste libc-2.15.so [.] _int_malloc 3.85% excessive_syste [kernel.kallsyms] [k] __init_waitqueue_head 3.81% excessive_syste [kernel.kallsyms] [k] dequeue_entity 3.48% excessive_syste [kernel.kallsyms] [k] find_vma 3.14% excessive_syste [kernel.kallsyms] [k] native_write_msr_safe 3.12% excessive_syste [kernel.kallsyms] [k] source_load 3.12% excessive_syste [kernel.kallsyms] [k] idle_cpu 2.62% excessive_syste [kernel.kallsyms] [k] update_cfs_rq_blocked_load 2.33% excessive_syste [kernel.kallsyms] [k] perf_event_mmap_ctx 2.13% excessive_syste [kernel.kallsyms] [k] kmem_cache_alloc 2.07% excessive_syste excessive_system_2_HEAD [.] schedule }}} and with no restriction {{{ 43.82% excessive_syste [kernel.kallsyms] [k] __ticket_spin_lock 6.44% excessive_syste [kernel.kallsyms] [k] default_send_IPI_mask_sequence_phys 4.68% excessive_syste [kernel.kallsyms] [k] native_write_msr_safe 3.45% excessive_syste [kernel.kallsyms] [k] update_sd_lb_stats 3.31% excessive_syste libc-2.15.so [.] new_heap 2.86% excessive_syste [kernel.kallsyms] [k] idle_cpu 2.40% excessive_syste [kernel.kallsyms] [k] perf_event_mmap 1.61% excessive_syste [kernel.kallsyms] [k] __schedule 1.60% excessive_syste excessive_system_2_HEAD [.] s8Bc_info 1.56% excessive_syste [kernel.kallsyms] [k] inherit_event.isra.73 1.50% excessive_syste [kernel.kallsyms] [k] __init_waitqueue_head 1.49% excessive_syste libc-2.15.so [.] __clone 1.45% excessive_syste [kernel.kallsyms] [k] clear_page_c 1.44% excessive_syste [kernel.kallsyms] [k] try_to_wake_up }}} Pretty clearly spending more time in a spin lock when we run on all NUMA nodes. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:17 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, Operating System: Linux | System Time Type of failure: Runtime | Architecture: x86_64 (amd64) performance bug | Difficulty: Unknown Test Case: | Blocked By: Blocking: | Related Tickets: -------------------------------------+------------------------------------- Changes (by thoughtpolice): * milestone: 7.8.3 => 7.10.1 Comment: Whatever happens, I doubt this will be fixed in time for 7.8.3, but someone can bump it back if so. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:18 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.10.1 Component: Runtime | Version: 7.7 System | Keywords: IO Manager, Resolution: | System Time Operating System: Linux | Architecture: x86_64 (amd64) Type of failure: Runtime | Difficulty: Unknown performance bug | Blocked By: Test Case: | Related Tickets: Blocking: | Differential Revisions: | -------------------------------------+------------------------------------- Changes (by nh2): * cc: nh2 (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:19 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Operating System: Linux | Manager, System Time Type of failure: Runtime | Architecture: x86_64 performance bug | (amd64) Blocked By: | Test Case: Related Tickets: | Blocking: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thoughtpolice): * milestone: 7.10.1 => 7.12.1 Comment: (Pushing back to 7.12 as this ticket requires more investigation for a fix.) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:20 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Operating System: Linux | Manager, System Time Type of failure: Runtime | Architecture: x86_64 performance bug | (amd64) Blocked By: | Test Case: Related Tickets: | Blocking: | Differential Revisions: -------------------------------------+------------------------------------- Changes (by gidyn): * cc: gidyn (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:21 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 7.12.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO | Manager, System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Revisions: -------------------------------------+------------------------------------- Changes (by thomie): * related: => #9221 -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:22 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * milestone: 8.0.1 => 8.2.1 Comment: Sadly this won't be resolved for 8.0.1. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:24 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by slyfox): * cc: slyfox (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:25 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by hjarvi): upvote -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:26 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by hjarvi): * cc: hjarvi (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:27 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Replying to [comment:26 hjarvi]:
upvote
Do you have an example of an application which is particularly badly affected by this issue? It would be quite helpful if you could provide a testcase if so. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:28 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem?
-------------------------------------+-------------------------------------
Reporter: rrnewton | Owner:
Type: bug | Status: new
Priority: high | Milestone: 8.2.1
Component: Runtime System | Version: 7.7
Resolution: | Keywords: IO Manager,
| System Time
Operating System: Linux | Architecture: x86_64
Type of failure: Runtime | (amd64)
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #9221 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by bgamari):
I tried reproducing this with cf5eec3eaa638719fd9768c20271f8aa2b2eac1f on
a 2-socket, 24-core machine I have access to. I was unable to see anything
like the system time numbers like reported in comment:17,
{{{#!bash
$ cat >T8224.hs <

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alkar): Hi, as a developer on 48-core machine, I upvote the issue. I can reproduce it with many programs, e.g. this trivial one: {{{ #!bash $ cat x.hs main = getContents >>= print . length $ ghc -O2 -rtsopts -threaded x $ time head -c10MB /dev/zero | ./x 10000000 real 0m0.182s user 0m0.180s sys 0m0.012s $ time head -c10MB /dev/zero | ./x +RTS -N 10000000 real 0m1.782s user 0m45.183s sys 0m10.305s }}} -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:30 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by slyfox): Replying to [comment:30 alkar]:
Hi, as a developer on 48-core machine, I upvote the issue. I can reproduce it with many programs, e.g. this trivial one:
{{{ #!bash $ cat x.hs main = getContents >>= print . length $ ghc -O2 -rtsopts -threaded x $ time head -c10MB /dev/zero | ./x 10000000
real 0m0.182s user 0m0.180s sys 0m0.012s $ time head -c10MB /dev/zero | ./x +RTS -N 10000000
real 0m1.782s user 0m45.183s sys 0m10.305s }}}
The problem can be tamed with -A option (somewhere around 200M is optimal in my case), but still it's pretty bad:
{{{ #!bash $ time head -c10MB /dev/zero | ./x +RTS -N -A200M 10000000
real 0m0.712s user 0m1.144s sys 0m0.412s }}}
Can you share a bit more info on your setup? - GHC version you tried it on - where does '''+RTS -s''' claim to attribute the time to - '''lstopo-no-graphics''' output - '''numactl''' -H output A few tweaks happened in #9221 since latest GHC-8.0.1 release. It has a few more knobs to tune like '''-qn<value>''' (number of GC threads), '''-n<value>''' (nursery chunks), '''-qb0''' (work-stealing nursery scan mode). -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:31 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by alkar): * Attachment "numactl.txt" added. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by alkar): * Attachment "lstopo.txt" added. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alkar): In previous post I used ghc-7.10.1 by mistake. With 8.0.1 (`stack ghc -- -O2 -rtsopts -threaded x`) single core performance is same (around `0.18s`), but with `+RTS -N` I get better results: {{{ real 0m0.843s user 0m22.149s sys 0m2.256s }}} Even better with `-A200M`: {{{ real 0m0.793s user 0m1.200s sys 0m0.692s }}} {{{ $ head -c10MB /dev/zero | ./x +RTS -s -N > /dev/null 406,012,416 bytes allocated in the heap 30,457,408 bytes copied during GC 674,776 bytes maximum residency (2 sample(s)) 665,248 bytes maximum slop 27 MB total memory in use (0 MB lost due to fragmentation) Tot time (elapsed) Avg pause Max pause Gen 0 775 colls, 775 par 16.945s 0.429s 0.0006s 0.0302s Gen 1 2 colls, 1 par 0.052s 0.002s 0.0012s 0.0013s Parallel GC work balance: 0.14% (serial 0%, perfect 100%) TASKS: 98 (1 bound, 97 peak workers (97 total), using -N48) SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled) INIT time 2.660s ( 0.091s elapsed) MUT time 11.065s ( 0.744s elapsed) GC time 16.997s ( 0.432s elapsed) EXIT time 0.028s ( 0.003s elapsed) Total time 30.750s ( 1.270s elapsed) Alloc rate 36,694,519 bytes per MUT second Productivity 36.1% of total user, 873.3% of total elapsed gc_alloc_block_sync: 38774 whitehole_spin: 0 gen[0].sync: 408 gen[1].sync: 130 }}} I attached outputs of `lstopo` (I don't have -no-graphics variant there) and `numactl -H`. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:32 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): In reply to comment:31
Hi, as a developer on 48-core machine, I upvote the issue. I can reproduce it with many programs, e.g. this trivial one:
It is far from clear to me that we should expect the program you cited to run well with `-N`. It allocates heavily and `-N` will use parallel GC on a heavily imbalanced heap. Looking at the eventlog confirms that nearly all of the cores are simply waiting for GC. As expected disabling parallel GC with `+RTS -qg` drastically reduces system time due to reduced synchronization overhead. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:33 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alkar): Replying to [comment:33 bgamari]:
As expected disabling parallel GC with `+RTS -qg` drastically reduces system time due to reduced synchronization overhead.
Wow! Just tried `+RTS -N -qg` and brings performance back in the sane world! Now it's something like {{{ real 0m0.269s user 0m0.232s sys 0m0.104s }}} That is, 30-40% slower than without `-N`. A real-life example of a program where I'd want a long single thread computation with `-N` turned on is a faster analog of `fgrep -f patterns.txt input1 ... inputN`: 1. Read all lines from `patterns.txt` into a set. 2. In parallel for each input file, `filter (\str -> member str set) . lines` -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:34 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

A real-life example of a program where I'd want a long single thread computation with `-N` turned on is a faster analog of `fgrep -f
#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Replying to [comment:34 alkar]: patterns.txt input1 ... inputN`:
1. Read all lines from `patterns.txt` into a set. 2. In parallel for each input file, `filter (\str -> member str set) .
lines` I'm afraid I don't understand; this would be a multi-threaded computation, would it not? In general there are very real costs to synchronization, especially across cores on large machines. The performance hit that you report above sounds a bit high and it would be nice to bring it down. Perhaps you could do some digging to see where this time is going? -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:35 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by alkar): Here is a difference between outputs for `+RTS -s` and `+RTS -N -qg -s`: https://i.imgur.com/8S5brky.png In my grep-like example, the first step is single-threaded (in the sense that I don't do any forkIO):
1. Read all lines from patterns.txt into a set.
And the pattern file can be quite large (this is why I'm doing it in a single process instead of multiprocessing), so the less penalty I get by using `-N` the better. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:36 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Comment (by bgamari): Replying to [comment:36 alkar]:
1. Read all lines from patterns.txt into a set.
And the pattern file can be quite large (this is why I'm doing it in a single process instead of multiprocessing), so the less penalty I get by using `-N` the better.
Keep in mind that this doesn't necessarily mean that you need a single process. With the coming 8.2 release it will be possible to share large data structures like your patterns set in a single `mmap`'d shared memory block using compact regions. Given the challenges of ensuring good scaling with our garbage collector you may want to look into this approach. That being said, it would be great if someone (perhaps you!) could look into why the system time numbers you report are so high. 30% runtime hit on a program which uses only one capability and no parallel GC seems quite high. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:37 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: new Priority: high | Milestone: 8.2.1 Component: Runtime System | Version: 7.7 Resolution: | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by jystic): * cc: jystic (added) -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:38 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler

#8224: Excessive system time -- new IO manager problem? -------------------------------------+------------------------------------- Reporter: rrnewton | Owner: Type: bug | Status: closed Priority: high | Milestone: Component: Runtime System | Version: 7.7 Resolution: worksforme | Keywords: IO Manager, | System Time Operating System: Linux | Architecture: x86_64 Type of failure: Runtime | (amd64) performance bug | Test Case: Blocked By: | Blocking: Related Tickets: #9221 | Differential Rev(s): Wiki Page: | -------------------------------------+------------------------------------- Changes (by bgamari): * status: new => closed * resolution: => worksforme * milestone: 8.2.1 => Comment: I'm closing this since this issue has turned into a bit of a mess of various unrelated scaling issues, none of which are IO manager-related. @alkar, if you get a chance please do open new ticket describing your particular issue and how to reproduce it. -- Ticket URL: http://ghc.haskell.org/trac/ghc/ticket/8224#comment:39 GHC http://www.haskell.org/ghc/ The Glasgow Haskell Compiler
participants (1)
-
GHC