Does anyone know the reason for differeng measurements in profiling under -threaded?

Hi, I am asking if anyone has seen the following behaviour from GHC under -threaded and heavy use of File I/O, network I/O and STM use. After having run Combinatorrent for a while and then terminating it, we get the following output from the RTS in GC statistics: INIT time 0.00s ( 0.00s elapsed) MUT time 0.22s ( 60.06s elapsed) GC time 0.05s ( 0.07s elapsed) RP time 0.00s ( 0.00s elapsed) PROF time 0.00s ( 0.00s elapsed) EXIT time 0.00s ( 0.00s elapsed) Total time 0.26s ( 60.12s elapsed) %GC time 18.2% (0.1% elapsed) Alloc rate 431,635,651 bytes per MUT second Productivity 80.3% of total user, 0.4% of total elapsed Note that I am running it for exactly 60 seconds of wall-clock time. Yet, the profile information reports: total time = 11.20 secs (560 ticks @ 20 ms) total alloc = 78,696,568 bytes (excludes profiling overheads) COST CENTRE MODULE %time %alloc ticks bytes MAIN MAIN 80.4 0.2 450 20033 PeerControl Process.Peer 11.2 1.0 63 95272 Sender Process.Peer.Sender 6.2 0.0 35 387 sha1_digest Digest 1.8 1.4 10 136377 checkPiece FS 0.2 92.3 1 9083658 TorrentManager Process.TorrentManager 0.2 1.8 1 179587 which is clearly wrong. Total time is nowhere near 11.20 secs with 560 ticks here. I am almost certain that the 450 ticks attributed to MAIN is wrong as well. I have SCCs at each possible forkIO-point in the program. -prof -auto-all -caf-all doesn't make any difference either, I've tried. Has anyone seen this kind of problem and in that case, what was the fix? Barring nobody knows, the only thing I have left to try is to dive into the GHC RTS and figure out what is going on in there. As an aside: Before writing combinatorrent, I wrote etorrent, A BitTorrent client in Erlang. The two clients follow more or less the same basic structure and largely does the same ting. Combinatorrent has better memory usage right now, mostly due to representation choices: Haskell can do much better data representation than erlang. On the other hand, etorrent still beats out combinatorrent in the CPU department, using less CPU. And that somewhat perplexes me. This is why I am looking at performance profiles and trying to find the culprit. But with the above problem, I can't find it, since I don't trust the profiler anymore to give me the right results. All logic suggest that Haskell is ripping apart Erlang in all parts that has no communication, mostly due to being compiled and having a static type system (The erlang code is interpreted, even though they have a HiPE JIT). In practice however, my guess is that the binary handling in Erlang is quite effective, as is the communication between processes, so this might turn the tides in the favor of the erlang-RTS. As a fun thing, I tried downloading the same file in the test environment and running strace -cf on both programs. Here are the top winners. I don't think you can trust all the columns: Erlang: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 95.57 285.220815 2241 127300 10518 futex 2.16 6.453726 271 23828 9838 read 2.15 6.420512 279153 23 wait4 0.07 0.198851 3 64073 1 epoll_wait 0.02 0.048004 4000 12 poll 0.01 0.039131 2 16297 write 0.01 0.026917 0 88057 37842 recvfrom 0.01 0.018174 0 75803 36 epoll_ctl 0.00 0.008001 667 12 1 select 0.00 0.007133 1 5138 3 lseek 0.00 0.001931 64 30 6 execve 0.00 0.001912 0 5535 writev 0.00 0.001334 191 7 vfork 0.00 0.001209 0 2684 1995 open 0.00 0.001128 0 3094 2257 stat 0.00 0.000942 6 153 munmap Haskell/GHC: % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 32.20 15834.075616 176163 89883 10065 futex 21.72 10682.355809 176009 60692 26672 recvfrom 12.22 6009.064699 176023 34138 18 select 10.25 5038.622913 176009 28627 1 write 8.45 4153.652812 176010 23599 1 read 4.84 2380.702238 176009 13526 1 writev 3.59 1765.107536 175965 10031 rt_sigprocmask 1.95 961.185526 176009 5461 fstat 1.94 956.257032 176009 5433 lseek 1.80 883.829682 175922 5024 getrusage 0.85 419.781465 176009 2385 107 rt_sigreturn 0.04 19.713008 176009 112 fcntl 0.03 12.320630 176009 70 mmap 0.02 9.680495 176009 55 socket 0.02 9.504619 176011 54 53 connect 0.02 8.976459 176009 51 getsockopt 0.01 5.456279 176009 31 close Apart from Erlang using epoll, there is not too much difference from the perspective of the kernel, which I think is rather good :) -- J.

On Fri, Apr 23, 2010 at 12:14:29AM +0200, Jesper Louis Andersen wrote:
Hi,
I am asking if anyone has seen the following behaviour from GHC under -threaded and heavy use of File I/O, network I/O and STM use. After having run Combinatorrent for a while and then terminating it, we get the following output from the RTS in GC statistics:
[snip] I don't know GHC internals, but from the description of the program (and the fact that you don't use more than one core), I wonder why you use -threaded? How does the behaviour change if you use a single-threaded runtime? (I'm asking this as the single-threaded runtime can be much faster in certain use-cases.) regards, iustin

On Fri, Apr 23, 2010 at 2:19 AM, Iustin Pop
I don't know GHC internals, but from the description of the program (and the fact that you don't use more than one core), I wonder why you use -threaded?
A worthy question indeed. The reason is that I am using registerDelay and that requires -threaded. I could try rewriting the code to not use that. -- J.
participants (2)
-
Iustin Pop
-
Jesper Louis Andersen