ghci and ghc -threaded broken with pipes & forking

Hi,

I've been hitting my head against a wall for the past couple of days trying to figure out why my shell-like pipeline code kept hanging. I found fd leakage (file descriptors not being closed), which disrupts EOF detection and can lead to deadlocks, but I just couldn't find the problem.

I finally tried compiling my test with ghc instead of running it in ghci. And poof, it worked fine the first time.

I tried asking on #haskell, and got the suggestion that ghci uses -threaded. I tried compiling my test program with ghc -threaded, and again, same deadlock. My program never calls forkIO or forkOS or any other threading code.

You can see my test case with:

darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
ghc -fglasgow-exts --make -o test2 test2.hs

That'll run fine. If you add -threaded, it will hang.

Ideas?

Thanks,

-- John

At Wed, 28 Feb 2007 11:15:04 -0600, John Goerzen wrote:
You can see my test case with:
darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
ghc -fglasgow-exts --make -o test2 test2.hs
I get an error when I use that darcs command-line, and test2.hs does not appear to be in the directory afterwards. Am I doing something wrong?

$ darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
Copying patch 54 of 54... done!
Applying patch 54 of 54... done.
darcs: Couldn't find a tag matching "tag-name glasgow ml"

$ dpkg -l darcs
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name    Version         Description
+++-=======-===============-=====================================
ii  darcs   1.0.9~rc1-0.1   an advanced revision control system

j.

On Wed, Feb 28, 2007 at 10:40:18AM -0800, Jeremy Shaw wrote:
At Wed, 28 Feb 2007 11:15:04 -0600, John Goerzen wrote:
You can see my test case with:
darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
ghc -fglasgow-exts --make -o test2 test2.hs
I get an error when I use that darcs command-line, and test2.hs does not appear to be in the directory afterwards. Am I doing something wrong?
Oops. I hadn't pushed out that tag yet. It's there now.

Hello,

Your first problem is just a line buffering issue. You need to explicitly set the line buffering inside the child processes:

redir fstdin stdInput
hSetBuffering stdin LineBuffering
redir fstdout stdOutput
hSetBuffering stdout LineBuffering

This is because the forked child process is not hooked up to a tty, so GHC decides that block buffering would be a good choice.

Once you fix that you will encounter some new race-condition-type bugs. These bugs will show up even *without* the -threaded flag.

hth,
j.

At Wed, 28 Feb 2007 13:29:17 -0600, John Goerzen wrote:
On Wed, Feb 28, 2007 at 10:40:18AM -0800, Jeremy Shaw wrote:
At Wed, 28 Feb 2007 11:15:04 -0600, John Goerzen wrote:
You can see my test case with:
darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
ghc -fglasgow-exts --make -o test2 test2.hs
I get an error when I use that darcs command-line, and test2.hs does not appear to be in the directory afterwards. Am I doing something wrong?
Oops. I hadn't pushed out that tag yet. It's there now.

On Wed, Feb 28, 2007 at 01:06:25PM -0800, Jeremy Shaw wrote:
Hello,
Your first problem is just a line buffering issue. You need to explicitly set the line buffer inside the child processes:
redir fstdin stdInput
hSetBuffering stdin LineBuffering
redir fstdout stdOutput
hSetBuffering stdout LineBuffering
Hi Jeremy, First, many thanks for looking into this. That doesn't make sense to me, since these aren't used for anything in Haskell prior to the call to executeFile. The Haskell buffers should just disappear, since the Haskell process disappears, right?
Once you fix that you will encounter some new race condition type bugs. These bugs will show up, even *without* the -threaded flag.
Hrm, could you point out a couple? I'm developing as many unit tests as I can, and haven't had any problem running them under a non-threaded GHC.

I am aware that the debug statements can write over each other in some cases, or even get inserted into the pipeline in a few situations, but these are only used in exceptional cases and will generally be removed from the code before too long.

Other than that, I think I've got it OK. My unit tests are covering singleton commands and 2-4 commands in a pipe, including various permutations of calls to external programs and Haskell functions.

Hello,

Hrm, setting the LineBuffering mode had the side-effect of setting the underlying file descriptor to non-blocking mode. When the executeFile process took over, it would die with an error about 'standard input temporarily unavailable'. So, ignore that.

j.

At Wed, 28 Feb 2007 15:23:53 -0600, John Goerzen wrote:
On Wed, Feb 28, 2007 at 01:06:25PM -0800, Jeremy Shaw wrote:
Hello,
Your first problem is just a line buffering issue. You need to explicitly set the line buffer inside the child processes:
redir fstdin stdInput
hSetBuffering stdin LineBuffering
redir fstdout stdOutput
hSetBuffering stdout LineBuffering
Hi Jeremy,
First, many thanks for looking into this.
That doesn't make sense to me, since these aren't used for anything in Haskell prior to the call to executeFile. The Haskell buffers should just disappear, since the Haskell process disappears, right?
Once you fix that you will encounter some new race condition type bugs. These bugs will show up, even *without* the -threaded flag.
Hrm, could you point out a couple? I'm developing as many unit tests as I can, and haven't had any problem running them under a non-threaded GHC.
I am aware that the debug statements can write over each other in some cases, or even get inserted into the pipeline in a few situations, but these are only used in exceptional cases and will generally be removed from the code before too long.
Other than that, I think I've got it OK. My unit tests are covering singleton commands and 2-4 commands in a pipe, including various permutations of calls to external programs and Haskell functions.

Hello,

Here is a simplified example that seems to exhibit the same behaviour, unless I screwed up:

--->
module Main where

import System.Posix
import System.IO
import System.Exit

main = do
  putStrLn "running..."
  (stdinr, stdinw) <- createPipe
  (stdoutr, stdoutw) <- createPipe
  pid <- forkProcess $ do
    hw <- fdToHandle stdoutw
    hr <- fdToHandle stdinr
    closeFd stdinw
    hGetContents hr >>= hPutStr hw
    hClose hr
    hClose hw
    exitImmediately ExitSuccess
  closeFd stdoutw
  closeFd stdinw
  hr2 <- fdToHandle stdoutr
  hGetContents hr2 >>= putStr
  getProcessStatus True False pid >>= print
<---

Compiling with:

ghc --make -no-recomp test3.hs -o test3 && ./test3

works. But compiling with:

ghc --make -no-recomp -threaded test3.hs -o test3 && ./test3

results in a hang. If you comment out the "hGetContents hr >>=" and change 'hPutStr hw' to 'hPutStr hw "hi"', then it seems to work ok.

As you suggested, it seems that hGetContents is never seeing the EOF when -threaded is enabled. I think it gets 'Resource temporarily unavailable' instead, so it keeps retrying.

Assuming I have recreated the same bug, we at least have a simpler test case now...

j.

At Wed, 28 Feb 2007 11:15:04 -0600, John Goerzen wrote:
Hi,
I've been hitting my head against a wall for the past couple of days trying to figure out why my shell-like pipeline code kept hanging. I found fd leakage (file descriptors not being closed), which disrupts EOF detection and can lead to deadlocks. I just couldn't find the problem.
I finally tried compiling my test with ghc instead of running it in ghci.
And poof, it worked fine the first time.
I tried asking on #haskell, and got the suggestion that ghci uses -threaded. I tried compiling my test program with ghc -threaded, and again, same deadlock. My program never calls forkIO or forkOS or any other threading code.
You can see my test case with:
darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
ghc -fglasgow-exts --make -o test2 test2.hs
That'll run fine. If you add -threaded, it will hang.
Ideas?
Thanks,
-- John

Ok, what happens here is that in the forked process there is only a single thread: the runtime kills all the other threads (as advertised). Unfortunately this includes the I/O manager thread, so as soon as you do some I/O in the forked process, you block.

It might be possible to fix this, but not easily I'm afraid, because the I/O manager doesn't currently have a way to restart after it's been killed. We could implement that, though. I'll create a bug report.

On a more general note, forkProcess is known to be hairy - simply the fact that it kills all the other threads in the system in the forked process means that there's a good supply of means to shoot yourself in the foot, even accidentally. John - perhaps there's another way to achieve what you want?

Cheers,
Simon

Jeremy Shaw wrote:
Hello,
Here is a simplified example that seems to exhibit the same behaviour, unless I screwed up:
--->
module Main where
import System.Posix
import System.IO
import System.Exit

main = do
  putStrLn "running..."
  (stdinr, stdinw) <- createPipe
  (stdoutr, stdoutw) <- createPipe
  pid <- forkProcess $ do
    hw <- fdToHandle stdoutw
    hr <- fdToHandle stdinr
    closeFd stdinw
    hGetContents hr >>= hPutStr hw
    hClose hr
    hClose hw
    exitImmediately ExitSuccess
  closeFd stdoutw
  closeFd stdinw
  hr2 <- fdToHandle stdoutr
  hGetContents hr2 >>= putStr
  getProcessStatus True False pid >>= print
<---
Compiling with:
ghc --make -no-recomp test3.hs -o test3 && ./test3
works. But compiling with:
ghc --make -no-recomp -threaded test3.hs -o test3 && ./test3
results in a hang. If you comment out the "hGetContents hr >>=" and change 'hPutStr hw' to 'hPutStr hw "hi"', then it seems to work ok.
As you suggested, it seems that hGetContents is not ever seeing the EOF when -threaded is enabled. I think it gets 'Resource temporarily unavailable' instead. So, it keeps retrying.
Assuming I have recreated the same bug, we at least have a simpler test case now...
j.
At Wed, 28 Feb 2007 11:15:04 -0600, John Goerzen wrote:
Hi,
I've been hitting my head against a wall for the past couple of days trying to figure out why my shell-like pipeline code kept hanging. I found fd leakage (file descriptors not being closed), which disrupts EOF detection and can lead to deadlocks. I just couldn't find the problem.
I finally tried compiling my test with ghc instead of running it in ghci.
And poof, it worked fine the first time.
I tried asking on #haskell, and got the suggestion that ghci uses -threaded. I tried compiling my test program with ghc -threaded, and again, same deadlock. My program never calls forkIO or forkOS or any other threading code.
You can see my test case with:
darcs get '--tag=glasgow ml' http://darcs.complete.org/hsh
ghc -fglasgow-exts --make -o test2 test2.hs
That'll run fine. If you add -threaded, it will hang.
Ideas?
Thanks,
-- John

On Thu, Mar 01, 2007 at 03:06:22PM +0000, Simon Marlow wrote:
Ok, what happens here is that in the forked process there is only a single thread, the runtime kills all the other threads (as advertised). Unfortunately this includes the I/O manager thread, so as soon as you do some I/O in the forked process, you block.
Could it just revert to the nonthreaded IO model, or is that not within the scope of what's easily achievable with the threaded RTS?
On a more general note, forkProcess is known to be hairy - simply the fact that it kills all the other threads in the system in the forked process means that there's a good supply of means to shoot yourself in the foot, even accidentally. John - perhaps there's another way to achieve what you want?
Right. Part of this problem may be one of documentation, and part of it rests with ghci.

I have no need for threads in this program. And, in fact, as you said, threads are known to be hazardous when used in conjunction with fork(). I have no interest in combining the two. The mechanics of signal propagation, file descriptor closing, etc. all get complicated.

But it seems like there is not much choice with ghci. It appears to be built with the threaded RTS by default, and uses threads even though I never try to use threads with it. And there seems to be no way to turn it off.

Between that and the lack of support for forkProcess in Hugs, this renders anything that needs to fork and then do I/O as being usable only in GHC-compiled code. Which is sub-optimal, but livable anyway.

Also, why does hGetContents not work, but hPutStr does? If the I/O manager is dead, how does some I/O still work?

-- John

John Goerzen wrote:
On Thu, Mar 01, 2007 at 03:06:22PM +0000, Simon Marlow wrote:
Ok, what happens here is that in the forked process there is only a single thread, the runtime kills all the other threads (as advertised). Unfortunately this includes the I/O manager thread, so as soon as you do some I/O in the forked process, you block.
Could it just revert to the nonthreaded IO model, or is that not within the scope of what's easily achievable with the threaded RTS?
The non-threaded I/O system just isn't compiled into the threaded RTS at all. We used to use it in the threaded RTS before we switched to the I/O manager thread, but as I recall it was a rich source of bugs; the I/O manager thread is much simpler, being in Haskell.
On a more general note, forkProcess is known to be hairy - simply the fact that it kills all the other threads in the system in the forked process means that there's a good supply of means to shoot yourself in the foot, even accidentally. John - perhaps there's another way to achieve what you want?
Right. Part of this problem may be one of documentation, and part of it rests with ghci.
I have no need for threads in this program. And, in fact, as you said, threads are known to be hazardous when used in conjunction with fork(). I have no interest in combining the two. The mechanics of signal propagation, file descriptor closing, etc. all get complicated.
But it seems like there is not much choice with ghci. It appears to be built with the threaded RTS by default, and uses threads even though I never try to use threads with it. And there seems to be no way to turn it off.
The problem is that the choice between -threaded and non-threaded is made at link time, so we have to make that choice when we link the GHCi binary.

In fact you should think of the non-threaded RTS as deprecated. It isn't Haskell'-compliant, for one thing (assuming that Haskell' will probably require non-blocking foreign calls). I'm hesitant to actually deprecate it, for a few reasons: the threaded RTS is so much more complicated, it might have some adverse performance implications, and there are still people who want to run everything in a single OS thread, for whatever reason. But having multiple variants of the RTS is a maintenance and testing headache.
Between that and the lack of support for forkProcess in Hugs, this renders anything that needs to fork and then do I/O as being usable only in GHC-compiled code. Which is sub-optimal, but livable anyway.
I guess I'm really wondering why you need to fork and do I/O at all. Can you describe the problem at a higher level?
Also, why does hGetContents not work, but hPutStr does? If the IO manager is dead, how does some IO still work?
Ah well, only I/O that needs to block uses the I/O manager thread. I/O that doesn't block just proceeds directly. Cheers, Simon

On Thu, Mar 01, 2007 at 04:21:45PM +0000, Simon Marlow wrote:
Between that and the lack of support for forkProcess in Hugs, this renders anything that needs to fork and then do I/O as being usable only in GHC-compiled code. Which is sub-optimal, but livable anyway.
I guess I'm really wondering why you need to fork and do I/O at all. Can you describe the problem at a higher level?
I am, for all intents and purposes, writing what amounts to a simple shell.

The standard way of implementing pipes between two external programs in Unix involves setting up pipes and forking, then duping things to stdin/stdout, and execing the final program. In this case, I am setting it up to let people pipe to Haskell functions as well, forking off a process that works with pipes to handle them.

I know how all these things work in Unix, in C, in Python, etc.

I have no idea how all of this will interact if I were to use forkOS. It is not clear to me what the semantics of forkProcess, executeFile, signal handling, etc. are under a Haskell thread instead of a forked process. This is, as far as I can tell, completely undocumented in System.Posix.* and the subject of differing advice on the WWW.

But let me add a voice to keeping the non-threaded RTS around. I have learned the hard way that the threaded RTS is ported only to a very few platforms, a distinct minority of the platforms that Debian supports, for instance (just like ghci), whereas the non-threaded RTS is supported much more broadly (such as Alpha support). My own program hpodder has failed to build in Debian on many platforms because I didn't realize this going in.

Not only that, but it is apparent that the threaded RTS is simply inappropriate when a person is trying to do anything remotely low-level on the system. I would hate to have to become a Haskell refugee, going back to Python, because Haskell I/O has become incompatible with fork(). I do not find a language to be useful, in general, unless it lets me fork and exec when I have to.

-- John
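To make the pattern concrete, here is a minimal sketch of the pipe/fork/dup/exec scheme described above, in the style of the test case earlier in the thread. The wrapper name runPipedPair is invented for the example, and error handling and reaping the children are omitted:

--->
-- Sketch only: run roughly "cmd1 | cmd2" with two external commands.
-- runPipedPair is an invented name for illustration.
import System.Posix

runPipedPair :: (FilePath, [String]) -> (FilePath, [String]) -> IO (ProcessID, ProcessID)
runPipedPair (cmd1, args1) (cmd2, args2) = do
  (readEnd, writeEnd) <- createPipe
  -- First child: stdout becomes the write end of the pipe, then exec.
  pid1 <- forkProcess $ do
    _ <- dupTo writeEnd stdOutput
    closeFd readEnd
    closeFd writeEnd
    executeFile cmd1 True args1 Nothing
  -- Second child: stdin becomes the read end of the pipe, then exec.
  pid2 <- forkProcess $ do
    _ <- dupTo readEnd stdInput
    closeFd readEnd
    closeFd writeEnd
    executeFile cmd2 True args2 Nothing
  -- The parent must close both ends, or the children never see EOF.
  closeFd readEnd
  closeFd writeEnd
  return (pid1, pid2)
<---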

At Thu, 1 Mar 2007 11:38:54 -0600, John Goerzen wrote:
On Thu, Mar 01, 2007 at 04:21:45PM +0000, Simon Marlow wrote:
Between that and the lack of support for forkProcess in Hugs, this renders anything that needs to fork and then do I/O as being usable only in GHC-compiled code. Which is sub-optimal, but livable anyway.
I guess I'm really wondering why you need to fork and do I/O at all. Can you describe the problem at a higher level?
I am, for all intents and purposes, writing what amounts to a simple shell.
The neat thing about the library is that external commands and Haskell code can be freely intermixed, and are uniformly handled. For example, in this pipeline:

r <- runS ("ls -l" -|- "grep i" -|- wcL)

wcL is a simple Haskell function:

wcL :: [String] -> [String]
wcL inp = [show $ genericLength inp]

The HSH library just creates some pipes to hook the processes together, and then forks off ls, grep, and wcL as separate processes. The advantage of this scheme is that once the pipeline is started, everything behaves the same way it would if you had run the bash command:

$ ls -l | grep i | wcL

So, you get very familiar behaviour/performance from a shell scripting point of view. But, you also get to easily stick Haskell functions in your pipeline.

Poking around with the full HSH code, I *think* I got pipelines that *only* call external commands working fine[1]. This seems logical, since the external commands do not care about the Haskell I/O manager at all.

So, perhaps you can have an alternate version of 'instance ShellCommand (String -> IO String)' that gets used for -threaded and uses forkOS instead of forkProcess. All of the external commands would still be forked into separate processes, but all of the Haskell commands would run in the same threaded process. Obviously, you would have to fake the return code, but it looks like that should be feasible.

Some open questions are: a) how do you detect that you are running in the threaded RTS, and b) can you have the linker pick the correct version at link time, so that you do not have to have a compile-time check? Of course, a compile-time check might only have to be done once, so the overhead would not be significant.

j.

[1] In fact, they may work fine out of the box, I haven't tested that.
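As a rough illustration of that alternate instance (not HSH's actual code), a Haskell [String] -> [String] stage could be run as a thread that copies between two pipe handles; the name runHaskellStage and the use of forkIO here are assumptions made for the sketch:

--->
-- Sketch: run a pure [String] -> [String] stage as a Haskell thread instead
-- of a forked process.  runHaskellStage is an invented name; HSH's real
-- ShellCommand instances are more involved (return codes, error handling).
import Control.Concurrent (forkIO, ThreadId)
import System.IO (Handle, hGetContents, hPutStr, hClose)

runHaskellStage :: ([String] -> [String]) -> Handle -> Handle -> IO ThreadId
runHaskellStage f hIn hOut = forkIO $ do
  input <- hGetContents hIn                 -- lazily read from the upstream pipe
  hPutStr hOut (unlines (f (lines input)))
  hClose hOut                               -- close so the downstream end sees EOF
  hClose hIn
<---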

John Goerzen wrote:
The standard way of implementing pipes between two external programs in Unix involves setting up pipes and forking, then duping things to stdin/stdout, and execing the final program. In this case, I am setting it up to let people pipe to Haskell functions as well, forking off a process that works with pipes to handle them.
I know how all these things work in Unix, in C, in Python, etc.
I have no idea how all of this will interact if I were to use forkOS. It is not clear to me what the semantics of forkProcess, executeFile, signal handling, etc. are under a Haskell thread instead of a forked process. This is, as far as I can tell, completely undocumented in System.Posix.* and the subject of differing advice on the WWW.
We can certainly add any missing documentation - can you suggest specifically what you'd like to see mentioned? forkProcess does say what happens when there are multiple threads, and I've added some more notes about I/O with -threaded. executeFile isn't affected by threads. Signal handling unfortunately won't work in the child of forkProcess, with -threaded, right now, for the same reason that I/O doesn't work.
But let me add a voice to keeping the non-threaded RTS around. I have learned the hard way that the threaded RTS is ported only to a very few platforms, a distinct minority of the platforms that Debian supports, for instance. (Just like ghci). Whereas the non-threaded RTS is supported much more broadly (such as Alpha support). My own program hpodder has failed to build in Debian on many platforms because I didn't realize this going in.
Not only that, but it is apparent that the threaded RTS is simply inappropriate when a person is trying to do anything remotely low-level on the system. I would hate to have to become a Haskell refugee, going back to Python, because Haskell I/O has become incompatible with fork().
I do not find a language to be useful, in general, unless it lets me fork and exec when I have to.
I share your concerns, and doing low-level system programming of this kind is certainly something we want to support. I'd also consider it a serious problem if we were to lose the ability to write a shell in Haskell. (I should point out that it's not usually fork+exec that causes problems, rather fork on its own - fork+exec is pretty well supported; we use it in the timeout program in the GHC testsuite, for example.)

Many people really need the facilities that the threaded RTS provides (non-blocking foreign calls, SMP parallelism, writing thread-safe DLLs, etc.), so we have to provide these facilities without losing support for system programming. I accept we may have tipped the balance a little recently in favour of the funky new stuff; apologies for that, and thanks for bringing it up. We won't be throwing away the non-threaded RTS any time soon, certainly not while there are certain programs and platforms that only work with it.

Regarding platform support for the threaded RTS: that's certainly a problem, which is mainly due to lack of resources. I'm sure most bugs are probably fairly shallow, since we only use POSIX thread support.

Regarding your shell: I would suggest trying forkIO for the Haskell "processes" (not forkOS, unless for some reason you really need another OS thread). However, I can imagine that it might be hard to implement job control and signal handling in that system. You could also consider using System.Process.runInteractiveProcess, for portability.

Cheers,
Simon
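To illustrate the System.Process route, here is a rough sketch; the helper name pipeThrough is invented, and real code would feed large inputs from a separate thread and handle errors more carefully:

--->
-- Sketch: run one external command via System.Process and post-process its
-- output with a Haskell function, with no forkProcess involved.
-- pipeThrough is an invented name for this example.
import System.Process (runInteractiveProcess, waitForProcess)
import System.IO (hPutStr, hClose, hGetContents)
import System.Exit (ExitCode)
import Control.Exception (evaluate)

pipeThrough :: FilePath -> [String] -> (String -> String) -> String -> IO (String, ExitCode)
pipeThrough cmd args f input = do
  (hin, hout, herr, ph) <- runInteractiveProcess cmd args Nothing Nothing
  hPutStr hin input          -- for large inputs, feed this from another thread
  hClose hin                 -- the child now sees EOF on its stdin
  out <- hGetContents hout
  err <- hGetContents herr
  _ <- evaluate (length out) -- force both streams before waiting, otherwise a
  _ <- evaluate (length err) --   full pipe buffer could block the child
  code <- waitForProcess ph
  return (f out, code)
<---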

Hello Simon, Friday, March 2, 2007, 1:07:07 PM, you wrote:
But let me add a voice to keeping the non-threaded RTS around.
I want to mention that the problem here is not the threaded RTS by itself, but the standard I/O library, which works via a separate I/O manager thread that is a built-in part of the RTS. My Streams library [1] doesn't use this thread at all. For threads created with forkOS it provides excellent overlapping of I/O and computations (thanks, Simon, the situation was *greatly* improved in 6.6); of course, it should be not so great for forkIO'd threads.

What I want to say is that a future I/O lib may be written in an RTS-independent way. John Meacham once proposed to develop a common API for I/O managers that would allow various "select" variants to be used with any I/O lib that works via this API.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Simon,
Friday, March 2, 2007, 1:07:07 PM, you wrote:
But let me add a voice to keeping the non-threaded RTS around.
i want to mention that problem here is not the threaded RTS by itself, but standard i/o library that works via separate i/o manager thread that is built-in part of RTS.
The I/O manager thread is hardly built in to the RTS. It is all in a Haskell library; the only connection with the RTS is that the RTS feeds signals to the I/O manager thread down a pipe, and in fact we could move this signal-handling code out of the RTS and into the base package too.

My Streams library [1] doesn't use this thread at all. For threads created with forkOS it provides excellent overlapping of I/O and computations (thanks, Simon, the situation was *greatly* improved in 6.6); of course, it should be not so great for forkIO'd threads.
I don't understand why forkOS should be any different from forkIO in this context. Could you explain?

There seems to be a common misconception that forkOS is necessary to get certain kinds of concurrency, and forkIO won't do. I don't know where this comes from: the documentation does seem to be quite clear to me. The only reason to use forkOS is for interacting with foreign code that uses thread-local state; everything else can be done with forkIO (and it is usually better to use forkIO).

Cheers,
Simon

On Mon, Mar 05, 2007 at 12:59:17PM +0000, Simon Marlow wrote:
There seems to be a common misconception that forkOS is necessary to get certain kinds of concurrency, and forkIO won't do. I don't know where this comes from: the documentation does seem to be quite clear to me. The only reason to use forkOS is for interacting with foreign code that uses thread-local state; everything else can be done with forkIO (and it is usually better to use forkIO).
From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?

On Mon, Mar 05, 2007 at 08:36:29AM -0600, John Goerzen wrote:
On Mon, Mar 05, 2007 at 12:59:17PM +0000, Simon Marlow wrote:
There seems to be a common misconception that forkOS is necessary to get certain kinds of concurrency, and forkIO won't do. I don't know where this comes from: the documentation does seem to be quite clear to me. The only reason to use forkOS is for interacting with foreign code that uses thread-local state; everything else can be done with forkIO (and it is usually better to use forkIO).
From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?
You can use e.g. +RTS -N2 to use 2 OS threads.

Thanks
Ian
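For example (an illustrative invocation; MyProg is a placeholder name):

ghc -threaded --make MyProg.hs -o myprog
./myprog +RTS -N2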

Ian Lynagh wrote:
On Mon, Mar 05, 2007 at 08:36:29AM -0600, John Goerzen wrote:
On Mon, Mar 05, 2007 at 12:59:17PM +0000, Simon Marlow wrote:
There seems to be a common misconception that forkOS is necessary to get certain kinds of concurrency, and forkIO won't do. I don't know where this comes from: the documentation does seem to be quite clear to me. The only reason to use forkOS is for interacting with foreign code that uses thread-local state; everything else can be done with forkIO (and it is usually better to use forkIO).

From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?
You can use e.g. +RTS -N2 to use 2 OS threads.
I've added a sentence to the forkOS docs to say that you don't need forkOS to get parallelism. Cheers, Simon

On Mon, Mar 05, 2007 at 03:20:05PM +0000, Ian Lynagh wrote:
From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?
You can use e.g. +RTS -N2 to use 2 OS threads.
That's rather ugly though, and doesn't "just work". With other languages, I could just use OS threads, and let the OS schedule, say, 15 threads across 2 CPUs, or 4 CPUs, or however many I have. -- John

John Goerzen wrote:
On Mon, Mar 05, 2007 at 03:20:05PM +0000, Ian Lynagh wrote:
From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?

You can use e.g. +RTS -N2 to use 2 OS threads.

That's rather ugly though, and doesn't "just work". With other languages, I could just use OS threads, and let the OS schedule, say, 15 threads across 2 CPUs, or 4 CPUs, or however many I have.
-- John
Choice is good, but it does mean the default may need to be tweaked, such as with those options. The main difference is in how lightweight or heavyweight the threads are. Lightweight forkIO threads allow for tremendous performance; see the benchmarks here:

http://shootout.alioth.debian.org/gp4/benchmark.php?test=chameneos&lang=all
http://shootout.alioth.debian.org/gp4/benchmark.php?test=message&lang=all

Those benchmarks are without using a "+RTS -N2" style thread pool.

-- Chris

On Mon, Mar 05, 2007 at 10:23:53AM -0600, John Goerzen wrote:
On Mon, Mar 05, 2007 at 03:20:05PM +0000, Ian Lynagh wrote:
From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?
You can use e.g. +RTS -N2 to use 2 OS threads.
That's rather ugly though, and doesn't "just work". With other languages, I could just use OS threads, and let the OS schedule, say, 15 threads across 2 CPUs, or 4 CPUs, or however many I have.
You can set a default -Nn as described in http://www.haskell.org/ghc/docs/latest/html/users_guide/runtime-control.html... if that helps. Thanks Ian

John Goerzen wrote:
On Mon, Mar 05, 2007 at 03:20:05PM +0000, Ian Lynagh wrote:
From reading the docs, it sounds like forkIO keeps everything in a single OS thread/process. Doesn't this mean that a program that uses forkIO instead of forkOS loses out on SMP machines?

You can use e.g. +RTS -N2 to use 2 OS threads.

That's rather ugly though, and doesn't "just work". With other languages, I could just use OS threads, and let the OS schedule, say, 15 threads across 2 CPUs, or 4 CPUs, or however many I have.
One day we might make this automatic, but you're missing the main point: what GHC gives you is lightweight threads that scale transparently on a multiprocessor. You can create thousands of threads without worrying about performance, and therefore you are free to structure your program's concurrency according to the application's needs, not the demands of performance. You don't have to limit the number of threads and do event-driven programming just because threads are too expensive. And your program will scale on a multiprocessor without recompilation. Cheers, Simon

Hello Simon, Monday, March 5, 2007, 3:59:17 PM, you wrote:
My Streams library [1] doesn't use this thread at all. For threads created with forkOS it provides excellent overlapping of I/O and computations (thanks, Simon, the situation was *greatly* improved in 6.6); of course, it should be not so great for forkIO'd threads.
I don't understand why forkOS should be any different from forkIO in this context. Could you explain?
There seems to be a common misconception that forkOS is necessary to get certain kinds of concurrency, and forkIO won't do. I don't know where this comes from: the documentation does seem to be quite clear to me. The only reason to use forkOS is for interacting with foreign code that uses thread-local state; everything else can be done with forkIO (and it is usually better to use forkIO).
It may be entirely due to my ignorance :) My program uses -threaded and forkOS in order to run several C threads simultaneously, and I haven't performed tests in any other conditions.

So, one thread may read data from a file, another thread write data, and one more run compression using a C routine. In 6.4, these tasks were overlapped only partially, while in 6.6 they overlap 100%.

Don't forget that read/write calls are also foreign calls, so while all C calls are marked as "safe", 6.4 doesn't overlap them well enough (also, to make things harder, the C compression routine makes calls back to Haskell routines). If the Haskell runtime creates new OS threads for executing other Haskell threads while one thread performs a safe C call, then it should be OK. Probably, I just mixed up forkOS and -threaded mode :)

I remember that I had problems with forkIO. I will try it again and report the results.

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Bulat Ziganshin wrote:
Hello Simon,
Monday, March 5, 2007, 3:59:17 PM, you wrote:
my Streams library [1] don't uses this thread at all. for threads created with forkOS it provides excellent overlapping of I/O and computations (thanks, Simon, situation was *greatly* improved in 6.6). of course, it should be not so great for forkIO'd threads
I don't understand why forkOS should be any different from forkIO in this context. Could you explain?
There seems to be a common misconception that forkOS is necessary to get certain kinds of concurrency, and forkIO won't do. I don't know where this comes from: the documentation does seem to be quite clear to me. The only reason to use forkOS is for interacting with foreign code that uses thread-local state; everything else can be done with forkIO (and it is usually better to use forkIO).
It may be entirely due to my ignorance :) My program uses -threaded and forkOS in order to run several C threads simultaneously, and I haven't performed tests in any other conditions.
So, one thread may read data from a file, another thread write data, and one more run compression using a C routine. In 6.4, these tasks were overlapped only partially, while in 6.6 they overlap 100%.
Don't forget that read/write calls are also foreign calls, so while all C calls are marked as "safe", 6.4 doesn't overlap them well enough (also, to make things harder, the C compression routine makes calls back to Haskell routines). If the Haskell runtime creates new OS threads for executing other Haskell threads while one thread performs a safe C call, then it should be OK. Probably, I just mixed up forkOS and -threaded mode :)
Ok, there was a complete rewrite of the scheduler between 6.4 and 6.6 so this may account for the differences you see. Beware of forkOS: it'll reduce performance on the Haskell side, because essentially each context switch between a forkOS'd thread and another thread is a complete OS-thread context switch, which is hundreds of times slower than context switching between forkIO'd threads. Cheers, Simon

On 2007-03-02, Simon Marlow wrote:
Regarding your shell: I would suggest trying forkIO for the Haskell "processes" (not forkOS unless for some reason you really need another OS thread). However, I can imagine that it might be hard to implement job control and signal handling in that system. You could also consider using System.Process.runInteractiveProcess, for portability.
Thinking about it, forkIO seems like it could be fairly complex to just drop in. The problem lies around file descriptors. When you fork off a new process and then close file descriptors in the parent, they stay open in the child, and vice versa. Proper management of file descriptors is a critical part of a shell, and it's vital to close the proper set of FDs at the proper time in the parent and the child, or else bad things like pipes never closing could easily lead to deadlock.

Of course, it is possible to work around this, but I fear that it could make the program very complex.

-- John

John Goerzen wrote:
On 2007-03-02, Simon Marlow wrote:
Regarding your shell: I would suggest trying forkIO for the Haskell "processes" (not forkOS unless for some reason you really need another OS thread). However, I can imagine that it might be hard to implement job control and signal handling in that system. You could also consider using System.Process.runInteractiveProcess, for portability.
Thinking about it, forkIO seems like it could be fairly complex to just drop in. The problem lies around file descriptors. When you fork off a new process and then close file descriptors in the parent, they stay open in the child, and vice versa. Proper management of file descriptors is a critical part of a shell, and it's vital to close the proper set of FDs at the proper time in the parent and the child, or else bad things like pipes never closing could easily lead to deadlock.
Of course, it is possible to work around this, but I fear that it could make the program very complex.
Admittedly I haven't completely thought this through, but my intuition was that you would be able to use forkIO at a higher level. That is, instead of just trying to replace forkProcess with forkIO, you replace forkProcess + pipes + FD handling with forkIO + lazy streams, for Haskell processes.

So the way in which data is fed between processes depends on the process: Haskell processes talk to each other using lazy streams, external processes talk to each other over pipes, and at a boundary between the two you need a pipe with another Haskell thread to feed the pipe from a lazy stream, or vice versa.

Cheers,
Simon
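A rough sketch of that boundary (with invented names feedPipe and drainPipe): a forkIO'd thread copies a lazy String into a pipe handle, and the other direction is just hGetContents.

--->
-- Sketch of the boundary between a lazy-stream "Haskell process" and a real
-- pipe.  feedPipe and drainPipe are invented names for this illustration.
import Control.Concurrent (forkIO, ThreadId)
import System.IO (Handle, hPutStr, hClose, hGetContents)

-- Push a lazy String into the write end of a pipe, closing it so the external
-- process on the other side eventually sees EOF.
feedPipe :: Handle -> String -> IO ThreadId
feedPipe h s = forkIO (hPutStr h s >> hClose h)

-- The other direction: the read end of a pipe becomes a lazy String that a
-- Haskell stage can consume.
drainPipe :: Handle -> IO String
drainPipe = hGetContents
<---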

On Thu, Mar 1, 2007 at 5:21 PM, Simon Marlow wrote:
In fact you should think of the non-threaded RTS as deprecated. It isn't Haskell'-compliant, for one thing (assuming that Haskell' will probably require non-blocking foreign calls).
I'm hesitant to actually deprecate it, for a few reasons: the threaded RTS is so much more complicated, it might have some adverse performance implications, and there are still people who want to run everything in a single OS thread, for whatever reason. But having multiple variants of the RTS is a maintenance and testing headache.
Had you deprecated the non-threaded RTS, we would probably have no problems described in ticket #2848 :-/ I think you'll have to deprecate it anyway, because it will be more and more difficult to maintain two versions of code, especially if one of them will be much less used and tested. Best regards Tomasz

Hello Tomasz, Saturday, December 6, 2008, 10:52:39 PM, you wrote:
Had you deprecated the non-threaded RTS, we would probably have no problems described in ticket #2848 :-/
I think you'll have to deprecate it anyway, because it will be more and more difficult to maintain two versions of code, especially if one of them will be much less used and tested.
We may conduct a small survey on the amount of usage of the old RTS (I mean, ask about this in haskell-cafe).

-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Hi Bulat,

My contribution to the survey: I've used forkProcess to daemonize a ghc program inside the Haskell FUSE bindings:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HFuse
http://code.haskell.org/hfuse/System/Fuse.hsc

If removing the non-threaded RTS would break forkProcess entirely, these bindings would have to do something different. The issue: users of the FUSE C API will get daemonized using daemon(2); it'd be nice if GHC FUSE programs could behave similarly.

Thanks,
Brian Bloniarz
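For context, a daemonization along those lines typically looks something like the sketch below; this is not HFuse's actual code, and error handling plus redirecting stdio to /dev/null are glossed over:

--->
-- Sketch of daemon(2)-style daemonization using forkProcess (the classic
-- double fork).  Not HFuse's actual implementation.
import System.Posix
import System.Exit (ExitCode(ExitSuccess))

daemonize :: IO () -> IO ()
daemonize act = do
  _ <- forkProcess $ do            -- first fork
    _ <- createSession             -- become session leader, drop the controlling tty
    _ <- forkProcess $ do          -- second fork: can never reacquire a tty
      changeWorkingDirectory "/"
      act                          -- real code would also redirect stdio to /dev/null
    exitImmediately ExitSuccess
  exitImmediately ExitSuccess      -- the original process exits, like daemon(2)
<---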
Hello Tomasz,
Saturday, December 6, 2008, 10:52:39 PM, you wrote:
Had you deprecated the non-threaded RTS, we would probably have no problems described in ticket #2848 :-/
I think you'll have to deprecate it anyway, because it will be more and more difficult to maintain two versions of code, especially if one of them will be much less used and tested.
we may conduct small survey on amount of usage of old RTS (i mean ask this in haskell-cafe)
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Brian B wrote:
Hi Bulat,
My contribution to the survey: I've used forkProcess to daemonize a ghc program inside the haskell fuse bindings: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HFuse http://code.haskell.org/hfuse/System/Fuse.hsc
If removing the non-threaded RTS would break forkProcess entirely, these bindings would have to do something different. The issue: users of the FUSE C api will get daemonized using daemon(2); it'd be nice if GHC fuse programs could behave similarly.
I also use forkProcess extensively: in HSH, for instance, which is used by hpodder, twidge, and a host of other tools. Removing the ability to use forkProcess removes the ability to write a Unix shell in Haskell, or to do anything shell-like, or anything even mildly advanced involving piping, file descriptors, and the like. I would see it as a significant regression.

The System.Process calls, last I checked (in 6.8.x), were both too buggy to use for complex tasks and too inadequate for some (though the adequacy has been improving).

-- John

John Goerzen wrote:
Brian B wrote:
Hi Bulat,
My contribution to the survey: I've used forkProcess to daemonize a ghc program inside the haskell fuse bindings: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HFuse http://code.haskell.org/hfuse/System/Fuse.hsc
If removing the non-threaded RTS would break forkProcess entirely, these bindings would have to do something different. The issue: users of the FUSE C api will get daemonized using daemon(2); it'd be nice if GHC fuse programs could behave similarly.
I also use forkProcess extensively: in HSH, for instance, which is used by hpodder, twidge, and a host of other tools. Removing the ability to use forkProcess removes the ability to write a Unix shell in Haskell, or to do anything shell-like, or anything even mildly advanced involving piping, file descriptors, and the like. I would see it as a significant regression.
Have you tried those apps with the threaded RTS? I'd be interested to know whether they work as expected. I'm not suggesting we remove the non-threaded RTS, however perhaps there's an argument for making -threaded the default. After all, that's what you get with GHCi by default right now. Maintaining both versions of the RTS is certainly a burden, but I think it's one we have to carry, since there are still reasons to want both.
The System.Process calls, last I checked (in 6.8.x) were both too buggy to use for complex tasks, and too inadequate for some (though the adequacy has been improving.)
If there's bugginess we need to get it fixed - please report those bugs! Cheers, Simon

Simon Marlow wrote:
John Goerzen wrote:
Hi Bulat,
My contribution to the survey: I've used forkProcess to daemonize a ghc program inside the haskell fuse bindings: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HFuse http://code.haskell.org/hfuse/System/Fuse.hsc
If removing the non-threaded RTS would break forkProcess entirely, these bindings would have to do something different. The issue: users of the FUSE C API will get daemonized using daemon(2); it'd be nice if GHC FUSE programs could behave similarly.
I also use forkProcess extensively: in HSH, for instance, which is used by hpodder, twidge, and a host of other tools. Removing the ability to use forkProcess removes the ability to write a Unix shell in Haskell, or to do anything shell-like, or anything even mildly advanced involving piping, file descriptors, and the like. I would see it as a significant regression.
Have you tried those apps with the threaded RTS? I'd be interested to know whether they work as expected.
I have, and it didn't work well. But it's been a while, and I can't tell you anymore what version of GHC or what exactly the problem was. It was most certainly 6.8 or older. Once 6.10 hits Debian, I could test again there. But see below...
I'm not suggesting we remove the non-threaded RTS, however perhaps there's an argument for making -threaded the default. After all, that's what you get with GHCi by default right now.
That's probably an OK solution. I would also add: does the threaded RTS support all platforms? For instance, GHC runs on my Alpha and on AIX, unregisterised. ghci doesn't run there, but GHC does. If you drop the non-threaded RTS, does that mean that GHC doesn't work there at all?
The System.Process calls, last I checked (in 6.8.x) were both too buggy to use for complex tasks, and too inadequate for some (though the adequacy has been improving.)
If there's bugginess we need to get it fixed - please report those bugs!
Already done:

http://hackage.haskell.org/trac/ghc/ticket/1780 (still open since Nov 2007)

There was also a thread here regarding problems with the threaded RTS:

http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11573.html

Not sure if that has been fixed, or was an error on my part, but see your reply at:

http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11585.html

I admit I haven't had the chance to reread that whole thread, so my apologies if this is a red herring.

-- John

John Goerzen wrote:
Simon Marlow wrote:
John Goerzen wrote:
Hi Bulat,
My contribution to the survey: I've used forkProcess to daemonize a ghc program inside the haskell fuse bindings: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HFuse http://code.haskell.org/hfuse/System/Fuse.hsc
If removing the non-threaded RTS would break forkProcess entirely, these bindings would have to do something different. The issue: users of the FUSE C API will get daemonized using daemon(2); it'd be nice if GHC FUSE programs could behave similarly.
I also use forkProcess extensively: in HSH, for instance, which is used by hpodder, twidge, and a host of other tools. Removing the ability to use forkProcess removes the ability to write a Unix shell in Haskell, or to do anything shell-like, or anything even mildly advanced involving piping, file descriptors, and the like. I would see it as a significant regression.
Have you tried those apps with the threaded RTS? I'd be interested to know whether they work as expected.
I have, and it didn't work well. But it's been a while, and I can't tell you anymore what version of GHC or what exactly the problem was. It was most certainly 6.8 or older. Once 6.10 hits Debian, I could test again there. But see below...
I'm not suggesting we remove the non-threaded RTS, however perhaps there's an argument for making -threaded the default. After all, that's what you get with GHCi by default right now.
That's probably an OK solution.
I would also add: does the threaded RTS support all platforms? For instance, GHC runs on my Alpha and on AIX, unregisterised. ghci doesn't run there, but GHC does. If you drop the non-threaded RTS, does that mean that GHC doesn't work there at all?
If those platforms support threads, there's no reason why the threaded RTS shouldn't work there. Also, GHCi should work on all platforms (even unregisterised) these days, including the FFI if there's support in libffi for that platform. However, if the threaded RTS doesn't work on a platform for some reason, that doesn't prevent us from just falling back to the non-threaded RTS for that platform. Most things will still work.
The System.Process calls, last I checked (in 6.8.x), were both too buggy to use for complex tasks and too inadequate for some (though the adequacy has been improving).
If there's bugginess we need to get it fixed - please report those bugs!
Already done:
http://hackage.haskell.org/trac/ghc/ticket/1780 (still open since Nov 2007)
That one is closed - fixed in 6.8.3 I think.
There was also a thread here regarding problems with the threaded RTS:
http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11573.html
Not sure if that has been fixed, or was an error on my part, but see your reply at:
http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11585.html
I did make a ticket for that:

http://hackage.haskell.org/trac/ghc/ticket/1185

That should be fixable - I'll put it on the current milestone.

Cheers,
Simon

Simon Marlow wrote:
I would also add: does the threaded RTS support all platforms? For instance, GHC runs on my Alpha and on AIX, unregisterised. ghci doesn't run there, but GHC does. If you drop the non-threaded RTS, does that mean that GHC doesn't work there at all?
If those platforms support threads, there's no reason why the threaded RTS shouldn't work there. Also, GHCi should work on all platforms (even unregisterised) these days, including the FFI if there's support in libffi for that platform.
That's very good to hear.
http://hackage.haskell.org/trac/ghc/ticket/1780 (still open since Nov 2007)
That one is closed - fixed in 6.8.3 I think.
Oops, my mistake. I'll look into it again.
http://www.mail-archive.com/glasgow-haskell-users@haskell.org/msg11585.html
I did make a ticket for that:
#1 way to tell you are a Haskell geek:

* milestone changed from _|_ to 6.10.2

<grin> Thanks, Simon.

-- John

Brian B wrote:
Hi Bulat,
My contribution to the survey: I've used forkProcess to daemonize a ghc program inside the haskell fuse bindings: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/HFuse http://code.haskell.org/hfuse/System/Fuse.hsc
If removing the non-threaded RTS would break forkProcess entirely, these bindings would have to do something different. The issue: users of the FUSE C api will get daemonized using daemon(2); it'd be nice if GHC fuse programs could behave similarly.
forkProcess should work with the threaded RTS, as long as you don't enable multiple cores with +RTS -N<n>. However, forking is a pretty tricky operation in a multi-threaded environment, and that's where the difficulty comes from. Cheers, Simon
Thanks, Brian Bloniarz
Hello Tomasz,
Saturday, December 6, 2008, 10:52:39 PM, you wrote:
Had you deprecated the non-threaded RTS, we would probably have no problems described in ticket #2848 :-/
I think you'll have to deprecate it anyway, because it will be more and more difficult to maintain two versions of code, especially if one of them will be much less used and tested.
we may conduct small survey on amount of usage of old RTS (i mean ask this in haskell-cafe)
-- Best regards, Bulat mailto:Bulat.Ziganshin@gmail.com

Had you deprecated the non-threaded RTS, we would probably have no problems described in ticket #2848 :-/
I think you'll have to deprecate it anyway, because it will be more and more difficult to maintain two versions of code,
we may conduct small survey on amount of usage of old RTS (i mean ask this in haskell-cafe)
For the only application I tried, using the threaded RTS imposes a 100% performance penalty - i.e. computation time doubles, compared to the non-threaded RTS. This was with ghc-6.8.2, and maybe the overhead has improved since then? Regards, Malcolm

Malcolm Wallace wrote:
Had you deprecated the non-threaded RTS, we would probably have no problems described in ticket #2848 :-/
I think you'll have to deprecate it anyway, because it will be more and more difficult to maintain two versions of code.
We may conduct a small survey on the amount of usage of the old RTS (I mean, ask about this in haskell-cafe).
For the only application I tried, using the threaded RTS imposes a 100% performance penalty - i.e. computation time doubles, compared to the non-threaded RTS. This was with ghc-6.8.2, and maybe the overhead has improved since then?
This is a guess, but I wonder if this program is concurrent, and does a lot of communication between the main thread and other threads? The main thread is a bound thread, which means that communication between the main thread and any other thread is much more expensive than communication between unbound threads, because it involves full OS-level context switches. In a concurrent program, don't use the main thread to do any real work; do a forkIO and wait for the child to complete.

Certainly a 2x performance overhead for the threaded RTS is not something we normally see. There will be an overhead for MVars and STM, but even then I'd consider 2x to be deeply suspicious. For most programs, the overhead should be close to zero.

Cheers,
Simon
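A sketch of that pattern (invented example): the bound main thread only forks the real work and waits, so all heavy communication happens between unbound threads.

--->
-- Sketch: keep the (bound) main thread idle and do the real work in a
-- forkIO'd thread, waiting on an MVar for it to finish.  Real code would
-- also propagate exceptions from the worker.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

realMain :: IO ()
realMain = putStrLn "all the actual work goes here"

main :: IO ()
main = do
  done <- newEmptyMVar
  _ <- forkIO (realMain >> putMVar done ())
  takeMVar done   -- the main thread does nothing but wait
<---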

Simon Marlow wrote:
Malcolm Wallace wrote:
For the only application I tried, using the threaded RTS imposes a 100% performance penalty - i.e. computation time doubles, compared to the non-threaded RTS. This was with ghc-6.8.2, and maybe the overhead has improved since then?
This is a guess, but I wonder if this program is concurrent, and does a lot of communication between the main thread and other threads?
Exactly so - it hits the worst case behaviour. This was a naive attempt to parallelise an algorithm by shifting some work onto a spare processor. Unfortunately, there is a lot of communication to the main thread, because the work that was shifted elsewhere computes a large data structure in chunks, and passes those chunks back. The main thread then runs OpenGL calls using this data -- and I believe OpenGL calls must run in a bound thread. This all suggests that one consequence of ghc's RTS implementation choices is that it will never be cheap to compute visualization data in parallel with rendering it in OpenGL. That would be a shame. This was exactly the parallelism I was hoping for. Regards, Malcolm

Malcolm Wallace wrote:
Simon Marlow wrote:
Malcolm Wallace wrote:
For the only application I tried, using the threaded RTS imposes a 100% performance penalty - i.e. computation time doubles, compared to the non-threaded RTS. This was with ghc-6.8.2, and maybe the overhead has improved since then?
This is a guess, but I wonder if this program is concurrent, and does a lot of communication between the main thread and other threads?
Exactly so - it hits the worst case behaviour. This was a naive attempt to parallelise an algorithm by shifting some work onto a spare processor. Unfortunately, there is a lot of communication to the main thread, because the work that was shifted elsewhere computes a large data structure in chunks, and passes those chunks back. The main thread then runs OpenGL calls using this data -- and I believe OpenGL calls must run in a bound thread.
This all suggests that one consequence of ghc's RTS implementation choices is that it will never be cheap to compute visualization data in parallel with rendering it in OpenGL. That would be a shame. This was exactly the parallelism I was hoping for.
I'm not sure how we could do any better here. To get parallelism you need to run the OpenGL thread and the worker thread on separate OS threads, which we do. So what aspect of the RTS design is preventing you from getting the parallelism you want?

It seems that the problem you have is that moving to the multithreaded runtime imposes an overhead on the communication between your two threads, when run on a *single CPU*. But performance on a single CPU is not what you're interested in - you said you wanted parallelism, and for that you need multiple CPUs, and hence multiple OS threads.

I suspect the underlying problem in your program is that the communication is synchronous. To get good parallelism you'll need to use asynchronous communication, otherwise even on multiple CPUs you'll see little parallelism. If you still do asynchronous communication and yet don't get good parallelism, then we should look into what's causing that.

Cheers,
Simon

It seems that the problem you have is that moving to the multithreaded runtime imposes an overhead on the communication between your two threads, when run on a *single CPU*. But performance on a single CPU is not what you're interested in - you said you wanted parallelism, and for that you need multiple CPUs, and hence multiple OS threads.
Well, I'm interested in getting an absolute speedup. If the threaded performance on a single core is slightly slower than the non-threaded performance on a single core, that would be OK provided that the threaded performance using multiple cores was better than the same non-threaded baseline. However, it doesn't seem to work like that at all. In fact, threaded on multiple cores was _even_slower_ than threaded on a single core! Here are some figures:

    ghc-6.8.2 -O2    apply    MVar   strict   thr-N2   thr-N1
    silicium          7.30    7.95     7.23    15.25    14.71
    neghip            4.25    4.43     4.18     6.67     6.48
    hydrogen         11.75   10.82    10.99    13.45    12.96
    lobster           55.8    51.5     57.6     76.6     74.5

The first three columns are variations of the program using slightly different communications mechanisms, including threads/MVars with the non-threaded RTS. The final two columns are for the MVar mechanism with threaded RTS and either 1 or 2 cores. -N2 is slowest.
I suspect the underlying problem in your program is that the communication is synchronous. To get good parallelism you'll need to use asynchronous communication, otherwise even on multiple CPUs you'll see little parallelism.
I tried using Chans instead of MVars, to provide for different speeds of reader/writer, but the timings were even worse. (Add another 15-100%.)

When I have time to look at this again (probably in the New Year), I will try some other strategies for communication that vary in their synchronous/asynchronous chunk size, to see if I can pin things down more closely.

Regards,
Malcolm

Malcolm Wallace wrote:
It seems that the problem you have is that moving to the multithreaded runtime imposes an overhead on the communication between your two threads, when run on a *single CPU*. But performance on a single CPU is not what you're interested in - you said you wanted parallelism, and for that you need multiple CPUs, and hence multiple OS threads.
Well, I'm interested in getting an absolute speedup. If the threaded performance on a single core is slightly slower than the non-threaded performance on a single core, that would be OK provided that the threaded performance using multiple cores was better than the same non-threaded baseline.
However, it doesn't seem to work like that at all. In fact, threaded on multiple cores was _even_slower_ than threaded on a single core!
Entirely possible - unless there's any actual parallelism, running on multiple cores will probably slow things down due to thread migration.
Here are some figures:
    ghc-6.8.2 -O2    apply    MVar   strict   thr-N2   thr-N1
    silicium          7.30    7.95     7.23    15.25    14.71
    neghip            4.25    4.43     4.18     6.67     6.48
    hydrogen         11.75   10.82    10.99    13.45    12.96
    lobster           55.8    51.5     57.6     76.6     74.5
The first three columns are variations of the program using slightly different communications mechanisms, including threads/MVars with the non-threaded RTS. The final two columns are for the MVar mechanism with threaded RTS and either 1 or 2 cores. -N2 is slowest.
So you're not getting any parallelism at all, for some reason your program is sequentialised. There could be any number of reasons for this.
I suspect the underlying problem in your program is that the communication is synchronous. To get good parallelism you'll need to use asynchronous communication, otherwise even on multiple CPUs you'll see little parallelism.
I tried using Chans instead of MVars, to provide for different speeds of reader/writer, but the timings were even worse. (Add another 15-100%.)
That would seem to indicate that your program is doing a lot of communication - I'd look at trying to reduce that, by increasing the task size or whatever. However, the amount of communication is obviously not the only issue; there also seems to be some kind of dependency that sequentialises the program.

Are you sure that you're not accidentally communicating thunks, and hence doing all the computation in one of the threads? That's a common pitfall that has caught me more than once. Do you know roughly the amount of parallelism you expect - i.e. the amount of work done by each thread?
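One way to rule that out is to force each chunk on the producing side before the handoff. A sketch, assuming the chunk type has (or can be given) an NFData instance - sendChunk is just an illustrative helper, not anything in your code:

  import Control.Concurrent
  import Control.DeepSeq (NFData, deepseq)

  -- Force the chunk to normal form in the worker thread before the handoff;
  -- otherwise the MVar merely carries a thunk and the receiving thread ends
  -- up doing all the work.
  sendChunk :: NFData a => MVar a -> a -> IO ()
  sendChunk box chunk = chunk `deepseq` putMVar box chunk

(With older libraries, the rnf/NFData machinery from the Strategies module serves the same purpose.)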
When I have time to look at this again (probably in the New Year), I will try some other strategies for communication that vary in their synchronous/asynchronous chunk size, to see if I can pin things down more closely.
That would be good. At some point we hope to provide some kind of visualisation to let you see where the parallel performance bottlenecks in your program are; there are various ongoing efforts, but nothing usable as yet.

Cheers,
Simon

Using "ghc -O2" while tuning a class instance for performance, I obtained a 13% speedup by applying the transformation
instance (Ord a, Num b) ⇒ Sum PSum a b where
    empty = empty
    insert = insert
    union = union
    unions = unions
    extractMin = extractMin
    fromList = fromList
    toList = toList
    map = map
    mapMaybe = mapMaybe
and defining the instance functions outside the instance declaration, rather than inside the instance declaration.

Conceptually, I understand this as follows: after this transformation, none of the recursive calls have to go through the class dictionary.

Is this a transformation that ghc could automatically apply while optimizing? It is clear at compile time that the recursive calls are from this instance. Every 13% helps.

(My example is adapting a pairing heap to sums of terms, e.g. to define an algebra from a small category. So far, I can't beat Data.Map, but I'm not done tuning. Other examples might not show the same performance increase from this transformation.)
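To make the shape of the transformation concrete, here is a simplified single-parameter stand-in - this is not my actual PSum code, just an illustration of the pattern:

  class Heap f where
    empty  :: f a
    insert :: Ord a => a -> f a -> f a
    union  :: Ord a => f a -> f a -> f a
    unions :: Ord a => [f a] -> f a

  newtype SList a = SList [a]   -- stand-in structure, not a real pairing heap

  -- Workers defined at the top level: the calls between them below are
  -- direct calls, not overloaded calls through the Heap dictionary.
  emptySL :: SList a
  emptySL = SList []

  insertSL :: Ord a => a -> SList a -> SList a
  insertSL x = unionSL (SList [x])

  unionSL :: Ord a => SList a -> SList a -> SList a
  unionSL (SList xs) (SList ys) = SList (merge xs ys)
    where
      merge [] bs = bs
      merge as [] = as
      merge (a:as) (b:bs)
        | a <= b    = a : merge as (b:bs)
        | otherwise = b : merge (a:as) bs

  unionsSL :: Ord a => [SList a] -> SList a
  unionsSL = foldr unionSL emptySL

  instance Heap SList where
    empty  = emptySL
    insert = insertSL
    union  = unionSL
    unions = unionsSL

In the naive version, where insert's body sits directly inside the instance declaration, its call to union goes through the dictionary; here it is a direct call to unionSL, which is (I assume) where the speedup comes from.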

Which version of GHC are you using? GHC 6.10 implements automatically precisely the transformation you give below.

If the difference shows up in GHC 6.10, could you spare a moment to produce a reproducible test case, and record it in GHC's bug tracker?

Thanks

Simon

| -----Original Message-----
| From: glasgow-haskell-users-bounces@haskell.org [mailto:glasgow-haskell-users-
| bounces@haskell.org] On Behalf Of Dave Bayer
| Sent: 28 December 2008 15:29
| To: glasgow-haskell-users@haskell.org
| Subject: ghc -O2 and class dictionaries
|
| Using "ghc -O2" while tuning a class instance for performance, I
| obtained a 13% speedup by applying the transformation
|
| > instance (Ord a, Num b) ⇒ Sum PSum a b where
| >     empty = empty
| >     insert = insert
| >     union = union
| >     unions = unions
| >     extractMin = extractMin
| >     fromList = fromList
| >     toList = toList
| >     map = map
| >     mapMaybe = mapMaybe
|
| and defining the instance functions outside the instance declaration,
| rather than inside the instance declaration.
|
| Conceptually, I understand this as follows: After this transformation,
| none of the recursive calls have to go through the class dictionary.
|
| Is this a transformation that ghc could automatically apply while
| optimizing? It is clear at compile time that the recursive calls are
| from this instance. Every 13% helps.
|
| (My example is adapting a pairing heap to sums of terms, e.g. to
| define an algebra from a small category. So far, I can't beat
| Data.Map, but I'm not done tuning. Other examples might not show the
| same performance increase from this transformation.)
|
| _______________________________________________
| Glasgow-haskell-users mailing list
| Glasgow-haskell-users@haskell.org
| http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Yeah, I knew it was fairly unlikely that I was the first to think of this optimization ;-) I just reported the "run-time performance bug" as http://hackage.haskell.org/trac/ghc/ticket/2902

I am using an Intel Core 2 Duo MacBook with GHC 6.10.1 on OS X 10.5.6. For the toy example that I submitted, the difference is over a factor of 3x.

Thanks,
Dave

On Dec 29, 2008, at 6:23 AM, Simon Peyton-Jones wrote:
Which version of GHC are you using? GHC 6.10 implements automatically precisely the transformation you give below.
If the difference shows up in GHC 6.10, could you spare a moment to produce a reproducible test case, and record it in GHC's bug tracker?
Thanks
Simon
participants (13)
- Brian B
- Bulat Ziganshin
- Chris Kuklewicz
- Dave Bayer
- Duncan Coutts
- Ian Lynagh
- Jeremy Shaw
- John Goerzen
- Malcolm Wallace
- Simon Marlow
- Simon Marlow
- Simon Peyton-Jones
- Tomasz Zielonka