Python's big challenges, Haskell's big advantages?

http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Get writing that multicore, STM, web app code!
-- Don
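For anyone who wants to kick the tires, the flavour of GHC's STM fits in a few lines. A minimal sketch (the bank-account metaphor and the names are purely illustrative, not from any post in this thread):

import Control.Concurrent.STM

-- Atomically move funds between two shared balances; the transaction
-- retries from scratch if another thread commits a conflicting change.
transfer :: TVar Int -> TVar Int -> Int -> STM ()
transfer from to amount = do
  balance <- readTVar from
  check (balance >= amount)           -- block (retry) until funds are available
  writeTVar from (balance - amount)
  toBalance <- readTVar to
  writeTVar to (toBalance + amount)

main :: IO ()
main = do
  a <- newTVarIO 100
  b <- newTVarIO 0
  atomically (transfer a b 30)
  atomically (readTVar a) >>= print   -- 70
  atomically (readTVar b) >>= print   -- 30

The point of atomically is that the composed reads and writes commit as one unit, with no lock ordering to get right by hand.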

Amen. Those are the hard questions that the Python community needs to answer (I am not sure they want to answer, though). They are also part of the reasons we are switching to Haskell. Don Stewart wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Get writing that multicore, STM, web app code!
-- Don
-- Best Regards, Lionel Barret de Nazaris, Gamr7 Founder & CTO ========================= Gamr7 : Cities for Games http://www.gamr7.com

Don Stewart wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading. And scalability is not a "real" problem, if you write RESTful web applications.
Get writing that multicore, STM, web app code!
Manlio Perillo

Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
This is one of the reasons for my other question on this list, about whether you can solve all problems using multiple isolated processes with message passing. -- Bruce Eckel

Hi Bruce,
On Wed, Sep 17, 2008 at 15:03, Bruce Eckel
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
This is one of the reasons for my other question on this list, about whether you can solve all problems using multiple isolated processes with message passing.
Well, processing (the Python module) serves as a good example of how shared memory can be emulated through message passing. Of course, performance takes a hit, but since it is out there and being used by people, it should tell us that it is really a feasible solution. I guess the gist of the answers people posted to your question still remains: the answer depends on whether you consider performance as part of the "power" of each approach. Arnar

Bruce Eckel
Manlio Perillo wrote:
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
This is one of the reasons for my other question on this list, about whether you can solve all problems using multiple isolated processes with message passing.
What did you get out of that, by the way? -- _jsn

Hi Manlio and others,
On Wed, Sep 17, 2008 at 14:58, Manlio Perillo
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself. Stackless Python is an interesting implementation of the CSP+channels paradigm though. It has been quite successfully used for a few large projects.
And scalability is not a "real" problem, if you write RESTful web applications.
Of course scalability is a "real" problem, ask anyone who runs a big website. I don't see how RESTful design simply removes that problem. cheers, Arnar

Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Actually, no. Neither Python nor Ruby can utilize more than a single processor using threads. The only way to use more than one processor is with processes. -- Bruce Eckel

Hi again,
On Wed, Sep 17, 2008 at 15:13, Bruce Eckel
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Actually, no. Neither Python nor Ruby can utilize more than a single processor using threads. The only way to use more than one processor is with processes.
I wanted to make a distinction between the language and its implementation. I think you are conflating the two. If you read the Python specification there is nothing preventing you from running on two cores in parallel. The standard library does indeed have semaphores, monitors, locks etc. In fact, I'm pretty sure the Jython implementation can use multiple cores. It is just CPython that can't, as is very well known and advertised. cheers, Arnar

Both Jython and JRuby can use multicore parallelism. Which, of
course, you need desperately when running in Jython and JRuby, because
they're slow as Christmas for most tasks. In addition, Jython is not
a predictably complete version of Python and its internals are not
well documented in the least, and the documentation for what CPython
code will work in Jython and what won't is sadly lacking.
In my experience, it doesn't make it an unusable tool, but the tasks
it is suited for fall more along the lines of traditional scripting of
a large working Java application. I wouldn't want to see a large app
written in Jython or JRuby.
-- Jeff
On Wed, Sep 17, 2008 at 9:18 AM, Arnar Birgisson
Hi again,
On Wed, Sep 17, 2008 at 15:13, Bruce Eckel wrote:
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Actually, no. Neither Python nor Ruby can utilize more than a single processor using threads. The only way to use more than one processor is with processes.
I wanted to make a distinction between the language and its implementation. I think you are conflating the two.
If you read the Python specification there is nothing preventing you from running on two cores in parallel. The standard library does indeed have semaphores, monitors, locks etc. In fact, I'm pretty sure the Jython implementation can use multiple cores. It is just CPython that can't, as is very well known and advertised.
cheers, Arnar
-- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards

Jython 2.5 is very close to release and its goal is to be a very
complete implementation, including such improbable things as ctypes.
You can indeed use the underlying threads of the JVM with Jython and
JRuby, but the native Python threads are prevented from running on
more than one processor by the GIL.
On Wed, Sep 17, 2008 at 10:23 AM, Jefferson Heard
Both Jython and JRuby can use multicore parallelism. Which, of course, you need desperately when running in Jython and JRuby, because they're slow as Christmas for most tasks. In addition, Jython is not a predictably complete version of Python and its internals are not well documented in the least, and the documentation for what CPython code will work in Jython and what won't is sadly lacking.
In my experience, it doesn't make it an unusable tool, but the tasks it is suited for fall more along the lines of traditional scripting of a large working Java application. I wouldn't want to see a large app written in Jython or JRuby.
-- Jeff
On Wed, Sep 17, 2008 at 9:18 AM, Arnar Birgisson wrote:
Hi again,
On Wed, Sep 17, 2008 at 15:13, Bruce Eckel wrote:
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Actually, no. Neither Python nor Ruby can utilize more than a single processor using threads. The only way to use more than one processor is with processes.
I wanted to make a distinction between the language and its implementation. I think you are conflating the two.
If you read the Python specification there is nothing preventing you from running on two cores in parallel. The standard library does indeed have semaphores, monitors, locks etc. In fact, I'm pretty sure the Jython implementation can use multiple cores. It is just CPython that can't, as is very well known and advertised.
cheers, Arnar
-- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow.
-- Jessica Edwards
-- Bruce Eckel

Arnar Birgisson wrote:
Hi Manlio and others,
On Wed, Sep 17, 2008 at 14:58, Manlio Perillo wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround.
The real workaround, IMHO, is probably multithreading (at least the preemptive kind).
Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Stackless Python is an interesting implementation of the CSP+channels paradigm though. It has been quite successfully used for a few large projects.
There are also greenlets for cooperative multithreading (but without the scheduler and channels, so you can integrate them into your own event loop, such as Twisted's).
And scalability is not a "real" problem, if you write RESTful web applications.
Of course scalability is a "real" problem, ask anyone who runs a big website. I don't see how RESTful design simply removes that problem.
If you use asynchronous programming and multiprocessing, you *do* solve most of the problems. This is what I do in the wsgi module for Nginx.
cheers, Arnar
Manlio Perillo

On 2008-09-17, Arnar Birgisson
Hi Manlio and others,
On Wed, Sep 17, 2008 at 14:58, Manlio Perillo wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Huh. I see multi-threading as a workaround for expensive processes, which can explicitly use shared memory when that makes sense. -- Aaron Denney -><-

On Wed, 2008-09-17 at 18:40 +0000, Aaron Denney wrote:
On 2008-09-17, Arnar Birgisson wrote:
Hi Manlio and others,
On Wed, Sep 17, 2008 at 14:58, Manlio Perillo wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Huh. I see multi-threading as a workaround for expensive processes, which can explicitly use shared memory when that makes sense.
That breaks down when you want 1000s of threads. I'm not aware of any program, on any system, that spawns a new process on each event it wants to handle concurrently; systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this). Any counter-examples? jcc [1] http://swtch.com/plan9port/man/man3/thread.html

On 2008-09-17, Jonathan Cast
On Wed, 2008-09-17 at 18:40 +0000, Aaron Denney wrote:
On 2008-09-17, Arnar Birgisson wrote:
Hi Manlio and others,
On Wed, Sep 17, 2008 at 14:58, Manlio Perillo wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Huh. I see multi-threading as a workaround for expensive processes, which can explicitly use shared memory when that makes sense.
That breaks down when you want 1000s of threads.
This really misses the point I was going for. I don't want 1000s of threads. I don't want *any* threads. Processes are nice because you don't have other threads of execution stomping on your memory-space (except when explicitly invited to by arranged shared-memory areas). It's an easier model to control side-effects in. If this is too expensive, and threads in the same situation will work faster, then I might reluctantly use them instead.
I'm not aware of any program, on any system, that spawns a new process on each event it wants to handle concurrently;
inetd
systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads. In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects. The fact that people use thread-pools means that they think that even thread-creation is too expensive. The central aspect in my mind is a default share-everything, or default share-nothing. One is much easier to reason about and encourages writing systems that have less shared-memory contention. -- Aaron Denney -><-
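Concurrent Haskell can express that share-nothing default directly: let one thread own the state and have everyone else talk to it over a channel. A minimal sketch (the Msg type and the counter are illustrative, not from any post in this thread):

import Control.Concurrent

-- One thread owns the counter; nobody else can touch it except by
-- sending a message, so there is no shared-memory contention to reason about.
data Msg = Increment | Get (MVar Int)

counter :: Chan Msg -> IO ()
counter inbox = loop 0
  where
    loop n = do
      msg <- readChan inbox
      case msg of
        Increment   -> loop (n + 1)
        Get replyTo -> putMVar replyTo n >> loop n

main :: IO ()
main = do
  inbox <- newChan
  _ <- forkIO (counter inbox)
  mapM_ (const (writeChan inbox Increment)) [1 .. 10 :: Int]
  reply <- newEmptyMVar
  writeChan inbox (Get reply)
  takeMVar reply >>= print   -- 10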

On Wed, 2008-09-17 at 20:29 +0000, Aaron Denney wrote:
On 2008-09-17, Jonathan Cast wrote:
On Wed, 2008-09-17 at 18:40 +0000, Aaron Denney wrote:
On 2008-09-17, Arnar Birgisson wrote:
Hi Manlio and others,
On Wed, Sep 17, 2008 at 14:58, Manlio Perillo wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
Well, I'm a huge Python fan myself, but multiprocessing is not really a solution as much as it is a workaround. Python as a language has no problem with multithreading and multicore support and has all primitives to do conventional shared-state parallelism. However, the most popular /implementation/ of Python sacrifices this for performance; it has nothing to do with the language itself.
Huh. I see multi-threading as a workaround for expensive processes, which can explicitly use shared memory when that makes sense.
That breaks down when you want 1000s of threads.
This really misses the point I was going for. I don't want 1000s of threads. I don't want *any* threads. Processes are nice because you don't have other threads of execution stomping on your memory-space (except when explicitly invited to by arranged shared-memory areas). It's an easier model to control side-effects in. If this is too expensive, and threads in the same situation will work faster, then I might reluctantly use them instead.
I'm not aware of any program, on any system, that spawns a new process on each event it wants to handle concurrently;
inetd
OK. But inetd calls exec on each event, too, so I think it's somewhat orthogonal to this issue. (Multi-processing is great if you want to compose programs; the question is how you parallelize concurrent instances of the same program).
systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads.
You mean `is'. And they are. Would you write a web server run out of inetd? (Would its multi-processor support make it out-perform a single-threaded web server in the same language? I doubt it.) HWS, on the other hand, spawns a new Concurrent Haskell thread on every request.
In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects.
Say what? This discussion is entirely about performance --- does CPython actually have the ability to scale concurrent programs to multiple processors? The only reason you would ever want to do that is for performance.
The fact that people use thread-pools
I don't think people use thread-pools with Concurrent Haskell, or with libthread.
means that they think that even thread-creation is too expensive.
Kernel threads /are/ expensive. Which is why all the cool kids use user-space threads.
The central aspect in my mind is a default share-everything, or default share-nothing.
I really don't think you understand Concurrent Haskell, then. (Or Concurrent ML, or stackless Python, or libthread, or any other CSP-based set-up). jcc
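For concreteness, the thread-per-request shape attributed to HWS above looks roughly like this in Concurrent Haskell. A sketch against the old network package's Network module, not HWS's actual code:

import Network (listenOn, accept, PortID(PortNumber))
import Control.Concurrent (forkIO)
import Control.Monad (forever)
import System.IO (hGetLine, hPutStr, hClose)

-- Accept loop: every connection gets its own lightweight Haskell
-- thread, so a slow client never blocks the others.
main :: IO ()
main = do
  sock <- listenOn (PortNumber 8080)
  forever $ do
    (h, _host, _port) <- accept sock
    forkIO $ do
      _requestLine <- hGetLine h     -- read and ignore the request line
      hPutStr h "HTTP/1.0 200 OK\r\nContent-Length: 6\r\n\r\nhello\n"
      hClose h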

On 2008-09-17, Jonathan Cast
In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects.
Say what? This discussion is entirely about performance --- does CPython actually have the ability to scale concurrent programs to multiple processors? The only reason you would ever want to do that is for performance.
I entered the discussion as which model is a workaround for the other -- someone said processes were a workaround for the lack of good threading in e.g. standard CPython. I replied that most languages' thread support can be seen as a workaround for the poor performance of communicating processes. (Creation in particular is usually cited, but that cost can often be reduced by process pools; context-switching cost, alas, is harder.)
Kernel threads /are/ expensive. Which is why all the cool kids use user-space threads.
Often muxed on top of kernel threads, because user-threads can't use multiple CPUs at once.
The central aspect in my mind is a default share-everything, or default share-nothing.
I really don't think you understand Concurrent Haskell, then. (Or Concurrent ML, or stackless Python, or libthread, or any other CSP-based set-up).
Or Erlang, Occam, or heck, even JCSP. Because I'm coming at this from a slightly different perspective and place a different emphasis on things, you think I don't understand? No, trust me, I do understand them[1], and think CSP and actor models (the difference in nondeterminism is a minor detail that doesn't much matter here) are extremely nice ways of implementing parallel systems. These are, in fact, process models. They are implemented on top of thread models, but that's a performance hack. And while putting this model on top restores much of the programming sanity, in languages with mutable variables and references that can be passed, you still need a fair bit of discipline to keep that sanity. There, the implementation detail of thread, rather than process, allows and even encourages shortcuts that violate the process model. In languages that are immutable, taking advantage of the shared memory space really can gain efficiency without any noticeable downside. [1] I used to work for a company designing and modeling CSP-based hardware designs. In my spare time I started writing a compiler from our HDL to Concurrent Haskell, but abandoned it when I left for grad school. -- Aaron Denney -><-

Hi Aaron,
On Wed, Sep 17, 2008 at 23:20, Aaron Denney
I entered the discussion as which model is a workaround for the other -- someone said processes were a workaround for the lack of good threading in e.g. standard CPython. I replied that most languages' thread support can be seen as a workaround for the poor performance of communicating processes. (Creation in particular is usually cited, but that cost can often be reduced by process pools; context-switching cost, alas, is harder.)
That someone was probably me, but this is not what I meant. I meant that the "processing" [1] Python module is a workaround for CPython's performance problems with threads. For those who don't know it, the processing module exposes a nearly identical interface to the standard threading module in Python, but runs each "thread" in a separate OS process. The processing module emulates shared memory between these "threads" as well as locking primitives and blocking. That is what I meant when I said "processing" (the module) was a workaround for CPython's threading issues. [1] http://www.python.org/dev/peps/pep-0371/
The processes vs. threads question depends on definitions. There seem to be two sets floating around here. One is that processes and threads are essentially the same, the only difference being that processes don't share memory but threads do. With this view it doesn't really matter if "processes" are implemented as proper OS processes or OS threads. Discussion based on this definition can be interesting, and one model fits some problems better than the other and vice versa. The other one is the systems view of OS processes vs. OS threads. Discussion about the difference between these two is only mildly interesting imo, as I think most people agree on things here and they are well covered in textbooks that are old as dirt.
The central aspect in my mind is a default share-everything, or default share-nothing.
[..snip...] These are, in fact, process models. They are implemented on top of thread models, but that's a performance hack. And while putting this model on top restores much of the programming sanity, in languages with mutable variables and references that can be passed, you still need a fair bit of discipline to keep that sanity. There, the implementation detail of thread, rather than process allows and even encourages shortcuts that violate the process model.
Well, this is a viewpoint I don't totally agree with. Correct me if I'm not understanding you, but you seem to be making the point that OS processes are often preferred because with threads, you *can* get yourself in trouble by using shared memory. The thing I don't agree with is "let's use A because B has dangerous features". This is sort of like the design mantra of languages like Java. Now, you may say that indeed Java has been wildly successful, but I think (or hope) that is because we don't give people (programmers) enough credit. Literature, culture and training in the current practice of programming could do well, IMO, to aim for fewer _good_ programmers rather than a lot of mediocre ones. And _good_ programmers don't need to be handcuffed just because otherwise they *could* poke themselves in the eye. I.e. if you need to sacrifice the efficiency of threads for full-blown OS processes because people can't stay away from shared memory, then something is fundamentally wrong. I'll stop here, this is starting to sound like a very OT rant. cheers, Arnar

On Wed, 2008-09-17 at 23:42 +0200, Arnar Birgisson wrote:
On Wed, Sep 17, 2008 at 23:20, Aaron Denney wrote:
The central aspect in my mind is a default share-everything, or default share-nothing.
[..snip...] These are, in fact, process models. They are implemented on top of thread models, but that's a performance hack. And while putting this model on top restores much of the programming sanity, in languages with mutable variables and references that can be passed, you still need a fair bit of discipline to keep that sanity. There, the implementation detail of thread, rather than process allows and even encourages shortcuts that violate the process model.
Well, this is a viewpoint I don't totally agree with. Correct me if I'm not understanding you, but you seem to be making the point that OS processes are often preferred because with threads, you *can* get yourself in trouble by using shared memory.
The thing I don't agree with is "let's use A because B has dangerous features". This is sort of like the design mantra of languages like Java.
Or Haskell. jcc

On 2008-09-17, Arnar Birgisson
Hi Aaron,
On Wed, Sep 17, 2008 at 23:20, Aaron Denney wrote:
I entered the discussion as which model is a workaround for the other -- someone said processes were a workaround for the lack of good threading in e.g. standard CPython. I replied that most languages' thread support can be seen as a workaround for the poor performance of communicating processes. (Creation in particular is usually cited, but that cost can often be reduced by process pools; context-switching cost, alas, is harder.)
That someone was probably me, but this is not what I meant. I meant that the "processing" [1] Python module is a workaround for CPython's performance problems with threads.
Ah, on rereading that's much clearer. Thank you for the clarification.
The processes vs. threads question depends on definitions. There seem to be two sets floating around here. One is that processes and threads are essentially the same, the only difference being that processes don't share memory but threads do. With this view it doesn't really matter if "processes" are implemented as proper OS processes or OS threads. Discussion based on this definition can be interesting, and one model fits some problems better than the other and vice versa.
The other one is the systems view of OS processes vs. OS threads. Discussion about the difference between these two is only mildly interesting imo, as I think most people agree on things here and they are well covered in textbooks that are old as dirt.
I think from the OS point of view, these two distinctions are nearly equivalent. There is only a difference when you start talking about non-OS threads, such as those provided by various language runtimes.
There, the implementation detail of thread, rather than process allows and even encourages shortcuts that violate the process model.
Well, this is a viewpoint I don't totally agree with. Correct me if I'm not understanding you, but you seem to be making the point that OS processes are often preferred because with threads, you *can* get yourself in trouble by using shared memory.
That's exactly it. Or rather, you can get in exactly as much trouble with processes, but because accessing a variable in another memory space is more cumbersome, you have to actually think when you do so. Looking at all uses of "a = b" to see what invariants might be broken is unfeasible. Looking at all requests for updating a remote variable might be manageable. -- Aaron Denney -><-

On Wed, 2008-09-17 at 21:20 +0000, Aaron Denney wrote:
On 2008-09-17, Jonathan Cast wrote:
In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects.
Say what? This discussion is entirely about performance --- does CPython actually have the ability to scale concurrent programs to multiple processors? The only reason you would ever want to do that is for performance.
I entered the discussion as which model is a workaround for the other --
Well, I thought the discussion was about implementations, not models. I also assumed remarks would be made in the context of the entire thread. I shall have to remember that in the future.
someone said processes were a workaround for the lack of good threading in e.g. standard CPython.
I replied that most languages thread support
Using a definition of `thread' which, apparently, excludes Concurrent Haskell.
can be seen as a workaround for the poor performance of communicating processes.
Meaning kernel-switched processes.
(Creation in particular is usually cited, but that cost can often be reduced by process pools; context-switching cost, alas, is harder.)
Kernel threads /are/ expensive. Which is why all the cool kids use user-space threads.
Often muxed on top of kernel threads, because user-threads can't use multiple CPUs at once.
Well, a single kernel thread can't use multiple CPUs at once. (So you need more than one).
The central aspect in my mind is a default share-everything, or default share-nothing.
I really don't think you understand Concurrent Haskell, then. (Or Concurrent ML, or stackless Python, or libthread, or any other CSP-based set-up).
Or Erlang, Occam, or heck, even jcsp. Because I'm coming at this from a slightly different perspective
Different enough we're talking past each other. The idea that the thing you make with forkIO doesn't count as a thread never crossed my mind, sorry.
and place a different emphasis on things
and use completely different definitions for key terms and make statements which, substituting in the definitions I was using, are (as I hope you grant) non-sensical
you think I don't understand?
Not any more. I just think your definition of `thread' is unexpected in this context (without rather more elaboration).
No, trust me, I do understand them[1], and think CSP and actor models (the differences in nondeterminism is a minor detail that doesn't much matter here) are extremely nice ways of implementing parallel systems.
I'm glad to hear that...
These are, in fact, process models.
OK. I think that perspective is rather unique, but OK.
They are implemented on top of thread models, but that's a performance hack.
Maybe. It's done for performance, but I don't see why you call it a hack. Does it sacrifice some important advantage I'm missing? (Vs. kernel-scheduled threads).
And while putting this model on top restores much of the programming sanity, in languages with mutable variables and references that can be passed, you still need a fair bit of discipline to keep that sanity. There, the implementation detail of thread, rather than process allows and even encourages shortcuts that violate the process model. In languages that are immutable, taking advantage of the shared memory space really can gain efficiency without any noticeably downside.
Nice clarification.[1] Thanks. jcc [1] I am, btw., painfully aware that Haskell has mutable references that can be passed between threads. Just as I am painfully aware of Unix's, um, interesting ideas on maintaining file system consistency in the presence of concurrent access to *that* shared resource...

On 2008-09-17, Jonathan Cast
On Wed, 2008-09-17 at 21:20 +0000, Aaron Denney wrote:
On 2008-09-17, Jonathan Cast wrote:
In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects.
Say what? This discussion is entirely about performance --- does CPython actually have the ability to scale concurrent programs to multiple processors? The only reason you would ever want to do that is for performance.
I entered the discussion as which model is a workaround for the other --
Well, I thought the discussion was about implementations, not models. I also assumed remarks would be made in the context of the entire thread. I shall have to remember that in the future.
someone said processes were a workaround for the lack of good threading in e.g. standard CPython.
I replied that most languages thread support
Using a definition of `thread' which, apparently, excludes Concurrent Haskell.
Can't I exclude it based on "most languages'"? CSP models are still the minority.
Different enough we're talking past each other. The idea that the thing you make with forkIO doesn't count as a thread never crossed my mind, sorry.
I think it's fair to consider it a thread interface, because there's still a huge amount of shared state. Mostly immutable, but not completely as you later point out, even discounting updates of lazy-evaluation thunks. It is a lot less pure CSP than Erlang and Occam, which both call them processes (though I see "thread" being used more and more these days for Erlang). Then there's apparently a tradition in mainstream languages of calling language-level parallelism "threads". Of course most are thread models.
and use completely different definitions for key terms and make statements which, substituting in the definitions I was using, are (as I hope you grant) non-sensical
Yes, I can see how my rants sounded bizarre, even though I think we're mostly in agreement.
Not any more. I just think your definition of `thread' is unexpected in this context (without rather more elaboration).
These are, in fact, process models.
OK. I think that perspective is rather unique, but OK.
Well, what does the P in CSP stand for?
They are implemented on top of thread models, but that's a performance hack.
Maybe. It's done for performance, but I don't see why you call it a hack. Does it sacrifice some important advantage I'm missing? (Vs. kernel-scheduled threads).
Vs kernel threads, not much -- just parallelism on SMP systems, which is often regained by muxing on top of kernel threads. Vs kernel processes, yes, I think some is lost. Privilege separation, isolation in the event of crashes, larger memory spaces, the ability to span multiple machines (necessary for true fault tolerance). How important are these vs raw speed? Well, it depends on the domain and problem. Take postfix for instance -- different parts of postfix are implemented in different processes, with different OS privileges. Subverting one doesn't give you carte blanche with the others, as it would if these were all threads in one process. -- Aaron Denney -><-

Aaron Denney wrote:
On 2008-09-17, Jonathan Cast wrote:
In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects.
Say what? This discussion is entirely about performance --- does CPython actually have the ability to scale concurrent programs to multiple processors? The only reason you would ever want to do that is for performance.
I entered the discussion as which model is a workaround for the other -- someone said processes were a workaround for the lack of good threading in e.g. standard CPython. I replied that most languages' thread support can be seen as a workaround for the poor performance of communicating processes. (Creation in particular is usually cited, but that cost can often be reduced by process pools; context-switching cost, alas, is harder.)
Kernel threads /are/ expensive. Which is why all the cool kids use user-space threads.
You must love Coyotos, then (http://www.coyotos.org/), which (IIRC) allows just that (via so-called 'scheduler activations'; see http://www.cs.washington.edu/homes/bershad/Papers/p53-anderson.pdf). Cheers Ben

jonathanccast:
The fact that people use thread-pools
I don't think people use thread-pools with Concurrent Haskell, or with libthread.
Sure. A Chan with N worker forkIO threads taking jobs from a queue is a useful idiom I've employed on occasion. -- Don
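Spelled out, the idiom is only a few lines. A sketch (the worker count and the squaring job are arbitrary, just to make it self-contained):

import Control.Concurrent (forkIO)
import Control.Concurrent.Chan
import Control.Monad (forever, replicateM, replicateM_)

-- N forkIO workers share one job Chan and push answers to a result Chan.
main :: IO ()
main = do
  jobs    <- newChan
  results <- newChan
  replicateM_ 4 $ forkIO $ forever $ do   -- 4 workers, chosen arbitrarily
    n <- readChan jobs                    -- blocks until a job arrives
    writeChan results (n * n)
  mapM_ (writeChan jobs) [1 .. 20 :: Int]
  answers <- replicateM 20 (readChan results)
  print (sum answers)

Because readChan blocks, idle workers cost nothing but their stacks.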

systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads. In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects. The fact that people use thread-pools means that they think that even thread-creation is too expensive. The central aspect in my mind is a default share-everything, or default share-nothing. One is much easier to reason about and encourages writing systems that have less shared-memory contention.
This is similar to the Plan 9 conception of processes. You have a generic rfork() call that takes flags that say what to share with your parent: namespace, environment, heap, etc. Thus the only difference between a thread and a process is different flags to rfork(). Under the covers, I believe Linux is similar, with its clone() call. The fast context switching part seems orthogonal to me. Why is it that getting the OS involved for context switches kills the performance? Is it that the GHC RTS can switch faster because it knows more about the code it's running (i.e. the OS obviously couldn't switch on memory allocations like that)? Or is jumping up to kernel space somehow expensive by nature? And why does the OS need so many more K to keep track of a thread than the RTS? I don't really know much about either OSes or language runtimes so this is interesting to me.

On 2008 Sep 17, at 16:44, Evan Laforge wrote:
The fast context switching part seems orthogonal to me. Why is it that getting the OS involved for context switches kills the performance? Is it that the ghc RTS can switch faster because it knows more about the code it's running (i.e. the OS obviously couldn't switch on memory allocations like that)? Or is jumping up to kernel space somehow expensive by nature? And why does the OS need so many
A context switch involving the OS is actually a double (at least) context switch: one to switch to kernel context, another to switch back to user context, and (because kernel context switches are scheduler entry points) optionally switches to other processes which have a higher immediate priority. These context switches also switch considerably more state than a user-mode context switch between green threads, which doesn't switch the full process context including the set of process page tables, processor access controls, etc. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

On Wed, 2008-09-17 at 13:44 -0700, Evan Laforge wrote:
systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads. In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects. The fact that people use thread-pools means that they think that even thread-creation is too expensive. The central aspect in my mind is a default share-everything, or default share-nothing. One is much easier to reason about and encourages writing systems that have less shared-memory contention.
This is similar to the plan9 conception of processes. You have a generic rfork() call that takes flags that say what to share with your parent: namespace, environment, heap, etc. Thus the only difference between a thread and a process is different flags to rfork().
As I mentioned, Plan 9 also has a user-space thread library, similar to Concurrent Haskell.
Under the covers, I believe linux is similar, with its clone() call.
The fast context switching part seems orthogonal to me. Why is it that getting the OS involved for context switches kills the performance?
Read about CPU architecture.
Is it that the ghc RTS can switch faster because it knows more about the code it's running (i.e. the OS obviously couldn't switch on memory allocations like that)? Or is jumping up to kernel space somehow expensive by nature?
Yes. Kernel code is very different on the bare metal from userspace code; RTS code of course is not at all different. Switching processes in the kernel requires an interrupt or a system call. Both of those require the processor to dump the running process's state so it can be restored later (userspace thread-switching does the same thing, but it doesn't dump as much state because it doesn't need to be as conservative about what it saves).
And why does the OS need so many more K to keep track of a thread than the RTS?
An OS thread (Linux/Plan 9) stores:
* Stack (definitely a stack pointer and stored registers (> 40 bytes on i686), and includes a special set of page tables on Plan 9)
* FD set (even if it's the same as the parent thread, you need to keep a pointer to it)
* uid/euid/gid/egid (Plan 9 I think omits euid and egid)
* Namespace (Plan 9 only; again, you need at least a pointer even if it's the same as the parent process)
* Priority
* Possibly other things I can't think of right now
A Concurrent Haskell thread stores:
* Stack
* Allocation area (4KB)
The kernel offers more to a process (and offers a wider separation between processes) than Concurrent Haskell offers to a thread. jcc

Jonathan Cast wrote:
On Wed, 2008-09-17 at 13:44 -0700, Evan Laforge wrote:
systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads. In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects. The fact that people use thread-pools means that they think that even thread-creation is too expensive. The central aspect in my mind is a default share-everything, or default share-nothing. One is much easier to reason about and encourages writing systems that have less shared-memory contention.
This is similar to the Plan 9 conception of processes. You have a generic rfork() call that takes flags that say what to share with your parent: namespace, environment, heap, etc. Thus the only difference between a thread and a process is different flags to rfork().
As I mentioned, Plan 9 also has a user-space thread library, similar to Concurrent Haskell.
Under the covers, I believe linux is similar, with its clone() call.
The fast context switching part seems orthogonal to me. Why is it that getting the OS involved for context switches kills the performance?
Read about CPU architecture.
Is it that the ghc RTS can switch faster because it knows more about the code it's running (i.e. the OS obviously couldn't switch on memory allocations like that)? Or is jumping up to kernel space somehow expensive by nature?
Yes. Kernel code is very different on the bare metal from userspace code; RTS code of course is not at all different. Switching processes in the kernel requires an interrupt or a system call. Both of those require the processor to dump the running process's state so it can be restored later (userspace thread-switching does the same thing, but it doesn't dump as much state because it doesn't need to be as conservative about what it saves).
And why does the OS need so many more K to keep track of a thread than the RTS?
An OS thread (Linux/Plan 9) stores:
* Stack (definitely a stack pointer and stored registers (> 40 bytes on i686), and includes a special set of page tables on Plan 9)
* FD set (even if it's the same as the parent thread, you need to keep a pointer to it)
* uid/euid/gid/egid (Plan 9 I think omits euid and egid)
* Namespace (Plan 9 only; again, you need at least a pointer even if it's the same as the parent process)
* Priority
* Possibly other things I can't think of right now
A Concurrent Haskell thread stores:
* Stack
* Allocation area (4KB)
Allocation areas are per-CPU, not per-thread. A Concurrent Haskell thread consists of a TSO (thread state object, currently 11 machine words), and a stack, which we currently start with 1KB and grow on demand. Cheers, Simon
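Those numbers are easy to feel in practice: spawning a hundred thousand forkIO threads is routine. A toy sketch (mine, not Simon's; compile with -threaded and run with +RTS -s to watch the memory, if curious):

import Control.Concurrent
import Control.Monad (replicateM_)

-- Spawn 100,000 threads; the last one to finish signals main.
main :: IO ()
main = do
  let n = 100000
  count    <- newMVar (0 :: Int)
  finished <- newEmptyMVar
  replicateM_ n $ forkIO $
    modifyMVar_ count $ \k -> do
      let k' = k + 1
      if k' == n then putMVar finished () else return ()
      return k'
  takeMVar finished
  putStrLn (show n ++ " threads came and went")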

On Thu, 2008-09-18 at 10:33 +0100, Simon Marlow wrote:
Jonathan Cast wrote:
An OS thread (Linux/Plan 9) stores:
* Stack (definitely a stack pointer and stored registers (> 40 bytes on i686), and includes a special set of page tables on Plan 9)
* FD set (even if it's the same as the parent thread, you need to keep a pointer to it)
* uid/euid/gid/egid (Plan 9 I think omits euid and egid)
* Namespace (Plan 9 only; again, you need at least a pointer even if it's the same as the parent process)
* Priority
* Possibly other things I can't think of right now
A Concurrent Haskell thread stores:
* Stack
* Allocation area (4KB)
Allocation areas are per-CPU, not per-thread.
Didn't know/didn't think through that. Thanks! jcc

Simon Marlow wrote:
Jonathan Cast wrote:
On Wed, 2008-09-17 at 13:44 -0700, Evan Laforge wrote:
systems that don't use an existing user-space thread library (such as Concurrent Haskell or libthread [1]) emulate user-space threads by keeping a pool of processes and re-using them (e.g., IIUC Apache does this).
Your response seems to be yet another argument that processes are too expensive to be used the same way as threads. In my mind pooling vs new-creation is only relevant to process vs thread in the performance aspects. The fact that people use thread-pools means that they think that even thread-creation is too expensive. The central aspect in my mind is a default share-everything, or default share-nothing. One is much easier to reason about and encourages writing systems that have less shared-memory contention.
This is similar to the Plan 9 conception of processes. You have a generic rfork() call that takes flags that say what to share with your parent: namespace, environment, heap, etc. Thus the only difference between a thread and a process is different flags to rfork().
As I mentioned, Plan 9 also has a user-space thread library, similar to Concurrent Haskell.
Under the covers, I believe linux is similar, with its clone() call.
The fast context switching part seems orthogonal to me. Why is it that getting the OS involved for context switches kills the performance?
Read about CPU architecture.
Is it that the ghc RTS can switch faster because it knows more about the code it's running (i.e. the OS obviously couldn't switch on memory allocations like that)? Or is jumping up to kernel space somehow expensive by nature?
Yes. Kernel code is very different on the bare metal from userspace code; RTS code of course is not at all different. Switching processes in the kernel requires an interrupt or a system call. Both of those require the processor to dump the running process's state so it can be restored later (userspace thread-switching does the same thing, but it doesn't dump as much state because it doesn't need to be as conservative about what it saves).
And why does the OS need so many more K to keep track of a thread than the RTS?
An OS thread (Linux/Plan 9) stores:
* Stack (definitely a stack pointer and stored registers (> 40 bytes on i686), and includes a special set of page tables on Plan 9)
* FD set (even if it's the same as the parent thread, you need to keep a pointer to it)
* uid/euid/gid/egid (Plan 9 I think omits euid and egid)
* Namespace (Plan 9 only; again, you need at least a pointer even if it's the same as the parent process)
* Priority
* Possibly other things I can't think of right now
A Concurrent Haskell thread stores:
* Stack
* Allocation area (4KB)
Allocation areas are per-CPU, not per-thread. A Concurrent Haskell thread consists of a TSO (thread state object, currently 11 machine words), and a stack, which we currently start with 1KB and grow on demand.
How is this implemented? I have seen some coroutine implementations in C, using functions from ucontext.h (or direct asm code), but all have the problem that the allocated stack is fixed. Thanks Manlio Perillo

On Sep 18, 2008, at 15:10 , Manlio Perillo wrote:
Allocation areas are per-CPU, not per-thread. A Concurrent Haskell thread consists of a TSO (thread state object, currently 11 machine words), and a stack, which we currently start with 1KB and grow on demand.
How is this implemented?
I have seen some coroutine implementations in C, using functions from ucontext.h (or direct asm code), but all have the problem that the allocated stack is fixed.
That's because it's much easier to use a fixed stack. There are two ways to handle a growable stack; both start with allocating each stack in a separate part of the address space with room to grow it downward. The simpler way uses stack probes on function entry to detect impending stack overflow. The harder (and less portable) one involves trapping page faults ("segmentation violation" on POSIX), enlarging the stack, and restarting the instruction that caused the trap; this requires fairly detailed knowledge of the CPU and the way signals or page faults are handled by the OS. (There's also a hybrid which many POSIXish systems use, trapping the page fault specifically when running the stack probe; the probe is designed to be safe to either restart or ignore, so it can be handled more portably.) -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Brandon S. Allbery KF8NH wrote:
On Sep 18, 2008, at 15:10 , Manlio Perillo wrote:
Allocation areas are per-CPU, not per-thread. A Concurrent Haskell thread consists of a TSO (thread state object, currently 11 machine words), and a stack, which we currently start with 1KB and grow on demand.
How is this implemented?
I have seen some coroutine implementations in C, using functions from ucontext.h (or direct asm code), but all have the problem that the allocated stack is fixed.
That's because it's much easier to use a fixed stack.
There are two ways to handle a growable stack; both start with allocating each stack in a separate part of the address space with room to grow it downward. The simpler way uses stack probes on function entry to detect impending stack overflow. The harder (and less portable) one involves trapping page faults ("segmentation violation" on POSIX), enlarging the stack, and restarting the instruction that caused the trap; this requires fairly detailed knowledge of the CPU and the way signals or page faults are handled by the OS. (There's also a hybrid which many POSIXish systems use, trapping the page fault specifically when running the stack probe; the probe is designed to be safe to either restart or ignore, so it can be handled more portably.)
What implementation is used in GHC? Is it easier to implement with a pure functional language like Haskell, or can the same implementation be used with a procedural language like C? Thanks Manlio

On 2008 Sep 19, at 17:14, Manlio Perillo wrote:
Brandon S. Allbery KF8NH wrote:
There are two ways to handle a growable stack; both start with allocating each stack in a separate part of the address space with room to grow it downward. The simpler way uses stack probes on function entry to detect impending stack overflow. The harder (and less portable) one involves trapping page faults ("segmentation violation" on POSIX), enlarging the stack, and restarting the instruction that caused the trap; this requires fairly detailed knowledge of the CPU and the way signals or page faults are handled by the OS. (There's also a hybrid which many POSIXish systems use, trapping the page fault specifically when running the stack probe; the probe is designed to be safe to either restart or ignore, so it can be handled more portably.)
What implementation is used in GHC?
I haven't looked at the GHC implementation.
Is this more easy to implement with a pure functional language like Haskell, or the same implementation can be used with a procedural language like C?
You can use it with pretty much any language, as long as you can limit the size of stack frames. (If a stack frame is larger than the stack probe distance you might just get an unhandled page fault.) The main question is whether you want to absorb the additional complexity; it's a bit harder to debug memory issues when you have to deal with page faults yourself. (A *real* segmentation violation might be hidden by the stack grow code.) -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allbery@kf8nh.com system administrator [openafs,heimdal,too many hats] allbery@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH

Jonathan Cast wrote:
[...]
Huh. I see multi-threading as a workaround for expensive processes, which can explicitly use shared memory when that makes sense.
That breaks down when you want 1000s of threads. I'm not aware of any program, on any system, that spawns a new process on each event it wants to handle concurrently;
thttpd spawns a new process for every CGI request or directory listing. Many "old" servers don't use a prefork model. Another interesting example of bad thread usage is Azureus (a BitTorrent client written in Java): it creates about 70 threads for who knows what reasons (and it also allocates about one half of available virtual memory).
[...]
Manlio Perillo

Multiprocessing is hardly a solution... I realize the Python interpreter's fairly lightweight on its own, but consider the weight of a full Unix process plus the weight of the Python interpreter in terms of memory, context-switching times, and finally the clunkiness of the fork() model (which is HOW many years old now?). They need a model programmers are familiar with, e.g. threads-allocate-to-cores a la Java or C, or they need a model that is entirely new or is based on source-code annotation (like Strategies and Control.Parallel).
On Wed, Sep 17, 2008 at 8:58 AM, Manlio Perillo
Don Stewart wrote:
http://www.heise-online.co.uk/open/Shuttleworth-Python-needs-to-focus-on-fut...
"cloud computing, transactional memory and future multicore processors"
Multicore support is already "supported" in Python, if you use multiprocessing, instead of multithreading.
And scalability is not a "real" problem, if you write RESTful web applications.
Get writing that multicore, STM, web app code!
Manlio Perillo
-- I try to take things like a crow; war and chaos don't always ruin a picnic, they just mean you have to be careful what you swallow. -- Jessica Edwards

Jefferson Heard wrote:
Multiprocessing is hardly a solution... I realize the Python interpreter's fairly lightweight on its own, but the weight of a full unix process plus the weight of the python interpreter in terms of memory,
With copy-on-write, some memory can be saved (if you preload all the required modules in the master process).
context switching times,
That's probably the same as thread switching time. And if you use asynchronous programming in each of the worker processes, you can keep the number of required processes to a minimum.
and finally the clunkiness of the fork() model (which is HOW many years old now?).
Old does not mean bad, IMHO.
[...]
Manlio Perillo

On Wed, 17 Sep 2008, Manlio Perillo wrote:
Jefferson Heard wrote:
the weight of a full unix process plus the weight of the python interpreter in terms of memory,
With copy on write some memory can be saved (if you preload all the required modules in the master process).
The kernel data structures and writable pages supporting an OS-level process add up to dozens of KB whereas a language-level concurrent context should require about 1KB.
context switching times,
That's probably the same as thread switching time.
Competent language-level concurrency support (as in Haskell and Erlang)
makes a context switch about as expensive as a function call, thousands of
times faster than an OS-level process switch.
Tony.
--
f.anthony.n.finch

Tony Finch wrote:
[...]
context switching times,
That's probably the same as thread switching time.
Competent language-level concurrency support (as in Haskell and Erlang) makes a context switch about as expensive as a function call, thousands of times faster than an OS-level process switch.
I know. But when using the term "thread" one usually assumes a kernel thread. Of course, if we talk about user threads it's a whole new story.
Tony.
Manlio Perillo

jason.dusek:
What does Haskell have to say about cloud computing?
I'm not sure cloud computing is well-enough defined to say anything yet. "paradigm in which information is permanently stored in servers on the Internet and cached temporarily on clients that include desktops, entertainment centers, table computers, notebooks, wall computers, handhelds, etc." So we're talking about JSON, online db services like Amazon bindings, HAppS nodes, et al. For which Haskell's perfectly able. Now, maybe there are some nice abstractions waiting to be found, though... -- Don

Don Stewart
jason.dusek:
What does Haskell have to say about cloud computing?
I'm not sure cloud computing is well-enough defined to say anything yet.
That is fair -- having something to say about cloud computing is essentially having a grand vision. I only ask because it was touched on in the original message.
...we're talking about JSON, online db services like Amazon bindings, HAppS nodes, et al. For which Haskell's perfectly able.
Do HAppS nodes really function as nodes in a larger system? Does HAppS function as a "cluster application server"?
Now, maybe there's some nice abstractions waiting to be found though...
Conventionally, it is argued that the abstraction of choice is message passing; but that isn't going to take you anywhere near having a web page that people can see twice without some more abstraction. I would like to say that distributed version control is that abstraction -- that branches with a main trunk are a model for resources that is compatible with dirty-write as well as consistent read. However, as systems become more desirable from a maintenance point of view -- self-healing, easily expandable, fault tolerant -- it becomes ever more difficult to get the transactionality you need to have a main trunk. -- _jsn

jason.dusek:
What does Haskell have to say about cloud computing?
If by 'cloud computing' you wish to discuss mapReduce then: http://www.cs.vu.nl/~ralf/MapReduce/paper.pdf Map reduce in Haskell, enjoy! Tom
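For a taste of the shape before reading the paper, the sequential core of the idea fits in a dozen lines. A toy sketch, not the paper's formulation:

import qualified Data.Map as Map
import Data.Map (Map)

-- The mapper emits key/value pairs; pairs are grouped by key; the
-- reducer folds each group. Parallelism would slot into the two stages.
mapReduce :: Ord k => (a -> [(k, v)]) -> (k -> [v] -> r) -> [a] -> Map k r
mapReduce mapper reducer =
    Map.mapWithKey reducer
  . Map.fromListWith (flip (++))
  . map (\(k, v) -> (k, [v]))
  . concatMap mapper

-- The canonical example: word counting.
wordCount :: [String] -> Map String Int
wordCount = mapReduce (\doc -> [(w, 1 :: Int) | w <- words doc])
                      (\_word counts -> sum counts)

main :: IO ()
main = print (wordCount ["hello world", "hello again"])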
participants (15)
- Aaron Denney
- Arnar Birgisson
- Ben Franksen
- Brandon S. Allbery KF8NH
- Bruce Eckel
- Don Stewart
- Evan Laforge
- Jason Dusek
- Jefferson Heard
- Jonathan Cast
- Lionel Barret De Nazaris
- Manlio Perillo
- Simon Marlow
- Thomas M. DuBuisson
- Tony Finch