
#447: do parallel builds
---------------------------------+------------------------------------------
  Reporter:  duncan              |        Owner:
      Type:  enhancement         |       Status:  new
  Priority:  normal              |    Milestone:
 Component:  cabal-install tool  |      Version:
  Severity:  normal              |     Keywords:
Difficulty:  normal              |   Ghcversion:  6.8.3
  Platform:                      |
---------------------------------+------------------------------------------
 The latest version of the gentoo portage tool is rather slick. It can do
 parallel builds and it displays a nice summary on the command line, eg:

 {{{
 # emerge -uD system -j --load-average=4.5
 Calculating dependencies... done!
 Verifying ebuild manifests
 Starting parallel fetch
 Emerging (1 of 14) dev-libs/expat-2.0.1-r1
 Emerging (2 of 14) sys-devel/autoconf-wrapper-6
 Emerging (3 of 14) sys-kernel/linux-headers-2.6.27-r2
 Installing sys-devel/autoconf-wrapper-6
 Jobs: 0 of 14 complete, 1 running    Load avg: 2.99, 1.59, 0.67
 }}}

 Note how they solve the problem of how to display what is going on when
 there are multiple builds happening. The answer is not to display it at
 all! This would have to go hand-in-hand with logging all builds so that
 we can still diagnose failures.

 Note the final line, which gets updated to display the current number of
 jobs running, the number completed, etc. It also shows the load average.

 The job scheduler has two parameters: one is a maximum number of jobs (or
 unlimited) and the other is a load average. It will only launch new jobs
 if the load average is less than the given maximum. That allows it to
 interact reasonably well with builds that use `make -j` internally. In
 the example above I set the load average to be just slightly more than
 the number of CPUs I've got.

 It looks to me like it serialises some bits, like installing, since
 saturating the disk with multiple parallel installs is generally of no
 benefit; indeed it can be slower. Also, downloads seem to be serialised,
 again because there is probably little benefit to making multiple
 connections to the same server.

 Anyway, the point is, cabal-install ought to be able to do all this. Some
 bits we can do now. We already have a graph representation of the install
 plan and we recalculate when a package fails to install. We will need an
 improved download api, probably involving sending requests off to a
 dedicated download thread (which would serialise them).

--
Ticket URL: <http://hackage.haskell.org/trac/hackage/ticket/447>
Hackage <http://haskell.org/cabal/>
Hackage: Cabal and related projects
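The scheduling policy described in the ticket (a job-count cap, possibly unlimited, plus a load-average ceiling) can be sketched as a pure decision function. This is only an illustrative sketch, not cabal-install code: the `Limits` type and `mayLaunch` name are made up for the example.

```haskell
-- Hypothetical scheduler policy, sketched for illustration (these names
-- are not from cabal-install). Two parameters, as in the ticket: a
-- maximum job count (Nothing = unlimited) and a load-average ceiling.
data Limits = Limits
  { maxJobs :: Maybe Int  -- ^ Nothing means unlimited jobs
  , maxLoad :: Double     -- ^ only launch while load average is below this
  }

-- | Decide whether the scheduler may launch one more job, given the
-- number of jobs currently running and the current load average.
mayLaunch :: Limits -> Int -> Double -> Bool
mayLaunch limits running load =
     load < maxLoad limits
  && maybe True (running <) (maxJobs limits)
```

A real scheduler would re-check the load average between launches, which is what lets this policy coexist with packages that run `make -j` internally: their load shows up in the average and suppresses further launches.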

#447: build multiple packages in parallel
---------------------------------+------------------------------------------
  Reporter:  duncan              |        Owner:
      Type:  enhancement         |       Status:  new
  Priority:  normal              |    Milestone:
 Component:  cabal-install tool  |      Version:
  Severity:  normal              |   Resolution:
  Keywords:                      |   Difficulty:  normal
Ghcversion:  6.8.3               |     Platform:
---------------------------------+------------------------------------------
Changes (by duncan):

  * summary:  do parallel builds => build multiple packages in parallel

--
Ticket URL: <http://hackage.haskell.org/trac/hackage/ticket/447#comment:1>

On 2009-01-10 16:01 -0000 (Sat), duncan wrote:
> Also downloads seem to be serialised, again because there is probably
> little benefit to making multiple connections to the same server.

You evidently don't live in Japan. :-)
On high-bandwidth, high-latency links (e.g., Tokyo to a server in
NYC, which has well over 200 ms latency just due to the distance) TCP
doesn't work so well unless both sides are very well tuned for high
bandwidth-latency products. I'm generally on a 100 Mbps fibre connection
in Tokyo (it's the standard home or small office connection here), and
parallel downloads are a very happy thing; in some circumstances I can
run a dozen downloads in parallel, without any individual one being
slower than it would be running alone.
cjs
--
Curt Sampson

On Wed, 2009-01-14 at 00:09 +0900, Curt Sampson wrote:
> On 2009-01-10 16:01 -0000 (Sat), duncan wrote:
> > Also downloads seem to be serialised, again because there is probably
> > little benefit to making multiple connections to the same server.
>
> You evidently don't live in Japan. :-)
>
> On high-bandwidth, high-latency links (e.g., Tokyo to a server in NYC,
> which has well over 200 ms latency just due to the distance) TCP
> doesn't work so well unless both sides are very well tuned for high
> bandwidth-latency products. I'm generally on a 100 Mbps fibre
> connection in Tokyo (it's the standard home or small office connection
> here), and parallel downloads are a very happy thing; in some
> circumstances I can run a dozen downloads in parallel, without any
> individual one being slower than it would be running alone.

Presumably to different servers though, right?

I was also under the impression that most web servers kind of frown on
more than one or two connections from the same client, and some would
take active measures to prevent it.

Duncan

Among other potential issues, I think he's saying that the TCP window
size closes for any one connection (because the RTT is high, ACKs
don't come often enough). Therefore, multiple downloads (multiple TCP
connections) can provide a boost even when going to the same server.
I've heard of this before and at the time advocated SCTP - not sure if
that's a good option here.
Tom
On Sat, Jan 17, 2009 at 5:16 PM, Duncan Coutts wrote:
> On Wed, 2009-01-14 at 00:09 +0900, Curt Sampson wrote:
> > On 2009-01-10 16:01 -0000 (Sat), duncan wrote:
> > > Also downloads seem to be serialised, again because there is
> > > probably little benefit to making multiple connections to the same
> > > server.
> >
> > You evidently don't live in Japan. :-)
> >
> > On high-bandwidth, high-latency links (e.g., Tokyo to a server in
> > NYC, which has well over 200 ms latency just due to the distance)
> > TCP doesn't work so well unless both sides are very well tuned for
> > high bandwidth-latency products. I'm generally on a 100 Mbps fibre
> > connection in Tokyo (it's the standard home or small office
> > connection here), and parallel downloads are a very happy thing; in
> > some circumstances I can run a dozen downloads in parallel, without
> > any individual one being slower than it would be running alone.
>
> Presumably to different servers though right?
>
> I was also under the impression that most web servers kind of frowned
> on more than one or two connections from the same client and some
> would take active measures to prevent it.
>
> Duncan
_______________________________________________
cabal-devel mailing list
cabal-devel@haskell.org
http://www.haskell.org/mailman/listinfo/cabal-devel

On 2009-01-17 19:26 +0000 (Sat), Thomas DuBuisson wrote:
> I've heard of this before and at the time advocated SCTP - not sure if
> that's a good option here.

Oops... should have included this in the last reply.

Anyway, standard TCP extensions for Long Fat Pipes work well, and are a
great solution. They're just not implemented as broadly as one would
hope, yet, partially due to the usual network-effects issue (clients
don't have them because many servers don't, and vice versa) and
partially because there's little demand in the U.S., the centre of the
world for software development, because it's still sort of second-world
as far as broadband deployment goes: most people still have connections
supporting only a few megabits per second or less.
(For comparison, in Japan, 45% of Internet users have fibre connections,
typically at 100 Mbps or 1 Gbps, and of the further 42% that have DSL,
most are in the range of 12 Mbps to 50 Mbps. That means that something
like 80% of Japan's Internet users have a link at least four times
faster than the rate at which they can download from a U.S. server
without TCP extensions for LFPs.)
cjs
--
Curt Sampson

On 2009-01-17 17:16 +0000 (Sat), Duncan Coutts wrote:
> On Wed, 2009-01-14 at 00:09 +0900, Curt Sampson wrote:
> > ..parallel downloads are a very happy thing; in some circumstances I
> > can run a dozen downloads in parallel, without any individual one
> > being slower than it would be running alone.
>
> Presumably to different servers though right?
No; to a single server. There still appear to be a lot of TCP
implementations out there not supporting the extensions necessary to
increase window size beyond 64K, meaning that at any time only 64K of
unacknowledged data can be outstanding on the connection. For me, a
250 ms round-trip time (RTT) is not unusual, meaning that the minimum
amount of time between the server sending a segment and getting my
client's acknowledgement of receipt is 250 ms. That means that, no
matter how much bandwidth is available, I'll never see more than
256 KB/s or so, which is a small fraction (1/40th) of the available
bandwidth between me and a server in the US with at least a 100 Mbps
connection.
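The arithmetic above can be sanity-checked directly: a window-limited TCP connection can have at most one receive window of unacknowledged data in flight per round trip, so throughput is capped at window size divided by RTT. A quick sketch (plain arithmetic, not anything from cabal-install; the function name is made up):

```haskell
-- Bandwidth-delay product check: with at most windowBytes of
-- unacknowledged data in flight per round trip, throughput cannot
-- exceed windowBytes / rttSeconds, in bytes per second.
throughputCap :: Double -> Double -> Double
throughputCap windowBytes rttSeconds = windowBytes / rttSeconds
```

With a 65536-byte window and a 0.25 s RTT this gives 262144 bytes/s, i.e. the 256 KB/s ceiling quoted in the message, regardless of how much bandwidth the path actually has.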
> I was also under the impression that most web servers kind of frowned
> on more than one or two connections from the same client and some
> would take active measures to prevent it.

Not that I'm aware of. Most have this capability, and some people choose
to use it when it solves a problem for them, but I haven't found it to
be terribly common.
And of course there's also the case where the files one downloads are on
different servers, either because the one "server" from which you're
downloading is actually a cluster (a common case with larger sites) or
because the things you're downloading are actually hosted by different
entities.
cjs
--
Curt Sampson

On Sun, 2009-01-18 at 12:54 +0900, Curt Sampson wrote:
> On 2009-01-17 17:16 +0000 (Sat), Duncan Coutts wrote:
> > On Wed, 2009-01-14 at 00:09 +0900, Curt Sampson wrote:
> > > ..parallel downloads are a very happy thing; in some circumstances
> > > I can run a dozen downloads in parallel, without any individual
> > > one being slower than it would be running alone.
> >
> > Presumably to different servers though right?
>
> No; to a single server. There still appear to be a lot of TCP
> implementations out there not supporting the extensions necessary to
> increase window size beyond 64K, meaning that at any time only 64K of
> unacknowledged data can be outstanding on the connection. For me, a
> 250 ms round-trip time (RTT) is not unusual, meaning that the minimum
> amount of time between the server sending a segment and getting my
> client's acknowledgement of receipt is 250 ms. That means that, no
> matter how much bandwidth is available, I'll never see more than
> 256 KB/s or so, which is a small fraction (1/40th) of the available
> bandwidth between me and a server in the US with at least a 100 Mbps
> connection.
>
> > I was also under the impression that most web servers kind of
> > frowned on more than one or two connections from the same client and
> > some would take active measures to prevent it.
>
> Not that I'm aware of. Most have this capability, and some people
> choose to use it when it solves a problem for them, but I haven't
> found it to be terribly common.
>
> And of course there's also the case where the files one downloads are
> on different servers, either because the one "server" from which
> you're downloading is actually a cluster (a common case with larger
> sites) or because the things you're downloading are actually hosted by
> different entities.

Fair enough. I'm rather jealous. My ADSL is only 0.5 Mbit/s :-(

So we'd need to establish multiple Network.HTTP.Browser sessions (though
with a way to control the maximum number of them, for people with DSL
like mine). Another thing to bear in mind for the redesign of the
cabal-install download component:

http://hackage.haskell.org/trac/hackage/ticket/448

Duncan
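Capping the number of concurrent downloads, as suggested above, can be sketched with a quantity semaphore from `Control.Concurrent.QSem`. This is a hypothetical helper, not the actual cabal-install design: `boundedParallel` is a made-up name, and the actions stand in for whatever download function the new component ends up with.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Concurrent.QSem
import Control.Exception (bracket_, finally)

-- | Run the given IO actions with at most n executing at once, and
-- wait for all of them to finish. A quantity semaphore bounds the
-- concurrency; each worker signals completion through its own MVar.
boundedParallel :: Int -> [IO ()] -> IO ()
boundedParallel n actions = do
  sem <- newQSem n
  let spawn act = do
        done <- newEmptyMVar
        _ <- forkIO $
               bracket_ (waitQSem sem) (signalQSem sem) act
                 `finally` putMVar done ()
        return done
  dones <- mapM spawn actions
  mapM_ takeMVar dones  -- block until every worker has finished
```

With n set to 1 this degenerates to the current serialised behaviour, so the same interface covers both the fast-fibre and the 0.5 Mbit/s ADSL case.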
participants (4):
- Curt Sampson
- Duncan Coutts
- Hackage
- Thomas DuBuisson