homogeneous environments


by Robert W. Anderson


I have an environment where we have many nodes potentially available for
compilation, and all of them see the same file spaces via NFS.  We are
seeing decent performance out of distcc 3.1 using pump mode, but from
reading the docs there may be big performance gains left to wring out in
this special(?) case.

If I understand correctly, distcc's pump mode finds a set of header
files necessary to send along with the source file to enable compilation
on a remote node.  In a homogeneous environment, it seems both steps
here are unnecessary if the master and slave nodes are more or less
indistinguishable in terms of compiler, sources, and headers.

I think we could really achieve some screaming compile times (over
thousands of source files) if these steps could be bypassed with the
user's explicit acknowledgement that he is making assumptions about the
homogeneity of his build server machines.

How extensive would the modifications be to support such an
optimization?  It was not clear to me after a few minutes of poking
around in the source, and I thought I'd seek an expert opinion first.

Thanks,
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704
__
distcc mailing list            http://distcc.samba.org/
To unsubscribe or change options:
https://lists.samba.org/mailman/listinfo/distcc

Re: homogeneous environments

by Fergus Henderson

On Tue, Apr 28, 2009 at 4:28 PM, Robert W. Anderson <anderson110@...> wrote:

> I have an environment where we have many nodes potentially available for
> compilation, and all of them see the same file spaces via NFS.  We are
> seeing decent performance out of distcc 3.1 using pump mode, but from
> reading the docs there may be big performance gains left to wring out in
> this special(?) case.
>
> If I understand correctly, distcc's pump mode finds a set of header
> files necessary to send along with the source file to enable compilation
> on a remote node.  In a homogeneous environment, it seems both steps
> here are unnecessary if the master and slave nodes are more or less
> indistinguishable in terms of compiler, sources, and headers.
>
> I think we could really achieve some screaming compile times (over
> thousands of source files) if these steps could be bypassed with the
> user's explicit acknowledgement that he is making assumptions about the
> homogeneity of his build server machines.
>
> How extensive would the modifications be to support such an
> optimization?  It was not clear to me after a few minutes of poking
> around in the source, and thought I'd seek an expert opinion first.

Typically NFS is a lot slower than local file access.
So it's not clear that this approach would actually improve overall performance.

Distcc can work faster than NFS, because it sends all of the source files at once, requiring only one round-trip between the client and the distcc server for each compilation.  With NFS, you need a round-trip between the distcc server and the NFS server for each header file that is included (directly or indirectly) from the source file being compiled.

Of course with distcc, if your source files are on NFS, the client needs to do the same round-trips to the NFS server to fetch the files, but this is not as bad as having the distcc servers do that, because the distcc client need only fetch each file once for the whole build, not once for each compilation in which it is referenced, and after that the file will probably be cached.  In addition, the client machine is more likely to have source files cached from previous builds, since on the client machine you're probably compiling the same sources that you compiled last time, whereas on the distcc server machines they are serving lots of different users who may be compiling very different programs.
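
To put rough numbers on the round-trip argument in the two paragraphs above, here is a back-of-the-envelope counting model in Python. It is only a sketch: the project sizes are made-up illustrative values, not measurements from this thread, and it ignores any caching on the distcc servers.

```python
def nfs_server_side_round_trips(compilations, headers_per_compilation):
    """Round trips if each distcc server reads the headers from NFS itself.

    One round trip per header included (directly or indirectly), repeated
    for every compilation, assuming nothing is cached on the servers.
    """
    return compilations * headers_per_compilation


def distcc_round_trips(compilations, unique_files):
    """Round trips when the client ships sources and headers via distcc.

    One client/server round trip per compilation, plus one NFS fetch per
    unique file on the client (fetched once, then assumed to stay cached).
    """
    return compilations + unique_files


# Hypothetical project size, chosen only for illustration.
compilations = 1000
headers_per_compilation = 200
unique_files = 3000

print(nfs_server_side_round_trips(compilations, headers_per_compilation))
print(distcc_round_trips(compilations, unique_files))
```

With these assumed sizes the first count is larger by a factor of fifty; the real ratio depends on how much the servers' own NFS caches help.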

Another issue with this approach is that there may also be additional security considerations.  Currently distcc servers normally run as user "distcc", which may not have access to the user's NFS files, so this approach would not work if the source files are not world-readable.  Of course it would be possible to address this issue by having the distcc server authenticate the user, and then access the user's files on NFS as that user, but that would require additional authentication, which would have a performance impact.  For example one way to do it would be to use distcc's ssh mode, but that mode has a major performance impact. (The recently posted patches for GSSAPI support have less performance impact, but there is still a significant impact.)

For the approach that you are considering, you may not need to use distcc at all;
a simple script using ssh may be sufficient, though the overheads of ssh may be prohibitive (ssh connection sharing may help with that, although that has security concerns of its own).
If you do want to modify distcc, I'd guess that the modifications needed would be moderate in scope.
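
A minimal sketch of the "simple script using ssh" idea, written in Python. The host names, the shared build directory, and the compiler command are all hypothetical; the sketch assumes password-less ssh and that every host sees the same paths (as in the NFS setup described in this thread). It is not part of distcc.

```python
import itertools
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def remote_compile_command(host, workdir, source, compiler="gcc"):
    """Build the ssh argv that compiles one source file on a remote host."""
    obj = source.rsplit(".", 1)[0] + ".o"
    remote = "cd {} && {} -c {} -o {}".format(
        shlex.quote(workdir),
        shlex.quote(compiler),
        shlex.quote(source),
        shlex.quote(obj),
    )
    return ["ssh", host, remote]


def plan(hosts, workdir, sources):
    """Assign sources to hosts round-robin; returns a list of argv lists."""
    return [
        remote_compile_command(host, workdir, src)
        for host, src in zip(itertools.cycle(hosts), sources)
    ]


def run_all(commands, jobs, runner=subprocess.call):
    """Run the commands with at most `jobs` in flight; returns exit codes."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Hypothetical hosts and paths; this only prints the plan, it runs nothing.
    cmds = plan(["host16", "host17"], "/nfs/project/build", ["a.c", "b.c", "c.c"])
    for argv in cmds:
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Calling `run_all(cmds, jobs=8)` would actually launch the ssh processes. Each call opens a new ssh connection, which is the overhead mentioned above.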

Cheers,
  Fergus.

--
Fergus Henderson <fergus@...>

Re: homogeneous environments

by Robert W. Anderson

Fergus,

Thanks for the clear and detailed reply.  First I should note that I am
already using ssh mode (via rsh) because I was unable to make TCP mode
work.  I don't know if this is some kind of port blocking restriction on
my machine or what:

distcc[1663] (dcc_pump_sendfile) ERROR: sendfile failed: Connection
reset by peer
distcc[1663] (dcc_writex) ERROR: failed to write: Broken pipe
distcc[1663] Warning: failed to distribute source.c to host16,cpp,lzo,
running locally instead

Perhaps getting TCP mode running should be my first performance priority.

I just tried what you suggested in your last paragraph, manually
distributing compiles via rsh, and am finding that it is, as you
suspected, a little slower than distcc using pump mode.  Rather than
pursue that any further, based on your comments, I would like to see if
I can get TCP pump mode working first.

Thanks,
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704

Re: homogeneous environments

by Fergus Henderson



On Wed, Apr 29, 2009 at 1:39 PM, Robert W. Anderson <anderson110@...> wrote:

> Fergus,
>
> Thanks for the clear and detailed reply.  First I should note that I am
> already using ssh mode (via rsh) because I was unable to make TCP mode
> work.  I don't know if this is some kind of port blocking restriction on
> my machine or what:
>
> distcc[1663] (dcc_pump_sendfile) ERROR: sendfile failed: Connection reset by peer
> distcc[1663] (dcc_writex) ERROR: failed to write: Broken pipe
> distcc[1663] Warning: failed to distribute source.c to host16,cpp,lzo, running locally instead

It looks like distccd is dropping the connection?
Have a look in the distccd log file (e.g. /var/log/syslog or /var/log/messages on host16); there may be more information there.

Otherwise, maybe there is some issue with the sys_sendfile() implementation on your system.
You may be able to make it work by using dcc_pump_readwrite() rather than sys_sendfile().
See the source for the dcc_pump_sendfile() function in src/sendfile.c for more info.
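
The fallback idea (try sendfile, and finish with an ordinary read/write loop if it fails) can be illustrated in Python. This is not distcc's C code: `pump_file` and `pump_readwrite` are hypothetical names for this sketch, and only `os.sendfile`, `os.pread`, and `socket.sendall` are real library calls.

```python
import os
import socket


def pump_readwrite(in_fd, out_sock, offset, size, chunk=65536):
    """Copy `size` bytes of in_fd, starting at `offset`, using read + send."""
    end = offset + size
    while offset < end:
        data = os.pread(in_fd, min(chunk, end - offset), offset)
        if not data:
            raise EOFError("input file shorter than expected")
        out_sock.sendall(data)
        offset += len(data)


def pump_file(in_fd, out_sock, size, use_sendfile=True):
    """Send `size` bytes of in_fd to out_sock, preferring sendfile(2).

    If sendfile is unavailable or raises OSError, whatever has not been
    sent yet goes through the plain read/write loop instead.
    """
    offset = 0
    if use_sendfile and hasattr(os, "sendfile"):
        try:
            while offset < size:
                sent = os.sendfile(out_sock.fileno(), in_fd, offset, size - offset)
                if sent == 0:
                    break
                offset += sent
        except OSError:
            pass  # fall through and finish with read/write
    pump_readwrite(in_fd, out_sock, offset, size - offset)
```

Passing `use_sendfile=False` forces the read/write path, which is a quick way to check whether the sendfile call itself is what fails on a given system.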

> Perhaps getting TCP mode running should be my first performance priority.
>
> I just tried what you suggested in your last paragraph, manually
> distributing compiles via rsh, and am finding that it is, as you
> suspected, a little slower than distcc using pump mode.  Rather than
> pursue that any further, based on your comments, I would like to see if
> I can get TCP pump mode working first.

Sounds like a good plan.
--
Fergus Henderson <fergus@...>

Re: homogeneous environments

by Robert W. Anderson

Fergus Henderson wrote:

>     distcc[1663] (dcc_pump_sendfile) ERROR: sendfile failed: Connection
>     reset by peer
>     distcc[1663] (dcc_writex) ERROR: failed to write: Broken pipe
>     distcc[1663] Warning: failed to distribute source.c to
>     host16,cpp,lzo, running locally instead
>
>
> It looks like distccd is dropping the connection?
> Have a look in the distccd log file (e.g. /var/log/syslog or
> /var/log/messages on host16); there may be more information there.

Neither of those existed so I restarted the daemon with the --log-file
option and got:

distccd[2895] (dcc_check_client) ERROR: connection from client
'192.168.120.3:55948' denied by access list

The plot thickens a bit.  I had been using the eth0 address in my
--allow option, but the address I see here is related to an InfiniBand
interconnect, according to ifconfig.  If I try allowing the address
192.168.120.3, it reports the same "denied" error.  If I use the full
address as shown with the colon suffix, I get an error "can't parse
internet address".
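
As an aside on the address format in that log line, the sketch below uses Python's standard `ipaddress` module to show why a string with a `:port` suffix does not parse as a bare address, and how an address or CIDR allow entry can be matched once the port is split off. It does not reproduce distccd's own matching logic and does not explain why the plain address was still denied; `10.0.0.5` is a hypothetical stand-in for the eth0 address.

```python
import ipaddress


def split_peer(peer):
    """Split a logged 'address:port' string into (address, port)."""
    host, _, port = peer.rpartition(":")
    return ipaddress.ip_address(host), int(port)


def is_allowed(client, allow_list):
    """True if client falls inside any allowed entry (address or CIDR)."""
    return any(
        client in ipaddress.ip_network(entry, strict=False)
        for entry in allow_list
    )


addr, port = split_peer("192.168.120.3:55948")
print(addr, port)
print(is_allowed(addr, ["192.168.120.0/24"]))  # InfiniBand subnet (assumed /24)
print(is_allowed(addr, ["10.0.0.5"]))          # hypothetical eth0 address

try:
    ipaddress.ip_address("192.168.120.3:55948")
except ValueError as err:
    print("not a bare address:", err)
```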

Any ideas?

Thanks,
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704

Re: homogeneous environments

by Fergus Henderson



On Wed, Apr 29, 2009 at 3:20 PM, Robert W. Anderson <anderson110@...> wrote:
> Fergus Henderson wrote:
>
> >     distcc[1663] (dcc_pump_sendfile) ERROR: sendfile failed: Connection reset by peer
> >     distcc[1663] (dcc_writex) ERROR: failed to write: Broken pipe
> >     distcc[1663] Warning: failed to distribute source.c to host16,cpp,lzo, running locally instead
> >
> > It looks like distccd is dropping the connection?
> > Have a look in the distccd log file (e.g. /var/log/syslog or
> > /var/log/messages on host16); there may be more information there.
>
> Neither of those existed so I restarted the daemon with the --log-file
> option and got:
>
> distccd[2895] (dcc_check_client) ERROR: connection from client '192.168.120.3:55948' denied by access list
>
> The plot thickens a bit.  I had been using the eth0 address in my
> --allow option, but the address I see here is related to an InfiniBand
> interconnect, according to ifconfig.  If I try allowing the address
> 192.168.120.3, it reports the same "denied" error.  If I use the full
> address as shown with the colon suffix, I get an error "can't parse
> internet address".
>
> Any ideas?

What's the distccd log output if you start distccd with --verbose?

--
Fergus Henderson <fergus@...>

Re: homogeneous environments

by Robert W. Anderson

Fergus Henderson wrote:

>     The plot thickens a bit.  I had been using the eth0 address in my
>     --allow option, but the address I see here is related to an
>     InfiniBand interconnect, according to ifconfig.  If I try allowing
>     the address 192.168.120.3, it reports the same "denied" error.  If I
>     use the full address as shown with the colon suffix, I get an error
>     "can't parse internet address".
>
>     Any ideas?
>
>
> What's the distccd log output if you start distccd with --verbose?

If you're truly interested, I'll try it.  But in the meantime I just
disabled the client check in the source and everything seemed to work
from that point.

What's possibly more interesting are the performance results.  I'm
compiling about 1300 source files on nodes that have 16 CPUs each.

Single node (plain GNU make, no distcc):

-j2 8m 19s
-j4 5m 46s
-j5 6m 39s
-j8 10m 35s

I don't understand this, but it is repeatable.  Any ideas on that one?
I'm not sure how to effectively profile this. All the sources are on NFS.

So I then went multi-node w/ 4 jobs per node.  Using localhost as a
server only seems to slow things down, incidentally.

1 node,  -j4:  5m 28s (using distcc and 1 remote node)
2 nodes, -j8:  2m 57s
3 nodes, -j12: 2m 16s
4 nodes, -j16: 1m 58s
5 nodes, -j20: 2m 7s

Scaling seems to break down around the 4 node mark.  Our link step is
only 5-6 seconds, so we are not getting bound by that.  Messing with -j
further doesn't seem to help.  Any ideas for profiling this to find any
final bottlenecks?
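
For reference, the multi-node timings above can be turned into speedup and parallel efficiency relative to the one-node distcc run. The times are the ones reported in this message; the helper names are just for this sketch.

```python
# Wall-clock times reported above, keyed by number of remote nodes.
TIMES = {1: "5m 28s", 2: "2m 57s", 3: "2m 16s", 4: "1m 58s", 5: "2m 7s"}


def to_seconds(text):
    """Convert a string like '5m 28s' to seconds."""
    minutes, seconds = text.split()
    return int(minutes.rstrip("m")) * 60 + int(seconds.rstrip("s"))


def scaling(times):
    """Yield (nodes, seconds, speedup, efficiency) relative to the 1-node run."""
    base = to_seconds(times[1])
    for nodes in sorted(times):
        secs = to_seconds(times[nodes])
        speedup = base / secs
        yield nodes, secs, speedup, speedup / nodes


for nodes, secs, speedup, eff in scaling(TIMES):
    print("{} nodes: {:4d}s  speedup {:.2f}  efficiency {:.0%}".format(
        nodes, secs, speedup, eff))
```

Efficiency drops from about 93% at two nodes to about 52% at five, which matches the observation that scaling breaks down around the four-node mark.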

Thanks,
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704

Re: homogeneous environments

by Fergus Henderson

On Wed, Apr 29, 2009 at 5:41 PM, Robert W. Anderson <anderson110@...> wrote:
> I just disabled the client check in the source and everything seemed to work from that point.

That's OK if you're on a trusted network behind a firewall, or if you're not concerned about security.

> What's possibly more interesting is the performance results.  I'm compiling about 1300 source files on nodes that have 16 cpu's each.
>
> Single node (plain GNU make, no distcc):
>
> -j2 8m 19s
> -j4 5m 46s
> -j5 6m 39s
> -j8 10m 35s
>
> I don't understand this, but it is repeatable.  Any ideas on that one?

It looks like your machine probably has 4 CPUs, with each job using nearly 100% CPU.
Or possibly it has 2 CPUs, with each job using about 50% CPU, and spending the remaining time waiting for I/O (but I think this possibility is less likely).
Either way, using "-j4" already gets you the maximum amount of parallelization that you can benefit from, with total CPU utilization for each CPU being near 100%.  Higher levels of parallelism can't increase CPU utilization beyond 100%, but they do add overheads, due to a larger working set, more task switching, and worse locality.  So higher -j levels increase the overall build time.
 
> I'm not sure how to effectively profile this. All the sources are on NFS.
>
> So I then went multi-node w/ 4 jobs per node.  Using localhost as a server only seems to slow things down, incidentally.
>
> 1 node,  -j4:  5m 28s (using distcc and 1 remote node)
> 2 nodes, -j8:  2m 57s
> 3 nodes, -j12: 2m 16s
> 4 nodes, -j16: 1m 58s
> 5 nodes, -j20: 2m 7s
>
> Scaling seems to break down around the 4 node mark.  Our link step is only 5-6 seconds, so we are not getting bound by that.  Messing with -j further doesn't seem to help.  Any ideas for profiling this to find any final bottlenecks?

First, try running "top" during the build to determine the CPU usage on your local host.  If it stays near 100%, then the bottleneck is local jobs such as linking and/or include scanning, and top will show you which jobs are using the CPU most.  That's quite likely to be the limiting factor if you have a large number of nodes.

Another possibility is lack of parallelism in your Makefile; you may have 1300 source files, but the dependencies in your Makefile probably mean that you can't actually run 1300 compiles in parallel.  Maybe your Makefile only allows about 16 compiles to run in parallel on average.

--
Fergus Henderson <fergus.henderson@...>

Re: homogeneous environments

by Robert W. Anderson

Fergus Henderson wrote:

>     What's possibly more interesting is the performance results.  I'm
>     compiling about 1300 source files on nodes that have 16 cpu's each.
>
>     Single node (plain GNU make, no distcc):
>
>     -j2 8m 19s
>     -j4 5m 46s
>     -j5 6m 39s
>     -j8 10m 35s
>
>     I don't understand this, but it is repeatable.  Any ideas on that one?
>
>
> It looks like your machine probably has 4 CPUs, with each job using
> nearly 100% CPU.

My node actually has four quad-core processors.

make -j4 cpu utilization, according to top:

Cpu0  :  0.7%us,  4.0%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  : 13.6%us, 29.7%sy,  0.0%ni, 56.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  :  0.7%us,  1.8%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  :  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu4  : 26.8%us, 22.4%sy,  0.0%ni, 49.6%id,  0.0%wa,  0.0%hi,  1.1%si
Cpu5  :  7.0%us, 15.8%sy,  0.0%ni, 77.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu6  :  2.9%us,  2.9%sy,  0.0%ni, 94.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu7  :  2.6%us,  0.7%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu8  :  0.7%us,  3.7%sy,  0.0%ni, 95.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu9  :  9.5%us, 15.8%sy,  0.0%ni, 74.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu10 :  0.7%us,  2.9%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu11 :  0.4%us,  0.4%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu12 :  1.5%us,  5.8%sy,  0.0%ni, 92.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu13 :  5.1%us,  5.1%sy,  0.0%ni, 89.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu14 :  6.6%us,  5.1%sy,  0.0%ni, 88.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu15 : 27.5%us, 30.0%sy,  0.0%ni, 41.8%id,  0.0%wa,  0.0%hi,  0.7%si

It bounces all over the place but this is not an atypical snapshot.  The
machine is mostly idle, according to the fourth column.  Is this somehow
a major resource contention issue?  Disk access, maybe?  Note that I
have tried local disk access on a /tmp partition, and while overall
performance improves a bit, the scaling with increasing -j does not.
-j5 is still slower than -j4.
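
A small Python sketch for summarizing a snapshot like the one above instead of eyeballing sixteen rows. It assumes the per-CPU line format shown here (`CpuN : x%us, y%sy, ...`); only the first four lines of the -j4 snapshot are embedded, so paste the full output to get whole-machine averages.

```python
import re

# First four lines of the -j4 snapshot above; extend with the full output.
SNAPSHOT = """\
Cpu0  :  0.7%us,  4.0%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  : 13.6%us, 29.7%sy,  0.0%ni, 56.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  :  0.7%us,  1.8%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  :  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si
"""

# Matches pairs such as "13.6%us" and captures ("13.6", "us").
FIELD = re.compile(r"([\d.]+)%(\w+)")


def parse_top(text):
    """Return one {field: percent} dict per 'CpuN :' line."""
    rows = []
    for line in text.splitlines():
        if line.startswith("Cpu"):
            rows.append({name: float(value) for value, name in FIELD.findall(line)})
    return rows


def averages(rows):
    """Average each field across all parsed CPUs."""
    return {name: sum(row[name] for row in rows) / len(rows) for name in rows[0]}


avg = averages(parse_top(SNAPSHOT))
print("user {us:.1f}%  system {sy:.1f}%  idle {id:.1f}%  iowait {wa:.1f}%".format(**avg))
```

Comparing the averaged user, system, and iowait figures between the -j4 and -j8 runs is one way to see where the extra wall-clock time is being accounted.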

A make -j8 run barely eats any more cpu.  This is with fully local disk,
too:

Cpu0  : 20.0%us, 38.2%sy,  0.0%ni, 41.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  :  5.4%us, 30.4%sy,  0.0%ni, 64.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  : 14.5%us, 21.8%sy,  0.0%ni, 63.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  : 12.7%us, 41.8%sy,  0.0%ni, 45.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu4  : 20.0%us, 27.3%sy,  0.0%ni, 52.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu5  :  3.7%us, 29.6%sy,  0.0%ni, 66.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu6  :  1.8%us, 32.7%sy,  0.0%ni, 65.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu7  :  1.8%us,  3.5%sy,  0.0%ni, 94.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu8  :  1.8%us, 30.9%sy,  0.0%ni, 67.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu9  :  5.4%us, 23.2%sy,  0.0%ni, 71.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu10 :  3.6%us, 12.7%sy,  0.0%ni, 83.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu11 :  0.0%us,  7.3%sy,  0.0%ni, 92.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu12 :  3.6%us, 32.7%sy,  0.0%ni, 63.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu13 :  5.4%us, 19.6%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu14 :  0.0%us,  5.5%sy,  0.0%ni, 94.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu15 :  3.5%us, 28.1%sy,  0.0%ni, 68.4%id,  0.0%wa,  0.0%hi,  0.0%si

Recall that -j8 results in a 10m build whereas -j4 results in a 5m build.

>     I'm not sure how to effectively profile this. All the sources are on
>     NFS.
>
>     So I then went multi-node w/ 4 jobs per node.  Using localhost as a
>     server only seems to slow things down, incidentally.
>
>     1 node,  -j4:  5m 28s (using distcc and 1 remote node)
>     2 nodes, -j8:  2m 57s
>     3 nodes, -j12: 2m 16s
>     4 nodes, -j16: 1m 58s
>     5 nodes, -j20: 2m 7s
>
>     Scaling seems to break down around the 4 node mark.  Our link step
>     is only 5-6 seconds, so we are not getting bound by that.  Messing
>     with -j further doesn't seem to help.  Any ideas for profiling this
>     to find any final bottlenecks?
>
>
> First, try running "top" during the build to determine the CPU usage on
> your local host.  If it stays near 100%, then the bottleneck is local
> jobs such as linking and/or include scanning, and top will show you
> which jobs are using the CPU most.  That's quite likely to be the
> limiting factor if you have a large number of nodes.

Not surprisingly (now), the localhost CPU is mostly idle as well during
a multi-node build.  A snapshot:

Cpu0  :  0.0%us, 11.5%sy,  0.0%ni, 88.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  :  1.9%us, 26.9%sy,  0.0%ni, 71.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  :  1.9%us, 29.6%sy,  0.0%ni, 68.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  :  1.9%us, 17.0%sy,  0.0%ni, 81.1%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu4  :  9.4%us, 43.4%sy,  0.0%ni, 43.4%id,  0.0%wa,  0.0%hi,  3.8%si
Cpu5  :  3.8%us, 28.3%sy,  0.0%ni, 67.9%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu6  :  1.9%us, 18.9%sy,  0.0%ni, 79.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu7  :  1.9%us, 28.8%sy,  0.0%ni, 69.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu8  :  1.9%us, 11.3%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu9  :  3.7%us, 37.0%sy,  0.0%ni, 59.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu10 :  1.9%us, 26.4%sy,  0.0%ni, 71.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu11 :  0.0%us, 11.3%sy,  0.0%ni, 88.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu12 :  1.9%us, 15.4%sy,  0.0%ni, 82.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu13 :  1.9%us, 30.2%sy,  0.0%ni, 67.9%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu14 :  3.8%us, 22.6%sy,  0.0%ni, 73.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu15 :  1.9%us, 24.5%sy,  0.0%ni, 73.6%id,  0.0%wa,  0.0%hi,  0.0%si


> Another possibility is lack of parallelism in your Makefile; you may
> have 1300 source files, but the dependencies in your Makefile probably
> mean that you can't actually run 1300 compiles in parallel.  Maybe your
> Makefile only allows about 16 compiles to run in parallel on average.

I believe I fixed my makefiles to be, after a couple of short initial
serial steps, fully parallel in compiling the source, both per directory
and per source file.  I do see the directories being interleaved in my
output, and also big bursts of files from the same directory being launched.


--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704

Re: homogeneous environments

by Fergus Henderson

The include server could be the bottleneck.  What's the CPU usage for the include server process?
Or it could be disk I/O.  Try iostat or vmstat to profile that.

On Thu, Apr 30, 2009 at 1:58 PM, Robert W. Anderson <anderson110@...> wrote:
Fergus Henderson wrote:
   What's possibly more interesting is the performance results.  I'm
   compiling about 1300 source files on nodes that have 16 cpu's each.

   Single node (plain GNU make, no distcc):

   -j2 8m 19s
   -j4 5m 46s
   -j5 6m 39s
   -j8 10m 35s

   I don't understand this, but it is repeatable.  Any ideas on that one?


It looks like your machine probably has 4 CPUs, with each job using nearly 100% CPU.

My node actually has four quad-core processors.

make -j4 cpu utilization, according to top:

Cpu0  :  0.7%us,  4.0%sy,  0.0%ni, 95.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  : 13.6%us, 29.7%sy,  0.0%ni, 56.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  :  0.7%us,  1.8%sy,  0.0%ni, 97.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  :  0.7%us,  0.7%sy,  0.0%ni, 98.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu4  : 26.8%us, 22.4%sy,  0.0%ni, 49.6%id,  0.0%wa,  0.0%hi,  1.1%si
Cpu5  :  7.0%us, 15.8%sy,  0.0%ni, 77.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu6  :  2.9%us,  2.9%sy,  0.0%ni, 94.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu7  :  2.6%us,  0.7%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu8  :  0.7%us,  3.7%sy,  0.0%ni, 95.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu9  :  9.5%us, 15.8%sy,  0.0%ni, 74.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu10 :  0.7%us,  2.9%sy,  0.0%ni, 96.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu11 :  0.4%us,  0.4%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu12 :  1.5%us,  5.8%sy,  0.0%ni, 92.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu13 :  5.1%us,  5.1%sy,  0.0%ni, 89.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu14 :  6.6%us,  5.1%sy,  0.0%ni, 88.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu15 : 27.5%us, 30.0%sy,  0.0%ni, 41.8%id,  0.0%wa,  0.0%hi,  0.7%si

It bounces all over the place but this is not an atypical snapshot.  The machine is mostly idle, according to the fourth column.  Is this somehow a major resource contention issue?  Disk access, maybe?  Note that I have tried local disk access on a /tmp partition, and while overall performance improves a bit, the scaling with increasing -j does not. -j5 is still slower than -j4.

A make -j8 run barely uses any more CPU.  This is with fully local disk, too:

Cpu0  : 20.0%us, 38.2%sy,  0.0%ni, 41.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  :  5.4%us, 30.4%sy,  0.0%ni, 64.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  : 14.5%us, 21.8%sy,  0.0%ni, 63.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  : 12.7%us, 41.8%sy,  0.0%ni, 45.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu4  : 20.0%us, 27.3%sy,  0.0%ni, 52.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu5  :  3.7%us, 29.6%sy,  0.0%ni, 66.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu6  :  1.8%us, 32.7%sy,  0.0%ni, 65.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu7  :  1.8%us,  3.5%sy,  0.0%ni, 94.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu8  :  1.8%us, 30.9%sy,  0.0%ni, 67.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu9  :  5.4%us, 23.2%sy,  0.0%ni, 71.4%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu10 :  3.6%us, 12.7%sy,  0.0%ni, 83.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu11 :  0.0%us,  7.3%sy,  0.0%ni, 92.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu12 :  3.6%us, 32.7%sy,  0.0%ni, 63.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu13 :  5.4%us, 19.6%sy,  0.0%ni, 75.0%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu14 :  0.0%us,  5.5%sy,  0.0%ni, 94.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu15 :  3.5%us, 28.1%sy,  0.0%ni, 68.4%id,  0.0%wa,  0.0%hi,  0.0%si

Recall that -j8 results in a 10m build whereas -j4 results in a 5m build.


   I'm not sure how to effectively profile this. All the sources are on
   NFS.

   So I then went multi-node w/ 4 jobs per node.  Using localhost as a
   server only seems to slow things down, incidentally.

   1 node,  -j4:  5m 28s (using distcc and 1 remote node)
   2 nodes, -j8:  2m 57s
   3 nodes, -j12: 2m 16s
   4 nodes, -j16: 1m 58s
   5 nodes, -j20: 2m 7s

   Scaling seems to break down around the 4 node mark.  Our link step
   is only 5-6 seconds, so we are not getting bound by that.  Messing
   with -j further doesn't seem to help.  Any ideas for profiling this
   to find any final bottlenecks?
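
The multi-node timings above can be turned into speedup and parallel efficiency relative to the 1-node run, which makes the point where scaling flattens easier to see. This is a minimal sketch in Python. The helper name to_seconds is illustrative, and the 1-node distcc run is assumed as the baseline.

```python
def to_seconds(text):
    """Convert a duration such as '5m 28s' to seconds."""
    minutes, seconds = text.replace("m", "").replace("s", "").split()
    return int(minutes) * 60 + int(seconds)


# (nodes, wall-clock time) pairs copied from the table above.
TIMINGS = [
    (1, "5m 28s"),
    (2, "2m 57s"),
    (3, "2m 16s"),
    (4, "1m 58s"),
    (5, "2m 7s"),
]

baseline = to_seconds(TIMINGS[0][1])  # assumption: the 1-node run is the reference

results = []
for nodes, duration in TIMINGS:
    elapsed = to_seconds(duration)
    speedup = baseline / elapsed
    efficiency = speedup / nodes  # 1.0 would be perfect linear scaling
    results.append((nodes, elapsed, speedup, efficiency))
    print(f"{nodes} node(s): {elapsed:4d}s  speedup {speedup:.2f}x  efficiency {efficiency:.0%}")
```

An efficiency that keeps falling as nodes are added suggests a serial or shared resource (the local host, the include server, or the file system) rather than a lack of remote capacity.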


First, try running "top" during the build to determine the CPU usage on your local host.  If it stays near 100%, then the bottleneck is local jobs such as linking and/or include scanning, and top will show you which jobs are using the CPU most.  That's quite likely to be the limiting factor if you have a large number of nodes.

Not surprisingly (now), the localhost CPU is mostly idle as well during a multi-node build.  A snapshot:

Cpu0  :  0.0%us, 11.5%sy,  0.0%ni, 88.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu1  :  1.9%us, 26.9%sy,  0.0%ni, 71.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu2  :  1.9%us, 29.6%sy,  0.0%ni, 68.5%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu3  :  1.9%us, 17.0%sy,  0.0%ni, 81.1%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu4  :  9.4%us, 43.4%sy,  0.0%ni, 43.4%id,  0.0%wa,  0.0%hi,  3.8%si
Cpu5  :  3.8%us, 28.3%sy,  0.0%ni, 67.9%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu6  :  1.9%us, 18.9%sy,  0.0%ni, 79.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu7  :  1.9%us, 28.8%sy,  0.0%ni, 69.2%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu8  :  1.9%us, 11.3%sy,  0.0%ni, 86.8%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu9  :  3.7%us, 37.0%sy,  0.0%ni, 59.3%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu10 :  1.9%us, 26.4%sy,  0.0%ni, 71.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu11 :  0.0%us, 11.3%sy,  0.0%ni, 88.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu12 :  1.9%us, 15.4%sy,  0.0%ni, 82.7%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu13 :  1.9%us, 30.2%sy,  0.0%ni, 67.9%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu14 :  3.8%us, 22.6%sy,  0.0%ni, 73.6%id,  0.0%wa,  0.0%hi,  0.0%si
Cpu15 :  1.9%us, 24.5%sy,  0.0%ni, 73.6%id,  0.0%wa,  0.0%hi,  0.0%si



Another possibility is lack of parallelism in your Makefile; you may have 1300 source files, but the dependencies in your Makefile probably mean that you can't actually run 1300 compiles in parallel.  Maybe your Makefile only allows about 16 compiles to run in parallel on average.

I believe I fixed my makefiles to be, after a couple of short initial serial steps, fully parallel in compiling the source, both per directory and per source file.  I do see the directories being interleaved in my output, and also big bursts of files from the same directory being launched.



--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704



--
Fergus Henderson <fergus@...>


Re: homogeneous environments

by mbp

2009/4/29 Robert W. Anderson <anderson110@...>:

>
> I have an environment where we have many nodes potentially available for
> compilation, and all of them see the same file spaces via NFS.  We are
> seeing decent performance out of distcc 3.1 using pump mode, but from
> reading the docs there may be big performance gains left to wring out in
> this special(?) case.
>
> If I understand correctly, distcc's pump mode finds a set of header files
> necessary to send along with the source file to enable compilation on a
> remote node.  In a homogeneous environment, it seems both steps here are
> unnecessary if the master and slave nodes are more or less indistinguishable
> in terms of compiler, sources, and headers.

I'll just note that there are other existing tools built specifically
for the case of doing parallel builds across homogeneous machines
sharing a single filesystem.  I think dmake and pmake may be able to
do this, and there are probably others, because it was a pretty common
configuration for Unix machines some years ago.   It does have the
advantage that you can potentially parallelize many different steps,
not just compilation.

However, as Fergus says, NFS is typically slower than streaming just
the files that are needed, so it's not a clear win.  You may also get
into trouble with different machines having slightly unsynchronized
views.

--
Martin <http://launchpad.net/~mbp/>

Re: homogeneous environments

by Robert W. Anderson

Fergus Henderson wrote:
> The include server could be the bottleneck.  What's the CPU usage for
> the include server process?
> Or it could be disk I/O.  Try iostat or vmstat to profile that.

After some more experimentation, I think I may have a clue what's going
on here.  I think I may be getting bound up in context switching costs,
both in my single node performance and on localhost using distcc.

On a single node:

-j4: 24 million context switches @ 10us = 4m
-j8: 96 million context switches @ 10us = 16m

         Command being timed: "make -j4 debug"
         User time (seconds): 378.81
         System time (seconds): 566.57
         Percent of CPU this job got: 265%
         Elapsed (wall clock) time (h:mm:ss or m:ss): 5:55.91
         Major (requiring I/O) page faults: 472
         Minor (reclaiming a frame) page faults: 26761810
         Voluntary context switches: 24629071
         Involuntary context switches: 269418

         Command being timed: "make -j8 debug"
         User time (seconds): 427.55
         System time (seconds): 2250.72
         Percent of CPU this job got: 429%
         Elapsed (wall clock) time (h:mm:ss or m:ss): 10:23.29
         Major (requiring I/O) page faults: 270
         Minor (reclaiming a frame) page faults: 26760678
         Voluntary context switches: 95391888
         Involuntary context switches: 825155

I don't understand why doubling the job count increases context
switching by 4x.  Any insights appreciated.
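
The arithmetic behind the two estimates can be checked directly from the "time" reports above. This is a minimal sketch in Python. The 10 microsecond cost per switch is the rough assumption used in the estimate, not a measured value, and the function names are illustrative. As far as I can tell, the counts cover every process in the build, so the product is more naturally compared with the summed system time than with wall-clock time.

```python
# Figures copied from the two "time" reports above.
RUNS = {
    "-j4": {"user_s": 378.81, "system_s": 566.57, "elapsed_s": 5 * 60 + 55.91,
            "voluntary": 24629071, "involuntary": 269418},
    "-j8": {"user_s": 427.55, "system_s": 2250.72, "elapsed_s": 10 * 60 + 23.29,
            "voluntary": 95391888, "involuntary": 825155},
}

# Assumption: the rough 10 microseconds per switch used in the estimate.
ASSUMED_SWITCH_COST_US = 10


def total_switches(run):
    """Voluntary plus involuntary context switches for one run."""
    return run["voluntary"] + run["involuntary"]


def estimated_switch_seconds(run, cost_us=ASSUMED_SWITCH_COST_US):
    """Total time attributed to switching, under the assumed per-switch cost."""
    return total_switches(run) * cost_us / 1e6


def cpu_percent(run):
    """CPU share as reported by time: (user + system) / elapsed."""
    return 100.0 * (run["user_s"] + run["system_s"]) / run["elapsed_s"]


for label, run in RUNS.items():
    print(f"{label}: {total_switches(run):,} switches, "
          f"~{estimated_switch_seconds(run) / 60:.1f} min at {ASSUMED_SWITCH_COST_US} us each, "
          f"system time {run['system_s'] / 60:.1f} min, CPU {cpu_percent(run):.1f}%")

j4, j8 = RUNS["-j4"], RUNS["-j8"]
switch_ratio = total_switches(j8) / total_switches(j4)
system_ratio = j8["system_s"] / j4["system_s"]
print(f"context switch ratio -j8/-j4: {switch_ratio:.2f}")
print(f"system time ratio    -j8/-j4: {system_ratio:.2f}")
```

If the two ratios printed at the end come out close to each other, that is consistent with the extra system time tracking the extra context switches, although it does not by itself say what is causing them.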

Thanks,
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704

Re: homogeneous environments

by Fergus Henderson-5



On Thu, Apr 30, 2009 at 9:07 PM, Robert W. Anderson <anderson110@...> wrote:
> Fergus Henderson wrote:
>
>     The include server could be the bottleneck.  What's the CPU usage for the include server process?
>     Or it could be disk I/O.  Try iostat or vmstat to profile that.
>
> After some more experimentation, I think I may have a clue what's going on here.  I think I may be getting bound up in context switching costs, both in my single node performance and on localhost using distcc.
>
> On a single node:
>
> -j4: 24 million context switches @ 10us = 4m
> -j8: 96 million context switches @ 10us = 16m
> ...
>
> I don't understand why doubling the job count increases context switching by 4x.  Any insights appreciated.

You mentioned earlier that your host has four processors.
With -j4, you have number of processors = number of jobs, so to a first approximation the only context switches should be switching between user mode and kernel mode for system calls, I/O or paging.
With -j8, you now have two jobs for each processor, so you're doing additional context switches for time slicing between different user mode processes.  Depending on the time slice, that could be arbitrarily higher.

Cheers,
  Fergus.

--
Fergus Henderson <fergus.henderson@...>


Re: homogeneous environments

by Robert W. Anderson

Fergus Henderson wrote:

>
>
> On Thu, Apr 30, 2009 at 9:07 PM, Robert W. Anderson
> <anderson110@...> wrote:
>
>     Fergus Henderson wrote:
>
>         The include server could be the bottleneck.  What's the CPU
>         usage for the include server process?
>         Or it could be disk I/O.  Try iostat or vmstat to profile that.
>
>
>     After some more experimentation, I think I may have a clue what's
>     going on here.  I think I may be getting bound up in context
>     switching costs, both in my single node performance and on localhost
>     using distcc.
>
>     On a single node:
>
>     -j4: 24 million context switches @ 10us = 4m
>     -j8: 96 million context switches @ 10us = 16m
>     ...
>
>     I don't understand why doubling the job count increases context
>     switching by 4x.  Any insights appreciated.
>
>
> You mentioned earlier that your host has four processors.

No, I mentioned a few times that it has 16 processors.

Thanks,
--
Robert W. Anderson
Center for Applied Scientific Computing
Email: anderson110@...
Tel: 925-424-2858  Fax: 925-423-8704

Re: homogeneous environments

by Fergus Henderson-4

On Tue, May 5, 2009 at 12:37 PM, Robert W. Anderson <anderson110@...> wrote:

>     You mentioned earlier that your host has four processors.
>
> No, I mentioned a few times that it has 16 processors.

Sorry, so you did.
In that case, I don't really understand what's going on.
Contention for the locks for the include servers??
Contention for the include server??
Maybe profiling with strace -c might help?
--
Fergus Henderson <fergus@...>
