AMBER Archive (2007)
Subject: Re: AMBER: problem running parallel jobs

From: Robert Duke (rduke_at_email.unc.edu)
Date: Wed Jun 06 2007 - 13:03:19 CDT


Nikola -
Yes, a shared NFS volume is a very bad place to write files. Very, very
bad. Write to a local volume if at all possible. An NFS-shared volume can
stall the master process for SECONDS on an attempted write, totally hosing
performance across all your processes. Regarding pmemd performance, you
should not expect too much from gigabit ethernet, but please run the
factor_ix and jac benchmarks that ship with the source code to get results
that can be compared to results on other systems. The only setting I would
suggest changing is nstlim, and as the CPU count goes above, say, 8, I would
start bumping that value up to between 2000 and 5000. With pmemd, all sorts
of load balancing goes on, and you won't get optimal times for runs of less
than 1000 steps - the code is still doing a ton of load-balance adjusting,
especially at higher CPU counts.
Regards - Bob Duke
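
One way to check whether the output volume stalls in the way described above is to time a series of synced writes on a node-local directory and on the NFS mount. The following is only a rough sketch: the function names and both paths are placeholders invented for illustration, and os.fsync is used simply to make write latency visible; the thread does not say how pmemd itself flushes its output.

```python
import os
import tempfile
import time


def time_writes(directory, n_writes=20, size_bytes=1 << 20):
    """Return per-write wall-clock times (seconds) for n_writes synced writes."""
    payload = b"\0" * size_bytes
    timings = []
    # The scratch file is created inside the directory under test so that the
    # writes hit that filesystem; it is deleted when the block exits.
    with tempfile.NamedTemporaryFile(dir=directory) as scratch:
        for _ in range(n_writes):
            start = time.perf_counter()
            scratch.write(payload)
            scratch.flush()
            os.fsync(scratch.fileno())  # push the data out of the page cache
            timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, worst) write time in seconds."""
    return sum(timings) / len(timings), max(timings)


if __name__ == "__main__":
    # Both paths are placeholders (assumptions): substitute a node-local
    # scratch directory and the NFS-mounted directory used for trajectories.
    for label, path in [("local", "/tmp"), ("nfs", "/path/to/nfs/volume")]:
        if not os.path.isdir(path):
            print(f"{label}: {path} not found, skipping")
            continue
        mean, worst = summarize(time_writes(path))
        print(f"{label}: mean {mean:.4f} s, worst {worst:.4f} s per write")
```

The worst-case time is the more telling figure here, since a single long stall on the master process is what holds up every other process.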

----- Original Message -----
From: "Nikola Trbovic" <nt2146_at_columbia.edu>
To: <amber_at_scripps.edu>
Sent: Wednesday, June 06, 2007 1:31 PM
Subject: RE: AMBER: problem running parallel jobs

> Thanks a lot Robert! That solved it! No more network errors!
>
> Now I've performed a few pmemd benchmarks (explicit water, ~16000 atoms)
> overnight and obtained terrible performance. First of all, let me repeat
> that I'm running on a gigabit cluster with four cores per node. Here are
> the benchmark results:
>
> Cores  Nodes   Time
>     4      1  16113
>     8      2  10251
>    16      4  24143
>    32      8  48138
>
> The odd thing is that while a 2-node job still achieves a 1.6-fold speedup
> over a single-node job, a 4-node job achieves no speedup at all but instead
> takes more than twice as long as a 2-node job, and an 8-node job more than
> four times as long. So above 2 nodes, run time grows roughly linearly with
> the number of nodes! I've read the recent note on pushing the limits with
> gigabit and multiple cores, but I haven't seen any benchmarks reporting such
> an extreme drop in performance. I will run new benchmarks after increasing
> the network buffers and checking my switch settings, but I still wanted to
> make sure that this type of performance scaling is not perhaps indicative of
> remaining problems with my network drivers, mpich2 installation or amber
> installation.
> I am using NFS on the cluster, and the trajectories were being saved through
> NFS on the head node. From the latest note on gigabit parallel computing, it
> sounds like that is a really bad idea. Could it explain the observed
> scaling?
>
> Thanks again to Robert, and in advance for any thoughts about the scaling
> issue,
>
> Nikola
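
The scaling in the benchmark table above can be made explicit by computing speedup and parallel efficiency relative to the single-node run. This is a minimal sketch: the function name and layout are illustrative only, the timings are copied from the table, and because the message does not state the time unit, only ratios are computed.

```python
# (cores, nodes, time) rows as quoted in the message. The time unit is not
# stated there, so only ratios between runs are computed.
RUNS = [
    (4, 1, 16113),
    (8, 2, 10251),
    (16, 4, 24143),
    (32, 8, 48138),
]


def scaling_table(runs):
    """Return (cores, nodes, speedup, efficiency) relative to the first run."""
    base_cores, _, base_time = runs[0]
    rows = []
    for cores, nodes, elapsed in runs:
        speedup = base_time / elapsed
        # Efficiency is the achieved speedup divided by the ideal speedup,
        # where ideal means run time shrinking in proportion to core count.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, nodes, speedup, efficiency))
    return rows


if __name__ == "__main__":
    print(f"{'cores':>5} {'nodes':>5} {'speedup':>8} {'efficiency':>10}")
    for cores, nodes, speedup, efficiency in scaling_table(RUNS):
        print(f"{cores:>5} {nodes:>5} {speedup:>8.2f} {efficiency:>10.2f}")
```

Relative to the single-node run, the 2-node job comes out at roughly 1.57x, while the 4-node and 8-node jobs fall below 1x, meaning they are slower than running on one node.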
>
> -----Original Message-----
> From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu]
> On Behalf Of Robert Konecny
> Sent: Tuesday, June 05, 2007 5:35 PM
> To: amber_at_scripps.edu
> Subject: Re: AMBER: problem running parallel jobs
>
> Hi Nikola,
>
> Try disabling TCP segmentation offload (TSO) on your eth0:
>
> /usr/sbin/ethtool -K eth0 tso off
>
> Some versions of the tg3 driver choke on heavier traffic.
>
> robert
>
>
>
> On Tue, Jun 05, 2007 at 04:43:02PM -0400, Nikola Trbovic wrote:
>> Dear all,
>>
>> I'm having problems running pmemd and sander with mpi on more than 2
>> nodes over gigabit ethernet. Shortly after starting the job, one of the
>> nodes (which one is random) reports a network error associated with the
>> tg3 driver:
>>
>> tg3: eth0: transmit timed out, resetting
>> tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
>> ...
>>
>> This node then disappears from the network for a couple of minutes and
>> the job stalls, although it doesn't terminate.
>>
>> Running 4 processes on one node, or even 8 on two nodes works fine,
>> however. I've tried using mpich2 and mpich, with fftw and without - it
>> made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
>> this all indicates that it is not a problem with amber, but instead with
>> my OS/tg3 driver. But I was wondering if anybody had experienced the
>> same previously and could give advice on how to fix it.
>>
>> Thanks a lot in advance,
>> Nikola Trbovic
>>
>> -----------------------------------------------------------------------
>> The AMBER Mail Reflector
>> To post, send mail to amber_at_scripps.edu
>> To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu
