AMBER Archive (2007)

Subject: Re: AMBER: problem running parallel jobs

From: Robert Konecny (rok_at_ucsd.edu)
Date: Tue Jun 05 2007 - 16:34:44 CDT


Hi Nikola,

Try disabling TCP segmentation offload (TSO) on your eth0 interface:

/usr/sbin/ethtool -K eth0 tso off

Some versions of the tg3 driver choke under heavier traffic.
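
If it helps, here is a rough sketch of how one might check the current TSO
state before and after changing it. The helper name tso_state and the
IFACE and APPLY variables are made up for illustration and are not part of
ethtool. It assumes the state shows up in the `ethtool -k` output as a
"tcp segmentation offload" (or "tcp-segmentation-offload") line, which can
differ between ethtool versions. Nothing is changed unless APPLY=1 is set,
and changing the setting needs root.

```shell
#!/bin/sh
# Sketch only: IFACE, APPLY and tso_state are illustrative names.
IFACE="${IFACE:-eth0}"

# Read `ethtool -k` output on stdin and print the TSO state ("on" or "off").
# Accepts both the "tcp segmentation offload:" and the
# "tcp-segmentation-offload:" spellings of the feature line.
tso_state() {
    awk -F': *' '/^[ \t]*tcp[- ]segmentation[- ]offload:/ {
        split($2, a, " ")   # drop any trailing note such as "[fixed]"
        print a[1]
        exit
    }'
}

# Only touch the interface when explicitly asked to (APPLY=1, run as root).
if [ "${APPLY:-0}" = 1 ]; then
    before=$(ethtool -k "$IFACE" | tso_state)
    echo "TSO on $IFACE before: $before"
    if [ "$before" = on ]; then
        ethtool -K "$IFACE" tso off
    fi
    echo "TSO on $IFACE after: $(ethtool -k "$IFACE" | tso_state)"
fi
```

To use it, save it under any name (tso-off.sh here is just a placeholder)
and run it as root on each node with `APPLY=1 sh tso-off.sh`. Settings made
with ethtool do not survive a reboot, so the command would have to be
reapplied at boot if it turns out to help.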

robert

On Tue, Jun 05, 2007 at 04:43:02PM -0400, Nikola Trbovic wrote:
> Dear all,
>
> I'm having problems running pmemd and sander with mpi on more than 2
> nodes over gigabit ethernet. Shortly after starting the job, one of the
> nodes (which one is random) reports a network error associated with the
> tg3 driver:
>
> tg3: eth0: transmit timed out, resetting
> tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
> ...
>
> This node then disappears from the network for a couple of minutes and
> the job stalls, although it doesn't terminate.
>
> Running 4 processes on one node, or even 8 on two nodes works fine,
> however. I've tried using mpich2 and mpich, with fftw and without - it
> made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
> this all indicates that it is not a problem with amber, but instead with
> my OS/tg3 driver. But I was wondering if anybody had experienced the
> same previously and could give advice on how to fix it.
>
> Thanks a lot in advance,
> Nikola Trbovic
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber_at_scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu