AMBER Archive (2007)

Subject: AMBER: problem running parallel jobs

From: Nikola Trbovic (nt2146_at_columbia.edu)
Date: Tue Jun 05 2007 - 15:43:02 CDT


Dear all,

I'm having problems running pmemd and sander with MPI on more than 2
nodes over Gigabit Ethernet. Shortly after the job starts, one of the
nodes (which one varies at random) reports a network error associated
with the tg3 driver:

tg3: eth0: transmit timed out, resetting
tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
...

This node then disappears from the network for a couple of minutes and
the job stalls, although it doesn't terminate.
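
To check which nodes are affected, the kernel log on each node can be
searched for these messages. The following is only a minimal sketch in
Python: the pattern is based solely on the two tg3 lines quoted above
(other driver versions may word them differently), and the function
name find_tg3_timeouts is a hypothetical helper, not part of any
existing tool.

```python
import re

# Pattern built from the two tg3 messages quoted in this post; the
# interface name (e.g. eth0) is optional because the second message
# does not include one. Other driver versions may use other wording.
TG3_TIMEOUT = re.compile(r"tg3: (?:(?P<iface>eth\d+): )?(?P<msg>.*timed out.*)")


def find_tg3_timeouts(lines):
    """Return (interface, message) tuples for tg3 timeout lines.

    `lines` is any iterable of log lines, for example the output of
    dmesg split on newlines. The interface is None when the message
    does not name one.
    """
    hits = []
    for line in lines:
        match = TG3_TIMEOUT.search(line)
        if match:
            hits.append((match.group("iface"), match.group("msg").strip()))
    return hits


if __name__ == "__main__":
    sample = [
        "tg3: eth0: transmit timed out, resetting",
        "tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2",
        "eth0: link up",
    ]
    for iface, msg in find_tg3_timeouts(sample):
        print(iface, msg)
```

Running this over the log of every node in the job would show whether
the timeouts are confined to particular machines or really do move
around at random.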

Running 4 processes on one node, or even 8 on two nodes, works fine,
however. I've tried both MPICH2 and MPICH, with and without FFTW; it
made no difference. I'm compiling pmemd with ifort on RHEL 4. I know
this all indicates that the problem is not with AMBER but with my OS
or the tg3 driver. Still, I was wondering whether anybody has
experienced the same thing and could give advice on how to fix it.

Thanks a lot in advance,
Nikola Trbovic
