AMBER Archive (2002)

Subject: Re: parallel jobs die with no error message from sander

From: Scott Brozell (sbrozell_at_scripps.edu)
Date: Mon Jul 29 2002 - 17:43:51 CDT


Hello,

I have had problems similar to those of observations 1 through 5 below
using, for example, the Portland Group (pg) compilers and
the ch_p4 device of MPICH 1.2.3 configured with
--with-comm=shared -c++=pgCC -cc=pgcc -fc=pgf77 -f90=pgf90.
My experience benchmarking and testing on two Red Hat 7.1 Ethernet clusters
(a busy 32-node cluster of dual 1200 MHz AMD Palomino processors and
an underused 16-node cluster of dual 600 MHz Intel Pentium III processors)
is that LAM 6.5.6 and MP_LITE 2.3.1 are more stable than MPICH.
For the jac benchmark, LAM, MP_LITE, and MPICH showed similar performance.
I describe how I used LAM and MP_LITE in the following paragraphs.
Another idea is to build a debug version of MPICH as presented on page 70
of the Installation and User's Guide, version 1.2.3.

LAM-MPI 6.5.6 was easy to install following the INSTALL document.
http://www.lam-mpi.org/
I configured with
./configure --prefix somepath --with-fc=pgf77 --with-cxx=pgCC --with-cc=pgcc
Here is a script snippet showing usage of lamboot, mpirun, and lamhalt;
the processor counts to benchmark are passed as command-line arguments.
It could easily be modified to a PBS script by replacing $NODEFILE
with $PBS_NODEFILE:

#!/bin/sh
DO_PARALLEL="$MPI_HOME/bin/mpirun C "
localmachinefile=./machinefile.$$
# each command-line argument is a processor count to benchmark
for number_of_processors in $@
do
    cat $NODEFILE | tail -$number_of_processors > $localmachinefile
    lamboot -v $localmachinefile
    $DO_PARALLEL sander -O -i mdin -c inpcrd.equil -o stdout 1> $number_of_processors.$$ 2>&1
    lamhalt -v $localmachinefile
    # if lamhalt could not shut the LAM daemons down cleanly, wipe them
    if [ $? -ne 0 ]
    then
        wipe -v $localmachinefile
    fi
done

MP_LITE 2.3.1 was trivial to install following the README document.
http://www.scl.ameslab.gov/Projects/MP_Lite/MP_Lite.html
However, because MP_LITE does not provide mpi_reduce, sander/debug.f
and sander/parallel.f had to be edited to comment out the mpi_reduce calls.
This should not cause problems for molecular dynamics jobs, as opposed
to minimization runs. I may write an mpi_reduce routine for MP_LITE.
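
A minimal, untested sketch of what such a routine might look like, assuming
MP_Lite's MPI layer already supplies mpi_comm_rank, mpi_comm_size, mpi_send,
mpi_recv, and the usual mpif.h constants. It handles only MPI_SUM on double
precision data; the name my_mpi_reduce and the fixed scratch size are my own:

c     Stand-in for mpi_reduce: sums double precision data onto the
c     root task only.  The datatype and op arguments are accepted but
c     ignored (MPI_SUM on MPI_DOUBLE_PRECISION is assumed), and count
c     must not exceed maxbuf.  Untested sketch.
      subroutine my_mpi_reduce(sendbuf, recvbuf, count, datatype,
     +                         op, root, comm, ierr)
      implicit none
      include 'mpif.h'
      integer count, datatype, op, root, comm, ierr
      double precision sendbuf(count), recvbuf(count)
      integer maxbuf
      parameter (maxbuf = 100000)
      double precision tmp(maxbuf)
      integer myrank, nproc, iproc, i
      integer status(MPI_STATUS_SIZE)

      call mpi_comm_rank(comm, myrank, ierr)
      call mpi_comm_size(comm, nproc, ierr)
      if (myrank .ne. root) then
c        non-root tasks just ship their contribution to the root
         call mpi_send(sendbuf, count, MPI_DOUBLE_PRECISION,
     +                 root, 99, comm, ierr)
      else
c        the root starts from its own data and adds each task's piece
         do i = 1, count
            recvbuf(i) = sendbuf(i)
         end do
         do iproc = 0, nproc - 1
            if (iproc .ne. root) then
               call mpi_recv(tmp, count, MPI_DOUBLE_PRECISION,
     +                       iproc, 99, comm, status, ierr)
               do i = 1, count
                  recvbuf(i) = recvbuf(i) + tmp(i)
               end do
            end if
         end do
      end if
      return
      end

Named mpi_reduce and simply added to the link, something like this might
make the edits to debug.f and parallel.f unnecessary.
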
Here is a script snippet showing usage of mprun;
it could easily be modified to a PBS script by replacing $NODEFILE
with $PBS_NODEFILE:

#!/bin/sh
DO_PARALLEL="$MPI_HOME/bin/mprun -np "
localmachinefile=./machinefile.$$
for number_of_processors in $@
do
    cat $NODEFILE | tail -$number_of_processors > $localmachinefile
    $DO_PARALLEL $number_of_processors -h `cat $localmachinefile` sander -O -i mdin -c inpcrd.equil -o stdout 1> $number_of_processors.$$ 2>&1
done

Scott Brozell, Ph.D. | e-mail: sbrozell_at_scripps.edu
Dept. of Molecular Biology, TPC15 | fax: +1-858-784-8896
The Scripps Research Institute | phone: +1-858-784-8754
10550 N. Torrey Pines Rd. | home page:
La Jolla CA 92037 USA | http://www.scripps.edu/~sbrozell

On Mon, 29 Jul 2002 lamon_at_lav.boehringer-ingelheim.com wrote:

>
> > I am having the same problem with Amber7/sander as Mark Hsieh (see below)
> > in which my jobs stop at random points for no obvious reason. My error
> > output for a job running on four processors looks like the following:
> >
> > net_recv failed for fd = 8
> > p0_25797: p4_error: net_recv read, errno = : 104
> > bm_list_25798: p4_error: interrupt SIGINT: 2
> > rm_l_3_31636: p4_error: interrupt SIGINT: 2
> > p1_19028: p4_error: interrupt SIGINT: 2
> > Broken pipe
> > rm_l_2_7717: p4_error: interrupt SIGINT: 2
> > p3_31635: p4_error: interrupt SIGINT: 2
> > Broken pipe
> > rm_l_1_19029: p4_error: interrupt SIGINT: 2
> > Broken pipe
> > /software/mpich-1.2.1/bin/mpirun: line 1: 25797 Broken pipe
> > /software/amber7/exe_lnx_pll/sander "-i" "md.in" "-o" "md2.out" "-p"
> > "md.parm" "-c" "md1.rst" "-r" "md2.rst" "-x" "md2.crd" "-O" -p4pg PI25665
> > -p4wd .
> >
> >
> > Here are my observations:
> >
> > 1. My jobs end unexpectedly with no error message from sander.
> > 2. The problem is intermittent. If I run it one time, it might die after
> > 2 ps and another time it will die after 50 ps.
> > I have some systems in which it does not happen at all.
> > 3. Output from sander stops several minutes before the job exits.
> > 4. There are no system error messages indicating a hardware failure.
> > 5. It only happens with jobs running on more than one processor.
> > 6. I have tried it on two different linux clusters (compiled with g77)
> > running two different versions of mpich (1.2.1 and 1.2.4).
> > 7. A job run with the same input parameters and restart file but using
> > Amber6/sander will run with no problems.
> >
> > Any suggestions?
> >
> > Thanks,
> > Lynn
> >
> >
> >
> > Dr. Lynn Amon
> > Research Scientist
> > Boehringer-Ingelheim (Canada) Ltd.
> > 2100 Cunard Street
> > Laval (Quebec) Canada H7S 2G5
> > (450) 682-4640
> > lamon_at_lav.boehringer-ingelheim.com
> >
> >
> >
> >
> > From: "Mark Hsieh" <hsieh_at_abmaxis.com>
> > Subject: mpirun/sander problem
> > Date: Mon, 6 May 2002 13:00:32 -0700
> > Message-ID: <MPEKLFFCDNOPEMCNBJCGMEPFCAAA.hsieh_at_abmaxis.com>
> >
> > Hi,
> > For some reason, my mpirun/sander molecular dynamics simulations are
> > stopping at random time points with the following errors:
> > p0_4155: (11784.142298) net_recv failed for fd = 6
> > p0_4155: p4_error: net_recv read, errno = : 110
> > p3_2900: (11783.828019) net_recv failed for fd = 6
> > /disk1/app/mpich-1.2.3/bin/mpirun: line 1: 4155 Broken pipe
> > /disk1/app/amber7/exe/sander "-O" "-i" "md7.in" "-o" "md7.out" "-p"
> > "prmtop"
> > "-c" "min7.rst" "-r" "md7.rst" "-x" "md7.mdcrd" "-ref" "min7.rst" "-inf"
> > "md7.mdinfo" -p4pg pgfile4 -p4wd /disk1/hsieh.tmp/amber.runs/
> > P4 procgroup file is pgfile4.
> > p3_2900: p4_error: net_recv read, errno = : 104
> > pgfile4 calls up two dual PIII workstations:
> > tiger 0 /disk1/app/amber7/exe/sander
> > tiger 1 /disk1/app/amber7/exe/sander
> > cow2 1 /disk1/app/amber7/exe/sander
> > cow2 1 /disk1/app/amber7/exe/sander
> > Three identical runs produced mdinfo files that indicated 4.1, 19.1 and
> > 7.1 ps as their last update.
> > Thank you,
> > Mark
> >