AMBER Archive (2009)

Subject: RE: [AMBER] MPI process terminated unexpectedly after cluster upgrade

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Thu Nov 05 2009 - 10:22:19 CST


Hi Dmitri,

The issue here is that the mpirun command used by MPI is not just a simple
script. The most common problem I see when running in parallel is people
using an mpirun command belonging to a different MPI installation than the
mpif90 they used to build AMBER. This leads to very strange and
unpredictable behavior. It looks like your system administrator has done a
lot of customization of your MPI environment, which will make it VERY hard
for us to debug what is going on via this list. In such cases you really
have no choice but to work through the problem with the system
administrator since, if they have made their own mpirun in /usr/local/bin/
etc., we have no idea what it is doing or how it will behave.
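
A quick sanity check you can do yourself (these are just standard shell
commands, nothing AMBER-specific) is to compare where the two wrappers
actually live:

  which mpif90
  which mpirun

Both should report the same directory, in your case
/usr/lib/mvapich-intel-x86_64/bin/. If 'which mpirun' instead comes back with
the wrapper in /usr/bin/ (or /usr/local/bin/), you are almost certainly
mixing installations.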

What is being used for the queuing system? PBS? If so then I suggest editing
your job submission script to include the full path to mpirun, i.e.
/usr/lib/mvapich-intel-x86_64/bin/mpirun, to force it to use this one and not
the wrapper in /usr/bin/, which is obviously of dubious origin. Note I also
see no machine file specified in your mpirun command and no evidence of the
mpd daemon being started. Typically an mvapich run submitted through PBS
would look something like this:

---------------------------

#!/bin/tcsh
#PBS -l walltime=24:00:00
# Run on 64 cores
#PBS -l nodes=8:ppn=8
# export all my environment variables to the job
#PBS -V
# job name (default = name of script file)
#PBS -N testjob
#PBS -A account_number_to_charge

mvapich2-start-mpd
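# NP = number of lines in the PBS node file (one line per requested core);
# the cut strips the file name from wc's output, leaving just the count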
setenv NP `wc -l ${PBS_NODEFILE} | cut -d'/' -f1`

setenv AMBERHOME ~/amber10
setenv MV2_SRQ_SIZE 4000

cd /scratch/directory_to_run_job_in/

mpirun -machinefile ${PBS_NODEFILE} -np ${NP} $AMBERHOME/exe/pmemd -O \
     -i mdin -o mdout \
     -p prmtop -c inpcrd \
     -r restrt -x mdcrd </dev/null

mpdallexit
--------------------------
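
The whole script is then handed to PBS with qsub, rather than running mpirun
by hand on the head node, e.g. (the script file name here is just an example):

  qsub run_pmemd.csh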

As another note, I would also specify the full path to sander.MPI, e.g.
mpirun ..... $AMBERHOME/exe/sander.MPI ..., since if you do not have
$AMBERHOME/exe/ or ./ in your path (which you should not, since that is VERY
dangerous) you may have execution issues.
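
Using the same file names as the example script above, that line would look
something like:

mpirun -machinefile ${PBS_NODEFILE} -np ${NP} $AMBERHOME/exe/sander.MPI -O \
     -i mdin -o mdout \
     -p prmtop -c inpcrd \
     -r restrt -x mdcrd </dev/null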

Good luck,
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

> -----Original Message-----
> From: amber-bounces_at_ambermd.org [mailto:amber-bounces_at_ambermd.org] On
> Behalf Of Dmitri Nilov
> Sent: Thursday, November 05, 2009 6:55 AM
> To: AMBER Mailing List
> Subject: Re: [AMBER] MPI process terminated unexpectedly after cluster
> upgrade
>
> Hello, mpif90, mpicc and mpirun are all in the desired folder
> /usr/lib/mvapich-intel-x86_64/bin/.
>
> Users run a task by typing on the command line "mpirun -np 32 -maxtime 1000
> -queue regular sander.MPI -sander_options". This invokes the /usr/bin/mpirun
> script, which just adds the task to the queue. When processors become
> available the task is run by the genuine mpirun.
>
> I'm not sure the problem is related to this "sophisticated" procedure,
> because it worked before the upgrade..
>
> Thanks,
> Dmitri Nilov
> Lomonosov Moscow State University
> On Thu, Nov 5, 2009 at 5:38 PM, Dmitri Nilov <nilovdm_at_gmail.com> wrote:
>
> >
> >
> > On Thu, Nov 5, 2009 at 5:12 PM, Jason Swails <jason.swails_at_gmail.com>
> > wrote:
> >
> >> Hello,
> >>
> >> mpirun is NOT used during installation. mpif90 and mpicc are used during
> >> installation as the compiler wrappers to invoke the desired compilers and
> >> link to the proper mpi libraries in a 'blackbox' sort of way. mpirun is
> >> only used during MPI execution in amber. You should make sure that your
> >> environment is set up to use the mvapich that you want. Try typing "which
> >> mpif90" to see exactly which mpi installation you're using. You can use
> >> mpi-selector to choose an mpi installation, and this should automatically
> >> set up your environment to use that MPI.
> >>
> >> Your email sounds like your cluster does not allow users to run mpirun
> >> interactively (i.e. typing it on the command line) and it is only
> >> available when submitted to the queue (though I may be wrong). Have you
> >> been trying to run interactively or submitting to the queue? Your system
> >> administrators should be able to help you get your environment set up
> >> correctly to build amber.
> >>
> >> Good luck!
> >> Jason
> >>
> >> On Thu, Nov 5, 2009 at 8:39 AM, Dmitri Nilov <nilovdm_at_gmail.com>
> >> wrote:
> >>
> >> > 'which mpirun' gives /usr/bin only.
> >> > The system administrator has given some explanations:
> >> > /usr/bin/mpirun is a script that queues the task. All mpirun/mpiexec/etc.
> >> > scripts are forbidden for regular users as simple 'fool protection' of
> >> > the head node. Running a task directly is forbidden.
> >> > Just before task execution the job gets special group privileges, and
> >> > from that moment it can execute mpirun from the location pointed to by
> >> > mpi-selector.
> >> >
> >> > So now the question is: how is mpirun used during the amber installation?
> >> >
> >> > Thanks!
> >> > Dmitri Nilov,
> >> > Lomonosov Moscow State University
> >> >
> >> > On Wed, Nov 4, 2009 at 2:13 AM, Ross Walker <ross_at_rosswalker.co.uk>
> >> > wrote:
> >> >
> >> > > Hi Dmitri,
> >> > >
> >> > > > And one thing more.
> >> > > > mpif90 is in /usr/lib/mvapich-intel-x86_64/bin/ on the cluster.
> >> > > > There is also an mpirun in this folder, but its execution is
> >> > > > forbidden. That was done so that mpirun can only be run from
> >> > > > /usr/bin/. So could this cause a problem?
> >> > >
> >> > > Yes, this is almost certainly your problem. If the mpirun being used
> >> > > to execute the code does not come from the same MPI installation used
> >> > > to build the code, you will get very strange and unpredictable
> >> > > behavior. Is there an mpiexec in the mvapich bin directory that you
> >> > > can execute? You may need to use this instead. Or ask your admin to
> >> > > give you execute permission on the mvapich mpirun.
> >> > >
> >> > > Right now, if you execute 'which mpirun', which one do you get? If it
> >> > > is the /usr/bin one and not the mvapich one then this is where your
> >> > > problem lies.
> >> > >
> >> > > Good luck,
> >> > > Ross
> >> > >
> >> > > /\
> >> > > \/
> >> > > |\oss Walker
> >> > >
> >> > > | Assistant Research Professor |
> >> > > | San Diego Supercomputer Center |
> >> > > | Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
> >> > > | http://www.rosswalker.co.uk | PGP Key available on request |
> >> > >
> >> > > Note: Electronic Mail is not secure, has no guarantee of delivery,
> >> > > may not be read every day, and should not be used for urgent or
> >> > > sensitive issues.
> >> > >
> >> > >
> >> > >
> >> > >
> >>
> >>
> >>
> >> --
> >> ---------------------------------------
> >> Jason M. Swails
> >> Quantum Theory Project,
> >> University of Florida
> >> Ph.D. Graduate Student
> >> 352-392-4032
> >
> >

_______________________________________________
AMBER mailing list
AMBER_at_ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber