AMBER Archive (2009)

Subject: Re: [AMBER] MPI process terminated unexpectedly after cluster upgrade

From: Jason Swails (jason.swails_at_gmail.com)
Date: Tue Nov 03 2009 - 08:11:48 CST


You should just need to run make test.parallel to test sander.MPI. If the
test cases are failing, it's almost certainly your installation. Also make
sure you have your MPI_HOME variable set correctly (to the version of
mvapich you want to use). Do a thorough make clean to make sure you don't
have any remnants of the old installation left over, then do make parallel
(after configuring, of course). ./configure_amber -mpich ifort will produce
a config.h file that creates a working installation for mvapich (that's how
I've done it), but I'm not an expert in these minute details. However, if
you use the appropriate wrappers for your installation (i.e., make sure your
Fortran compiler is mpif90), then you should be fine.

Just make sure that your MPI_HOME is set to the location of the mvapich
version you wish to use. Many compute clusters leave old versions of MPI
implementations installed so that users don't have to go through this each time.
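
To make that concrete, the whole sequence is roughly as follows (a sketch only,
assuming a bash shell and Amber10; the mvapich path and the mpirun command are
just examples, so substitute whatever your cluster actually provides):

  # Point MPI_HOME at the mvapich installation you want to build against.
  export MPI_HOME=/usr/lib/mvapich-intel-x86_64      # example path only
  export PATH=$MPI_HOME/bin:$PATH
  which mpif90                        # should come from $MPI_HOME/bin

  cd $AMBERHOME/src
  make clean                          # drop objects built against the old MPI
  ./configure_amber -mpich ifort      # regenerate config.h for the new MPI
  make parallel                       # build sander.MPI

  cd $AMBERHOME/test
  export DO_PARALLEL="mpirun -np 2"   # example run command; adjust as needed
  make test.parallel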

As for your query about being forbidden to execute mpirun, using /usr/bin/mpirun
will only be OK if that file is a link to, or a copy of,
/usr/lib/mvapich-intel-x86_64/bin/mpirun. This sounds like a question to
ask the administrator of the cluster.
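
A quick way to check that yourself (again just a sketch; the mvapich path is
the one mentioned above):

  which mpirun                       # which mpirun is first in your PATH?
  ls -l /usr/bin/mpirun              # is it a symlink, and to where?
  readlink -f /usr/bin/mpirun        # resolve any chain of links
  cmp /usr/bin/mpirun /usr/lib/mvapich-intel-x86_64/bin/mpirun \
    && echo "identical files"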

Good luck!
Jason

On Tue, Nov 3, 2009 at 8:20 AM, Dmitri Nilov <nilovdm_at_gmail.com> wrote:

> Yes, I've followed all these instructions. The program is
> Amber10/Sander.MPI. The serial tests are OK. Most of the parallel test cases
> finish with "possible FAILURE: check *.dif", and the corresponding
> sander.MPI.out files contain a similar error.
> Which test cases are most appropriate for analysing the outputs?
>
> > ./configure -mvapich ifort
> I suppose that means ./configure_amber -mpich ifort?
>
> I don't suppose there could be serious mistakes in the InfiniBand or mvapich
> installation.
>
> Thanks!
> Dmitri Nilov,
> Lomonosov Moscow State University
>
> On Tue, Nov 3, 2009 at 2:36 AM, Ross Walker <ross_at_rosswalker.co.uk> wrote:
>
> > Are you certain it is linking to the correct version of InfiniBand?
> >
> > Make sure you do the following:
> >
> > I assume this is sander, but similar instructions should be followed for
> > pmemd.
> >
> > 1) run > which mpif90
> >
> > Check that it is the path you expect. Check that it is the same path as
> > mpirun. Also check that the compute nodes use the same mpirun.
> >
> > 2) cd $AMBERHOME/src/
> > 3) make clean
> > 4) Update your MPI_HOME to point to the NEW mpi location
> > 5) ./configure -mvapich ifort
> > 6) make parallel
> > 7) Run the test suite in parallel and see if this works - probably
> >    easiest to request an interactive session on your cluster and then set
> >    DO_PARALLEL to the correct run command, e.g.
> >    "mpirun -np 8 -machinefile $PBS_NODEFILE", and
> >    cd $AMBERHOME/test/; make test.parallel
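> >
> > In shell form that is roughly (a sketch, assuming a bash shell and a
> > PBS-style batch system, as in the example run command above):
> >
> >   export DO_PARALLEL="mpirun -np 8 -machinefile $PBS_NODEFILE"
> >   cd $AMBERHOME/test
> >   make test.parallel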
> >
> > If this crashes then I would check to make sure the new MVAPICH is actually
> > working properly. There should be a test suite with it that checks it is
> > working. Is it definitely using the correct version, e.g. the 64-bit
> > version on x86_64?
> >
> > Note, if you just recompiled without making clean, without building a new
> > config_amber.h file, and without updating your MPI_HOME, then it has likely
> > been built with a mix of the old and new versions of MPI, which is probably
> > what is causing your problems.
> >
> > Also make sure you are up to date on all the bugfixes.
> >
> > All the best
> > Ross
> >
> > > -----Original Message-----
> > > From: amber-bounces_at_ambermd.org [mailto:amber-bounces_at_ambermd.org] On
> > > Behalf Of Dmitri Nilov
> > > Sent: Monday, November 02, 2009 5:11 AM
> > > To: AMBER Mailing List
> > > Subject: Re: [AMBER] MPI process terminated unexpectedly after cluster
> > > upgrade
> > >
> > > Yes, I've recompiled Amber, but I couldn't change mvapich because I'm
> > > just a user on a large shared cluster.
> > >
> > > On Mon, Nov 2, 2009 at 3:15 PM, Jason Swails <jason.swails_at_gmail.com>
> > > wrote:
> > >
> > > > It could be that the new version of mvapich broke the previous
> > > > installation, since the libraries could easily have changed (and if it's
> > > > really, in fact, a new version, I'd bet on it, since there's not much
> > > > else that could 'change'). Did you try recompiling?
> > > >
> > > > Do the test cases still pass? If not, I'd say your only options are to
> > > > recompile amber/pmemd in parallel or revert to the old version of
> > > > mvapich, if it's still on the cluster.
> > > >
> > > > Good luck!
> > > > Jason
> > > >
> > > > On Mon, Nov 2, 2009 at 4:17 AM, Dmitri Nilov <nilovdm_at_gmail.com>
> > > > wrote:
> > > >
> > > > > Hello!
> > > > > Sander.MPI tasks are crashing right after launch, ever since the
> > > > > mvapich software was upgraded on the cluster.
> > > > > Sander.MPI.out contains:
> > > > >
> > > > > MPI process terminated unexpectedly
> > > > > Exit code -5 signaled from node-23-06
> > > > > Killing remote processes...forrtl: error (69): process interrupted (SIGINT)
> > > > > Image              PC                Routine   Line      Source
> > > > > libpthread.so.0    00007F2132C1EB00  Unknown   Unknown   Unknown
> > > > > libpthread.so.0    00007F2132C1DB7E  Unknown   Unknown   Unknown
> > > > > libmpich.so.1.0    00007F21334CB1AC  Unknown   Unknown   Unknown
> > > > > libmpich.so.1.0    00007F21334E1ADE  Unknown   Unknown   Unknown
> > > > > libmpich.so.1.0    00007F21334C050A  Unknown   Unknown   Unknown
> > > > > libmpich.so.1.0    00007F21334A2DED  Unknown   Unknown   Unknown
> > > > > libmpich.so.1.0    00007F21334A1DC6  Unknown   Unknown   Unknown
> > > > > sander.MPI         000000000093A0EF  Unknown   Unknown   Unknown
> > > > > sander.MPI         00000000004BC222  Unknown   Unknown   Unknown
> > > > > sander.MPI         000000000041E05C  Unknown   Unknown   Unknown
> > > > > libc.so.6          00007F213216ACF4  Unknown   Unknown   Unknown
> > > > > sander.MPI         000000000041DF69  Unknown   Unknown   Unknown
> > > > > forrtl: error (69): process interrupted (SIGINT)
> > > > > and so on...
> > > > >
> > > > > I've found a similar problem at
> > > > > http://archive.ambermd.org/200907/0092.html, which seems to be still
> > > > > unsolved.
> > > > > I don't think it's an InfiniBand problem. So what do I have to do?
> > > > >
> > > > > Thanks a lot!
> > > > > Dmitri Nilov,
> > > > > Lomonosov Moscow State University
> > > > >

-- 
---------------------------------------
Jason M. Swails
Quantum Theory Project,
University of Florida
Ph.D. Graduate Student
352-392-4032
_______________________________________________
AMBER mailing list
AMBER_at_ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber