AMBER Archive (2007)

Subject: RE: AMBER: REMD and mpiexec

From: Steve Young (chemadm_at_hamilton.edu)
Date: Thu Jun 07 2007 - 12:47:37 CDT


Hello again,
        Well, I am able to verify that the job we are using to test this is
in fact set up properly. I'm able to run it on our IRIX-based clusters
using the same files. All I am modifying is the batch submission file:
basically changing mpiexec to mpirun, since the native SGI MPI doesn't
provide mpiexec.
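        For reference, the only part of the submission script that differs
between the two clusters is the launcher line; roughly speaking (node
counts, paths, and file names below are illustrative rather than our
exact script):

    # IRIX cluster, native SGI MPI:
    mpirun -np 16 $AMBERHOME/exe/sander.MPI -ng 8 -groupfile remd.groupfile

    # Linux cluster, OSC mpiexec under Torque:
    mpiexec -n 16 $AMBERHOME/exe/sander.MPI -ng 8 -groupfile remd.groupfile
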
        This job also runs on the cluster in question when we use mpirun, but
then we hit the original problem of the nodes PBS allocates not matching
the nodes the job actually runs on. We only see the error when we launch
it with the OSC mpiexec.
        So I am guessing there is some interaction between Amber's Replica
Exchange and the OSC version of mpiexec, especially since normal MD
sander.MPI jobs work fine with mpiexec. It seems like we are close to
getting this to work, but I'm not sure what else to try.
        I'm thinking perhaps I should recompile everything with a different
MPI implementation, such as LAM instead of MPICH. Are there any
recommendations for an MPI that works well with Torque (PBS) and
Amber 9? Thanks in advance for any suggestions.

-Steve

On Wed, 2007-06-06 at 09:06 -0700, Ross Walker wrote:
> Hi Steve,
>
> With the REMD runs, if you specify 8 groups then you need to specify either
> 8 threads, or 16 threads, or 24 threads, and so on. The number of MPI
> threads you run must be a multiple of the number of groups you are
> requesting. This doesn't necessarily mean that you need this many
> processors. For example, you could run 8 threads on a single dual-core
> machine - for x86-type chips this doesn't hurt you too much, i.e. you don't
> pay too big a price for the time slicing and you get about 25% performance
> for each thread. However, on some architectures, namely IA64 and Power4/5,
> this really hurts since the overhead for swapping threads is huge.
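>
> For example, with 8 replicas either of the following would satisfy that
> requirement (the executable path and groupfile name here are just
> placeholders):
>
>   # bare minimum: 1 MPI process per replica
>   mpiexec -n 8 $AMBERHOME/exe/sander.MPI -ng 8 -groupfile remd.groupfile
>
>   # 2 MPI processes per replica
>   mpiexec -n 16 $AMBERHOME/exe/sander.MPI -ng 8 -groupfile remd.groupfile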
>
> So from the error message below it would appear that the number of threads
> you requested was not a multiple of the number of groups. If it was, then
> please post a message back and we can try to debug it further.
>
> All the best
> Ross
>
> /\
> \/
> |\oss Walker
>
> | HPC Consultant and Staff Scientist |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may not
> be read every day, and should not be used for urgent or sensitive issues.
>
> > -----Original Message-----
> > From: owner-amber_at_scripps.edu
> > [mailto:owner-amber_at_scripps.edu] On Behalf Of Steve Young
> > Sent: Tuesday, June 05, 2007 17:53
> > To: amber_at_scripps.edu
> > Subject: AMBER: REMD and mpiexec
> >
> > Hello,
> > We have a Beowulf cluster that is running torque-2.0.0p7 (PBS), Red Hat
> > Enterprise 4, Amber 9, and mpich2-1.0.5. We've had a heck of a time with
> > using mpiexec vs. mpirun when trying to run different aspects of
> > sander.MPI.
> >
> > Some history: if we use mpirun (with or without the PBS queue system),
> > sander.MPI runs as expected. We get good output with no errors.
> >
> > However, there is one major issue: the nodes that PBS allocates are not
> > always the nodes the job actually runs on. This is a problem since MPICH
> > seems to manage process placement itself when launched with mpirun. In
> > posting to the mpich-discuss list I found out I needed to use the mpiexec
> > program within the MPICH distro, and it also turned out I needed the
> > version from OSC that works with Torque. After installing OSC mpiexec, we
> > ran some normal sander.MPI jobs and received output as expected. So now
> > we are starting to test some Replica Exchange jobs that we've run on
> > other clusters.
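> >
> > For what it's worth, the Torque submission script for these tests looks
> > roughly like this (the resource request and file names are placeholders
> > rather than our exact script):
> >
> >   #!/bin/bash
> >   #PBS -l nodes=4:ppn=4
> >   #PBS -l walltime=24:00:00
> >   cd $PBS_O_WORKDIR
> >   # OSC mpiexec picks up the node list from Torque, so the job runs on
> >   # exactly the nodes that PBS allocated
> >   mpiexec -n 16 $AMBERHOME/exe/sander.MPI -O -i md.in -o md.out \
> >           -p prmtop -c inpcrd -r restrt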
> >
> > Here is some of my post from the mpich-discuss listserv:
> >
> > <.... snip ...>
> > Ok, so I got the OSC version of mpiexec. This appears to work very well
> > running normal sander.MPI. Requesting 16 CPUs, we verify good output and
> > near 100% utilization of all 16 processes. The next thing we want to use
> > is another part of Amber called Replica Exchange, which basically amounts
> > to different arguments to the sander.MPI program. When I run this part of
> > the program I end up with the following results:
> >
> > Error: specified more groups ( 8 ) than the number of processors ( 1 ) !
> > [unset]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> > Error: specified more groups ( 8 ) than the number of processors ( 1 ) !
> > [unset]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> > Error: specified more groups ( 8 ) than the number of processors ( 1 ) !
> >
> >
> > Now I realize I should be posting to the Amber list, as this appears to
> > be an Amber-related problem, and I would tend to believe that myself.
> > But what I can't explain is why, when I change back to the original
> > mpirun, the program runs fine using the exact same files.
> >
> > <.... snip....>
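> >
> > One sanity check we can still try is to confirm how many processes each
> > launcher actually starts under the same PBS allocation, e.g. something
> > along these lines (the node count is just an example):
> >
> >   #PBS -l nodes=4:ppn=4
> >   cd $PBS_O_WORKDIR
> >   mpirun -np 16 hostname    # MPICH's mpirun
> >   mpiexec -n 16 hostname    # OSC mpiexec under Torque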
> >
> >
> > So does this mean that Amber 9 isn't working properly with the OSC
> > version of mpiexec? Which combinations of Amber and MPI work best on a
> > Torque Beowulf cluster? Thanks in advance for any advice.
> >
> > -Steve
> >
> >
>
>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu