AMBER Archive (2008)
Subject: RE: AMBER: Amber9 with MPICH2 failure at runtime
From: Sasha Buzko (obuzko_at_ucla.edu)
Date: Wed Apr 16 2008 - 19:06:14 CDT
Thanks, Ross.
We have a pretty decent switch (Cisco), but other than that the setup
is far from optimal. I'm not sure what we can do about it in the near
term (probably close to nothing).
Speaking of pmemd, I did compile it and tried to run it on the same
example from the tutorial. For some reason, I only get one pmemd process
on one node, and it just hangs there using 100% of the cpu/core without
writing any output. Is there anything that could have been missed during
compilation?
I'm not sure about the command-line arguments either, since there is
virtually no documentation on running pmemd. I used the same command
line as for sander.MPI. It ran for half an hour (the whole run took 3
minutes with sander.MPI) and still produced nothing except the header
of the out file with the input data. And that is happening on a single
node using mpich2 (mpd is started only on that node).
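For reference, a pmemd launch under MPICH2 might look like the sketch
below. pmemd accepts the same -O/-i/-o/-c/-p/-r/-x flags as sander.MPI;
the process count and file names here are placeholders, not the actual
tutorial files. Note that if mpd is running on only one node, mpiexec
will place every rank on that node, so the ring (mpdboot/mpdtrace)
needs to cover all nodes first.

```shell
# Sketch only: launching pmemd under MPICH2's mpd process manager.
# Prerequisite (run once): mpdboot -n <nodes> -f ~/mpd.hosts, then
# verify with mpdtrace that every node joined the ring.
AMBERHOME=${AMBERHOME:-/data/apps/amber}   # placeholder install path
NPROCS=4                                   # placeholder rank count
CMD="mpiexec -n $NPROCS $AMBERHOME/exe/pmemd -O \
-i md.in -o md.out -c inpcrd -p prmtop -r restrt -x mdcrd"
echo "$CMD"    # print the launch line; run it on the cluster itself
```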
Do you have any suggestions?
Thanks in advance
Sasha
On Wed, 2008-04-16 at 16:47 -0700, Ross Walker wrote:
>
>
> Hi Sasha,
>
> You could try PMEMD; it can help a bit with gigabit ethernet, but it
> will really depend on the size of the system you are running. The
> bigger the simulation, the better it is likely to scale. Note that
> PMEMD works quite well over a crossover cable - generally it is cheap
> gigabit switches that kill you.
>
> See if the switch supports flow control - if it does, turn it on,
> since that will help stop packet loss. Otherwise you are really just
> fighting the laws of physics - especially when you have multiple cores
> in one box but only one gigabit link per box. And even worse if you
> also use that link for NFS traffic.
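On the node side, the flow-control check might look like the following
(the interface name, and the Cisco port syntax in the comments, are
examples; whether pause frames actually help depends on the NIC and
switch models):

```shell
# Show current pause-frame (802.3x flow control) settings for a NIC.
# "eth0" is a placeholder; changing settings requires root.
ethtool -a eth0

# Enable receive and transmit pause frames on the node.
ethtool -A eth0 rx on tx on

# On the switch side (Cisco IOS, per port -- syntax varies by model):
#   interface GigabitEthernet0/1
#    flowcontrol receive on
```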
>
> Something like Myrinet or Infiniband is really the only option these
> days. Or get an xRAC allocation from NSF to use "real" ;-) machines.
>
> All the best
> Ross
>
>
> /\
> \/
> |\oss Walker
>
> | Assistant Research Professor |
> | San Diego Supercomputer Center |
> | Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
> | http://www.rosswalker.co.uk | PGP Key available on request |
>
> Note: Electronic Mail is not secure, has no guarantee of delivery, may
> not be read every day, and should not be used for urgent or sensitive
> issues.
>
> ______________________________________________________________
> From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu]
> On Behalf Of Sasha Buzko
> Sent: Wednesday, April 16, 2008 15:43
> To: amber_at_scripps.edu
> Subject: RE: AMBER: Amber9 with MPICH2 failure at runtime
>
>
>
>
> Thanks, Ross.
> I guess what caused my suspicions is that the serial version
> didn't blow up, but completed the trajectory with no incident.
> Anyway, I ran the 12A cutoff bounded version, and that
> worked.
> I do have a performance-related question, though. Our cluster
> was initially designed for docking jobs that are easily
> splittable and the nodes don't need to have fast interconnects
> since there is no cross-talk. So the connections are gigabit
> ethernet. I noticed that running the same example using
> sander.MPI on 4 nodes takes about twice the time it did on
> one. Do you think using PMEMD could somehow alleviate this
> issue or is it hopeless to run amber without infiniband
> interconnects?
>
> Thanks
>
> Sasha
>
>
>
> On Wed, 2008-04-16 at 15:05 -0700, Ross Walker wrote:
>
> >
> > Hi Sasha
> > This looks perfectly okay to me - read the next part of
> > the tutorial and it will explain why things blow up. The
> > serial one should blow up as well - it will just take
> > longer (in wallclock time) to reach that point.
> > All the best
> > Ross
> >
> > /\
> > \/
> > |\oss Walker
> >
> > | Assistant Research Professor |
> > | San Diego Supercomputer Center |
> > | Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
> > | http://www.rosswalker.co.uk | PGP Key available on request |
> >
> > Note: Electronic Mail is not secure, has no guarantee of
> > delivery, may not be read every day, and should not be used
> > for urgent or sensitive issues.
> >
> >
> >
> >
> > ____________________________________________________
> >
> > From: owner-amber_at_scripps.edu
> > [mailto:owner-amber_at_scripps.edu] On Behalf Of Sasha
> > Buzko
> > Sent: Wednesday, April 16, 2008 14:43
> > To: amber_at_scripps.edu
> > Subject: AMBER: Amber9 with MPICH2 failure at
> > runtime
> >
> >
> >
> > Hi,
> > I installed and tested MPICH2 on several cluster
> > nodes, as well as compiled amber9 with MKL support
> > and static linking. make test.parallel went fine,
> > with the exception of a couple of possible failures
> > (didn't follow up on those yet).
> > To test further, I used an example from an Amber
> > tutorial (piece of DNA). When run as a serial Amber,
> > all works fine and produces expected output. The
> > parallel version, however, fails even when run on a
> > single node (one entry in the mpd.hosts file). The
> > output is below. I did view the resulting trajectory
> > in Sirius, and it looked fine, except that it is
> > incomplete, as opposed to the serial version output.
> > Do you have any suggestions as to why this might be
> > happening in the parallel version?
> >
> > Thank you
> >
> > Sasha
> >
> >
> > [sasha_at_node6 test]$ mpiexec -n 4 $AMBERHOME/exe/sander.MPI -O \
> >     -i /data/apps/amber/test/polyAT_vac_md1_nocut.in \
> >     -o /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.out \
> >     -c /data/apps/amber/test/polyAT_vac_init_min.rst \
> >     -p /data/apps/amber/test/polyAT_vac.prmtop \
> >     -r /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.rst \
> >     -x /data/apps/amber/test/polyAT_vac_md1_nocut_mpich2.mdcrd
> > [cli_0]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
> > [cli_2]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2
> > [cli_3]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
> > [cli_1]: aborting job:
> > application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
> > Frac coord min, max: -2.111647559080276E-005
> > 0.999587572668685
> > The system has extended beyond
> > the extent of the virtual box.
> > Restarting sander will recalculate
> > a new virtual box with 30 Angstroms
> > extra on each side, if there is a
> > restart file for this configuration.
> > SANDER BOMB in subroutine Routine: map_coords
> > (ew_force.f)
> > Atom out of bounds. If a restart has been written,
> > restarting should resolve the error
> > [the same message is printed, interleaved, by the other ranks]
> > rank 2 in job 2 node6.abicluster_39939 caused
> > collective abort of all ranks
> > exit status of rank 2: return code 1
> > rank 0 in job 2 node6.abicluster_39939 caused
> > collective abort of all ranks
> > exit status of rank 0: killed by signal 9
> >
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu