AMBER Archive (2008)

Subject: RE: Fw: RE: AMBER: MKL libraries/Amber10

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Mon May 19 2008 - 16:25:45 CDT


Hi Cenk,

I think something is screwy with the runs you sent me, since the following
output does not make sense:

| Atom division among processors:
| 0 41778

|QMMM: Running QMMM calculation in parallel mode on 1 threads.
|QMMM: All atom division among threads:
|QMMM: Start End Count
|QMMM: Thread( 0): 1-> 41778 ( 41778)

|QMMM: Quantum atom + link atom division among threads:
|QMMM: Start End Count
|QMMM: Thread( 0): 1-> 66 ( 66)

     Sum of charges from parm topology file = -0.00000011
     Forcing neutrality...
| Running AMBER/MPI version on 1 nodes

This implies that this was running the MPI version of the code but only on a
single processor. Can you double-check the arguments to mpirun and make sure
it is picking up the number of processors correctly? The code should still
not segfault even when run with 'mpirun -np 1', but I'd like to know whether
the crash is specific to running the parallel code on a single processor or
occurs on any number of processors in parallel.
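
As a quick sanity check, you can first confirm how many processes your mpirun
actually launches before involving sander at all (the rank count of 4 below
is just an example):

mpirun -np 4 hostname
mpirun -np 4 $AMBERHOME/exe/sander.MPI -O

Each MPI rank should print one hostname line; if hostname prints only a
single line, the problem is in the mpirun setup rather than in sander.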

Note this bug is almost certainly occurring because the dspev routine is
failing to converge and the code is then falling back on the internal
diagonalizer as a fail-safe. This is MKL-version (and processor-type)
dependent, so you may find it works fine on one machine, say your desktop,
because the dspev routine runs fine, and yet it crashes on another machine,
say your cluster, because there the dspev routine exited with an error.
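
In the meantime, one possible workaround, assuming the diag_routine numbering
from the Amber manual (1 = internal diagonalizer, 2 = dspev), is to select the
internal routine directly in the &qmmm namelist so the failing
dspev-to-internal fallback path is never taken:

# Edit the QM/MM input in place (GNU sed syntax; 'mdin' is whatever
# your QM/MM input file is called) and rerun:
sed -i 's/diag_routine=2/diag_routine=1/' mdin
mpirun -np 2 $AMBERHOME/exe/sander.MPI -O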

Essentially it looks like the 'fail-safe' code is 'failing deadly' :-(.
Unfortunately, on all of my machines dspev works like a charm, so I cannot
reproduce the crash myself :-(.

All the best
Ross

> -----Original Message-----
> From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu] On Behalf
> Of Cenk (Jenk) Andac
> Sent: Monday, May 19, 2008 1:41 PM
> To: amber_at_scripps.edu
> Subject: RE: Fw: RE: AMBER: MKL libraries/Amber10
>
> Hi Ross,
>
> Thanks for responding. As you suggested, I have rerun the 1NLN_dspev test
> using the "verbosity=4" option in the input file and am sending you the
> output "mdout.1NLN_dspev.verb4" as an attachment. For this run I submitted
> a PBS job requesting 4 CPUs on 4 different nodes.
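>
> (For reference, a minimal PBS script of that shape, with an illustrative
> resource request rather than my exact script, would be:
>
> #PBS -l nodes=4:ppn=1
> cd $PBS_O_WORKDIR
> mpirun -np 4 $AMBERHOME/exe/sander.MPI -O
> )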
>
> In addition, I ran another 1NLN_dspev test using only the 2 CPUs of our
> server (that is, no external nodes were used for the second run), and it
> passed this time. Please see the second attachment
> "mdout.1NLN_dspev.server" for the output of the second test.
>
> regards,
>
> jenk.
>
>
>
> --- On Mon, 5/19/08, Ross Walker <ross_at_rosswalker.co.uk> wrote:
>
> > From: Ross Walker <ross_at_rosswalker.co.uk>
> > Subject: RE: Fw: RE: AMBER: MKL libraries/Amber10
> > To: amber_at_scripps.edu
> > Date: Monday, May 19, 2008, 12:19 PM
> > Hi Cenk,
> >
> > I need to see if I can reproduce this problem myself to work out what is
> > going on and fix the bug. I believe it likely stems from a variable
> > somewhere not being correctly copied to all of the nodes.
> >
> > In the meantime it would really help if you could try to get some more
> > debugging info out of the run for me.
> >
> > In a scratch directory, can you try manually running the dspev job for me?
> > The commands are effectively:
> >
> > cd ~/
> > mkdir foo
> > cd foo
> > cp $AMBERHOME/test/qmmm2/1NLN_test_diagonalizers/1NLN_15A_solv.prmtop ./prmtop
> > cp $AMBERHOME/test/qmmm2/1NLN_test_diagonalizers/1NLN_15A_solv_min.rst ./inpcrd
> >
> > Then create an mdin file with the following contents:
> >
> > foo
> >  &cntrl
> >   imin=0, irest=0, ntx=1,
> >   nstlim=4, dt=0.002,
> >   temp0=300.0, tempi=300.0,
> >   ntc=2, ntf=2,
> >   ntb=1,
> >   cut=8.0,
> >   ntt=1,
> >   ntpr=1,
> >   ifqnt=1
> >  /
> >  &ewald use_pme=1 /
> >  &qmmm
> >   iqmatoms=1585,1586,1587,1588,1589,1590,1591,1592,1593,1594,1595,1596,
> >            1597,1598,1599,1600,1601,1602,1603,1604,1605,1606,1607,1608,
> >            1609,1610,1611,1612,1613,1614,1615,1616,1617,1618,
> >            3348,3349,3350,3351,3352,3353,3354,3355,3356,3357,3358,3359,
> >            3360,3361,3362,3363,3364,3365,3366,3367,3368,3369,3370,3371,
> >            3372,3373,3374,3375,3376,
> >   qm_theory='AM1', qmcharge=0,
> >   qmshake=1,
> >   qmcut=8.0, qm_ewald=1, qm_pme=1,
> >   verbosity=4, writepdb=0, adjust_q=2, diag_routine=2,
> >  /
> >
> > Then can you try running this as you would in parallel, i.e. something
> > like:
> >
> > mpirun -np 2 $AMBERHOME/exe/sander.MPI -O
> >
> > Then please send me the mdout file so I can see exactly where it crashes.
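> >
> > (sander falls back on the default file names mdin, mdout, prmtop and
> > inpcrd, which is why -O alone is enough in that scratch directory; the
> > fully explicit equivalent would be:
> >
> > mpirun -np 2 $AMBERHOME/exe/sander.MPI -O -i mdin -o mdout -p prmtop -c inpcrd )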
> >
> > Thanks,
> >
> > All the best
> > Ross
> >
> > > -----Original Message-----
> > > From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu]
> > > On Behalf Of Cenk (Jenk) Andac
> > > Sent: Monday, May 19, 2008 9:48 AM
> > > To: amber_at_scripps.edu
> > > Subject: RE: Fw: RE: AMBER: MKL libraries/Amber10
> > >
> > > Hi Ross,
> > >
> > > I think I have a similar problem. Although the parallel static
> > > installation of AMBER10 went well (with all bugfixes applied) and it
> > > passed the MM tests, it failed in the QMMM tests at step 1NLN_dspev.
> > > Attached are my output files regarding the QMMM tests. I would
> > > appreciate it if you could let me know whether there is a workaround
> > > for the failure messages.
> > >
> > > cheers,
> > >
> > > jenk.
> > >
> > > --- On Mon, 5/19/08, Ross Walker <ross_at_rosswalker.co.uk> wrote:
> > >
> > > > From: Ross Walker <ross_at_rosswalker.co.uk>
> > > > Subject: RE: Fw: RE: AMBER: MKL libraries/Amber10
> > > > To: amber_at_scripps.edu
> > > > Date: Monday, May 19, 2008, 10:41 AM
> > > > Hi Francesco,
> > > >
> > > > You only need to apply the bugfixes once to the source tree. So I
> > > > assume that if you are using the same Amber installation (directory
> > > > structure) to compile both serial and parallel then you are fine.
> > > > However, I'd still like to try and track down what is wrong with the
> > > > QM/MM in parallel. Can you try running the test case again and see if
> > > > it crashes at the same point?
> > > >
> > > > If it does, can you then please send me the output file
> > > > $AMBERHOME/test/qmmm2/mdout.1NLN_dspev
> > > >
> > > > Thanks,
> > > > Ross
> > > >
> > > > > -----Original Message-----
> > > > > From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu]
> > > > > On Behalf Of Francesco Pietra
> > > > > Sent: Monday, May 19, 2008 8:25 AM
> > > > > To: amber_at_scripps.edu
> > > > > Subject: Re: Fw: RE: AMBER: MKL libraries/Amber10
> > > > >
> > > > > Hi:
> > > > >
> > > > > I applied bugfixes 1-3 (April 2008) before compiling the serial
> > > > > version, and all tests PASSED, including those for the QMMM part.
> > > > >
> > > > > Then, I compiled the parallel version by just running:
> > > > > ./configure ...
> > > > > make parallel
> > > > >
> > > > > I had not imagined that the parallel compilation should have been
> > > > > preceded by applying the bugfixes, as implied in your mail. Perhaps
> > > > > that requirement could be specified in the online manual.
> > > > >
> > > > > I can't try it immediately, as the machine is busy with a docking
> > > > > procedure.
> > > > >
> > > > > Thanks
> > > > > francesco pietra
> > > > >
> > > > > --- On Mon, 5/19/08, Gustavo Seabra <gustavo.seabra_at_gmail.com> wrote:
> > > > >
> > > > > > From: Gustavo Seabra <gustavo.seabra_at_gmail.com>
> > > > > > Subject: Re: Fw: RE: AMBER: MKL libraries/Amber10
> > > > > > To: amber_at_scripps.edu
> > > > > > Date: Monday, May 19, 2008, 7:18 AM
> > > > > > > With the immediately subsequent QMMM tests, after some tests
> > > > > > > PASSED, there was a problem with mpirun. Someone may be
> > > > > > > interested in looking at the output of the compilation
> > > > > > > (attached). In this regard, the (renamed) config file is also
> > > > > > > attached.
> > > > > >
> > > > > > You need to apply the bugfixes before compiling. Specifically, your
> > > > > > problem with the QM/MM testing should be solved by bugfix 3. See:
> > > > > > http://www.ambermd.org/bugfixes10.html
> > > > > >
> > > > > > Gustavo.
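> > > > > >
> > > > > > (A sketch of the usual patching procedure, assuming the combined
> > > > > > patch file from the page above is saved as bugfix.all:
> > > > > >
> > > > > > cd $AMBERHOME
> > > > > > patch -p0 -N < bugfix.all
> > > > > >
> > > > > > then rebuild both the serial and parallel versions from the
> > > > > > patched tree.)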
> > > > > >

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
      to majordomo_at_scripps.edu