AMBER Archive (2005)

Subject: AMBER: Sander Restart Problem

From: Garhan Attebury (attebury_at_cse.unl.edu)
Date: Tue May 24 2005 - 10:07:49 CDT


I have been restarting many sander jobs lately (some failed, some simply
exceeded their allotted queue time) and have noticed, when combining the
.mdcrd files with ptraj, that some of my expected data is missing. More
specifically, I use ptraj to combine the .mdcrd files and output a set of
PDB files, as in the following ptraj input:

  trajin ../results/ml2_14_labeled_md2.mdcrd
  trajin ../results/ml2_14_labeled_md2_rst1.mdcrd
  trajin ../results/ml2_14_labeled_md2_rst2.mdcrd
  trajin ../results/ml2_14_labeled_md2_rst3.mdcrd
  trajout ml2_14_labeled_coords pdb
  center :1-30
  image familiar
  strip :WAT
  go

I'm expecting 10000 pdb files to be output with this, and that's
exactly what I get when the first md2 run runs to completion without
needing restarts (i.e., I only have one trajin line). When I have to
restart and combine as shown above, I get fewer than 10000 pdb files (for
this particular example I got 9905 pdb files). The relevant parts of
the starting md2 input are:

  irest = 1, ntx = 7,
  nstlim = 1000000, dt = 0.002,
  ntpr = 100, ntwx = 100, ntwr = 1000

As you can see, this should give me 10000 points (or pdb files in my
case), since nstlim/ntwx = 1000000/100. If, for example, this first md2
run fails at 394.600ps (NSTEP=187300), I find the top of the restart file
reads 0.3940000E+03 (or just 394.000ps). This is exactly what I'd expect,
as 394.000 is the last multiple of ntwr=1000 before 394.600ps. Supposing
I want a total time of 2ns, I compute the steps already done as
394/0.002 = 197000. Then 1000000 - 197000 = 803000, so my new nstlim
should be 803000 steps to get me up to 2ns. The new md2_rst1.in file is
then:

  irest = 1, ntx = 7,
  nstlim = 803000, dt = 0.002,
  ntpr = 100, ntwx = 100, ntwr = 1000
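
The step arithmetic above can be sketched in a few lines of Python. This is
only an illustration: the function names are hypothetical, and the numbers
are the ones quoted in this message. It computes the remaining steps two
ways, once from the restart time (as done above, which assumes the time
counter started at 0 ps) and once from the NSTEP value reported at the
failure. The two results differ by 10000 steps (20 ps), which may indicate
that the md2 run did not start at time zero.

```python
def expected_frames(nstlim, ntwx):
    """Number of coordinate frames written for nstlim steps."""
    return nstlim // ntwx


def remaining_from_time(restart_time_ps, dt_ps, total_steps):
    """Remaining steps, ASSUMING the time counter started at 0 ps."""
    steps_done = round(restart_time_ps / dt_ps)
    return total_steps - steps_done


def remaining_from_nstep(crash_nstep, ntwr, total_steps):
    """Remaining steps, counted from the last restart write before the crash."""
    steps_saved = (crash_nstep // ntwr) * ntwr
    return total_steps - steps_saved


print(expected_frames(1000000, 100))                 # 10000
print(remaining_from_time(394.0, 0.002, 1000000))    # 803000
print(remaining_from_nstep(187300, 1000, 1000000))   # 813000
```

If the step-based count is the right one, each restart set up from the time
value would come up short by the same offset.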

I start the new sander job doing:

  sander -O -i ml2_14_labeled_md2_rst1.in \
         -o ml2_14_labeled_md2_rst1.out \
         -p ml2_14_labeled_wat.prmtop \
         -c ml2_14_labeled_md2.rst \
         -r ml2_14_labeled_md2_rst1.rst \
         -x ml2_14_labeled_md2_rst1.mdcrd

... and everything runs along as I would expect. Even the first entry
in the output file for the new restart indicates the simulation is at
time 394.200ps (right where it should be). However, after repeating
this restart process a few more times, I find that combining all the
.mdcrd files at the end does NOT give me the complete 2ns and 10000
pdb files; instead I get 9905 pdb files. Also, when I use ptraj to
simply combine the files and view the simulation in VMD, I can see
"jumps/jerks" at the points where I restarted.
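
One thing worth checking when concatenating segments is whether a failed
segment contains frames written after its last restart write, since the
restarted run would simulate and write that stretch again. The sketch below
uses the NSTEP, ntwr and ntwx values quoted above; the function name is
hypothetical, and it assumes every frame up to the failure was flushed to
disk.

```python
def frames_in_failed_segment(crash_nstep, ntwr, ntwx):
    """Return (frames_written, frames_before_last_restart) for a failed run."""
    last_restart_step = (crash_nstep // ntwr) * ntwr
    frames_written = crash_nstep // ntwx
    frames_to_keep = last_restart_step // ntwx
    return frames_written, frames_to_keep


written, keep = frames_in_failed_segment(187300, 1000, 100)
print(written, keep, written - keep)  # 1873 1870 3
```

With these numbers the last 3 frames of the first .mdcrd overlap the start
of the restarted run. ptraj's trajin accepts optional start and stop frame
arguments after the filename, so the overlap could be trimmed when the
segments are read in.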

Is my method of restarting correct? Anyone have any ideas why this is
happening? My goal is to be able to do these restarts but still get the
same data (10000 pdb files) as if the initial md2 run never failed and
just ran to completion itself.

Also, I know ntx=7 is "old" and ntx=5 is the same thing, but I doubt
that's causing the problems. Finally, this is not an isolated
incident. I have had this exact problem with many other simulations
using this same method.

Thanks for any insight!

- Garhan Attebury