AMBER Archive (2009)
Subject: Re: [AMBER] Error in PMEMD run
From: Robert Duke (rduke_at_email.unc.edu)
Date: Fri May 08 2009 - 20:26:37 CDT
The random seed needs to change with each restart of the simulation - i.e.,
you don't want to use the same sequence of random numbers twice, because
that is effectively not random. I will defer to others, like Dave Case, to
give you better pointers on langevin dynamics; if I make another
contribution in this area, it will probably be to fix the poor scaling for
the random number generation (seems worthwhile to me; just have to get to
it). Anyway, good luck with your simulations, and hopefully the performance
numbers you get with the standard benchmarks will give you a better idea of
what is happening with your machines (I remain somewhat disturbed that the
value at 32 nodes deteriorates so badly - I do suspect something else is
going on there, but there are too many unknowns for me to be sure).
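
One way to follow the advice about changing the seed at every restart can be sketched in Python: draw a fresh seed for each restart segment and write it into that segment's input as ig. This is only an illustration; the helper names (make_seeds, mdin_thermostat_line) and the seed upper bound are assumptions made for the example, and only ntt, gamma_ln and ig come from this thread.

```python
import secrets


def make_seeds(n_segments, max_seed=999_999):
    """Return one distinct positive integer seed per restart segment.

    max_seed is an arbitrary illustrative bound, not an AMBER limit.
    """
    seeds = set()
    while len(seeds) < n_segments:
        # secrets draws from the OS entropy source, so rerunning this
        # script does not reproduce the same seeds.
        seeds.add(secrets.randbelow(max_seed) + 1)
    return sorted(seeds)


def mdin_thermostat_line(seed, gamma_ln=2.0):
    """Format the Langevin thermostat settings for one segment's input."""
    return f"ntt=3, gamma_ln={gamma_ln}, ig={seed},"


if __name__ == "__main__":
    for segment, seed in enumerate(make_seeds(5), start=1):
        print(f"segment {segment}: {mdin_thermostat_line(seed)}")
```

Each printed line would be pasted into (or templated into) the input file of the corresponding restart, so that no two segments share a random number sequence.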
Best Regards - Bob
----- Original Message -----
From: "Marek Malý" <maly_at_sci.ujep.cz>
To: "AMBER Mailing List" <amber_at_ambermd.org>
Sent: Friday, May 08, 2009 9:05 PM
Subject: Re: [AMBER] Error in PMEMD run
it seems to me that you never sleep :))
Anyway, thanks again for your quick answer!
I think I have already taken up plenty of your time, so let me add just a few
last remarks regarding the ntt setting, since it seems to me to be the most
OK, if I understand correctly, ntt = 1 could be fine for explicit solvent
(this is probably not true for implicit solvent; at least the Amber10 manual
has a warning about it).
With ntt = 3, gamma_ln, which defines the collision frequency, is probably
also important.
If I understand correctly, ntt = 3 is probably the slightly more reliable
choice from the physical point of view, but only with a proper gamma_ln value
(which could be system dependent). The important thing is to change the random
seed (the ig value) periodically. How often to change ig probably again depends
on the gamma_ln setting (is there a formula that gives a recommended frequency
for changing ig as a function of gamma_ln?).
If I understand correctly, the artifacts mentioned appear because of the finite
period of the pseudo-random number generator used. On the other hand, the cost
of a possibly more reliable/sophisticated thermostat
(ntt=3 versus ntt=1) is worse scaling (on 32, 64, 128 ... CPUs).
Am I right?
OK, just one last question, in case you know ...
We are saying here that ntt=3 could be more reliable than ntt = 1 ... but
what are the criteria for this judgement? In other words, let's say I have a
molecular system XY and I would like to run some tests to learn which
temperature control is best for this system, purely from the point of view of
physical reliability (I don't care about CPU time for now).
So can you recommend some tests which could, for example, help me estimate
the optimal "gamma_ln" if I choose ntt = 3, or the optimal "vrand" if I choose
ntt = 2? Tests which would finally help me
to choose between ntt = 1, ntt = 2 with vrand_optimal, ntt = 3 with
If you don't know, please do not waste time on this; I can probably find some
specific answers in the articles cited in the Amber10 manual.
So thank you again for everything!
On Sat, 09 May 2009 02:01:23 +0200, Robert Duke <rduke_at_email.unc.edu> wrote:
> Hi Marek,
> I glanced at the diffs but I will let Ross or somebody more used to
> looking at the strange things that may happen in the full suite comment
> on them. If pmemd passed all its tests, then it should be good. At 16
> processors, I guess I am not greatly surprised that there are not huge
> differences in performance - you expect things to be hitting you more as
> you go to 32, 48, 64... So the biggest difference you see is the ntt 3 vs ntt 1,
> and that I would expect. Where you will see the cut make more of a
> difference, honestly, is at relatively low processor count. What happens
> is that the recip space and data distrib costs start going up as you
> scale, while the direct space costs scale reasonably. I think the less
> frequent trajectory running slower is a matter of your test times being
> too short. Also, is anything else running on this cluster? Any chance,
> whatsoever, that there are other jobs running on the actual nodes you are
> using? That also makes things sort of poor on performance and unreliable.
> ON ntt 3 vs ntt 1. Well, I am working with a bunch of guys that still
> use ntt 1. There are theoretical objections that can be raised about the
> quality of results with this thermostat. With ntt 2 or 3, if you don't
> change the random seed at each restart, then your results can have
> serious artifacts (another point of some contention). So all sorts of
> wild things were happening, it seemed to me, when these thermostats were
> first introduced, (3 in particular), but they were reputed to equilibrate
> temperature better. They probably do; you just have to be sure to use a
> different random seed with each restart. I have steered clear of them
> because all of our work went okay with the older ntt 1, because there was
> this period of bad results, probably due to not resetting the random
> seed, and finally, because if you really try to scale up, the random
> number generation methods will start eating up more and more of your time
> and keep you from scaling very well. I expect at 32 cpu it is more
> noticeable. It is not a huge effect probably until 64-128+ or so, but
> that is an area that is interesting to me. So that's the history;
> probably if you don't routinely want to run on a ton of cpu's and change
> the seed religiously, there is virtue in ntt 3, but many usec has been
> piled up with ntt 1 over the last decade. Bear in mind, I am more of a
> computer guy than an MD guy, though I am trained in both computer science
> and the sciences; still my focus in all this is more providing the tools
> so you all can do the simulations, not in doing them myself.
> Okay, last point. Please just benchmark some with factor ix, and see how
> what you get compares to what other folks are getting on their clusters.
> So the goal here is to try to sort out if there are any problems with
> your hardware or software in the performance area. Without comparing
> something for which we have data elsewhere, we can't really tell...
> Best Regards - Bob
> ----- Original Message ----- From: "Marek Malý" <maly_at_sci.ujep.cz>
> To: "AMBER Mailing List" <amber_at_ambermd.org>
> Sent: Friday, May 08, 2009 7:37 PM
> Subject: Re: [AMBER] Error in PMEMD run
> Dear Bob,
> thanks a lot for your analysis!
> I ran some tests (PMEMD ONLY) to check your hypothesis.
> They are the same short test as the previous ones, with the same input files,
> 1000 steps.
> In each additional test I changed just one parameter (from my original
> setting) to see its influence on CPU time. As for the node/CPU setup, I
> tested only
> one case: 2 nodes x 8 CPUs = a 16-processor job where I use all 8
> CPUs per node.
> my original setting   : 85 s
> cut = 8               : 84 s
> ntpr, ntwx = 1000     : 87 s (strange but true :)) )
> ntt = 1               : 78 s
> ntt = 2, vrand = 1000 : 83 s
> ntt = 3, gamma_ln = 0 : 82 s
> t0 = 300              : 87 s
> As you can see, there are only small changes compared to my original
> setting, which is listed below (in your last reply).
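
To put the timings listed above on a common scale, the following Python sketch expresses each one as a signed percentage change relative to the 85 s baseline. The numbers are copied from the list; the dictionary and function names are just illustrative choices.

```python
BASELINE_S = 85.0  # "my original setting"

# Wall-clock times in seconds for each single-parameter change, as reported.
timings_s = {
    "cut = 8": 84.0,
    "ntpr, ntwx = 1000": 87.0,
    "ntt = 1": 78.0,
    "ntt = 2, vrand = 1000": 83.0,
    "ntt = 3, gamma_ln = 0": 82.0,
    "t0 = 300": 87.0,
}


def percent_change(time_s, baseline_s=BASELINE_S):
    """Signed change relative to the baseline; negative means faster."""
    return 100.0 * (time_s - baseline_s) / baseline_s


if __name__ == "__main__":
    for label, time_s in timings_s.items():
        print(f"{label:<24s} {time_s:5.0f} s  {percent_change(time_s):+5.1f} %")
```

With only 1000 steps per run, differences of a few percent are hard to separate from run-to-run noise; the ntt = 1 case is the only one that stands out.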
> Of course, the question is how the influence of the tested parameters changes
> for other node/CPU configurations (4/8 cpu, 4/4 cpu ...) or
> in a longer test, like the 5000 steps which you recommended ...
> Anyway, in this short test I originally set ntpr and ntwx to 200, but of course
> in a real simulation they are much bigger (5000).
> Regarding ntt, it seems to me that you do not recommend ntt=3 (at least for
> explicit solvent), so what is your favourite choice
> for this type of simulation ?
> OK, and now back to the reliability question.
> I ran all the tests with my "ifort 11" compilation of Amber and the
> 10.1.019 compilation of PMEMD, which just uses the new cc and MKL libs.
> Here are the results:
> #1 - AmberTools - I think OK
> #2 - AmberSerial_MM - I think OK
> #3 - AmberSerial_QMMM - I think OK
> (please see the attached files)
> #4 - AmberParallel_MM
> I made it on full node = 8 single cpus
> Here is my script to run this test:
> mpdboot -f ~/.mpd11.hosts -n $NODES
> export DO_PARALLEL="mpiexec -np 8"
> make test.parallel.MM
> A big part of the test passed without any problems, but after a while it got
> stuck. I waited about 45 min; during this period all the processors were
> 100% busy
> the whole time, see this "top" listing:
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 472 mmaly 20 0 153m 14m 5672 R 101 0.1 43:25.90 sander.MPI
> 468 mmaly 20 0 153m 14m 5668 R 100 0.1 46:12.26 sander.MPI
> 469 mmaly 20 0 157m 16m 7728 R 100 0.1 45:38.51 sander.MPI
> 470 mmaly 20 0 153m 14m 5672 R 100 0.1 46:12.23 sander.MPI
> 467 mmaly 20 0 157m 16m 7736 R 100 0.1 45:57.42 sander.MPI
> 473 mmaly 20 0 159m 16m 7728 R 100 0.1 46:03.65 sander.MPI
> 466 mmaly 20 0 153m 14m 5664 R 99 0.1 45:57.73 sander.MPI
> 471 mmaly 20 0 157m 16m 7748 R 90 0.1 45:24.29 sander.MPI
> 512 mmaly 20 0 10740 1480 1032 R 0 0.0 0:01.18 top
> cd PIMD/part_cmd_water/restart && ./Run.cmdyn
> diffing cmd.out.save with cmd.out
> cd PIMD/part_rpmd_water && ./Run.rpmd
> diffing spcfw_rpmd.top.save with spcfw_rpmd.top
> diffing spcfw_rpmd.xyz.save with spcfw_rpmd.xyz
> diffing spcfw_rpmd.out.save with spcfw_rpmd.out
> cd ti_mass/pent_LES_PIMD && ./Run.pentadiene
> This test not set up for parallel
> cannot run in parallel with #residues < #pes
> make: Leaving directory `/home/mmaly/_applications/amber/test'
> cd PIMD/full_cmd_water/equilib && ./Run.full_cmd
> Testing Centroid MD <<<< - HERE IT GOT STUCK
> so I had to kill this process, since I do not believe that this test should
> take longer on 8 CPUs than just several minutes ...
> Anyway, the relevant TEST_FAILURES file was created (please see attached).
> #5 - AmberParallel_QMMM
> This test crashed very soon, as you can see in the listing below:
> export TESTsander=/opt/amber/exe/sander.MPI; make test.sander.QMMM
> make: Entering directory `/home/mmaly/_applications/amber/test'
> cd qmmm2/xcrd_build_test/ && ./Run.oct_nma_imaged
> diffing mdout.oct_nma_imaged.save with mdout.oct_nma_imaged
> cd qmmm2/xcrd_build_test/ && ./Run.oct_nma_noimage
> diffing mdout.oct_nma_noimage.save with mdout.oct_nma_noimage
> cd qmmm2/xcrd_build_test/ && ./Run.ortho_qmewald0
> * NB pairs 145 185645 exceeds capacity ( 185750) 3
> SIZE OF NONBOND LIST = 185750
> SANDER BOMB in subroutine nonbond_list
> Non bond list overflow!
> check MAXPR in locmem.f
> [cli_3]: aborting job:
> application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
> rank 3 in job 3 enode11_56157 caused collective abort of all ranks
> exit status of rank 3: return code 1
> ./Run.ortho_qmewald0: Program error
> make: *** [test.sander.QMMM] Error 1
> make: Leaving directory `/home/mmaly/_applications/amber/test'
> make: *** [test.sander.QMMM.MPI] Error
> There is some problem with MAXPR, but as I learned (after looking at the file
> locmem.f), this is not a typical constant but a variable which
> is evaluated by the program itself, or am I wrong?
> Anyway, can I do something to prevent this error and get through the whole
> AmberParallel_QMMM test?
> #6 PMEMD test
> Absolutely without problems. Everything passed after a while and no
> TEST_FAILURES file was created.
> Bob, I would be very grateful if you could look into the attached files and
> at least indicate :)) whether my installation
> seems to be reliable or whether it would be better to do a complete
> reinstallation using your recommended ifort 10.1.021 ...
> Thank you very much in advance!
> On Fri, 08 May 2009 21:44:08 +0200, Robert Duke <rduke_at_email.unc.edu> wrote:
>> Ah, now we are getting somewhere!
>> A 60000 atom system - that is fine.
>> Now, let's look at the mdin file you sent:
>> heat ras-raf
>> cut=10.0, ntb=2, ntp=1, taup=2.0,
>> ntpr=200, ntwx=200,
>> ntt=3, gamma_ln=2.0,
>> Here, things get interesting. Let's go through the potential problems
>> in the order they occur:
>> cut=10.0 - This is a really big cutoff for pme, generally unnecessary.
>> The default cut is 8 angstrom; you will run roughly twice as slow for
>> your direct space calcs with a cutoff this big. Not really a great idea
>> (some folks go to 9 angstrom to get a longer vdw interaction; with pmemd
>> you can actually just increase the vdw while leaving the electrostatic
>> cut at 8 and get better performance). Now the other thing - if you are
>> having trouble with scaling, larger cutoffs will slow you down even more
>> because there is more information interchange.
>> ntwx=200 - You are dumping a trajectory snapshot every 0.4 psec - this
>> is not outrageous, but is probably also a bit of overkill. You could
>> probably print every psec and be fine (ntwx=500). If your disk is at
>> all slow, this will hurt. It sounded like what you were doing on the
>> disks is okay, as long as there is not some screwy nfs mount issue
>> (sounds like there is not).
>> ntt=3 - AhHa! This is a langevin thermostat. There is a huge
>> inefficiency here, associated with random number generation. I don't
>> know how expensive it gets, but it does get expensive, and I view ntt 3
>> as not a production tool for this reason. Others undoubtedly disagree,
>> as lots of folks like this thermostat. BUT the way it is currently
>> implemented, it really kills scaling.
>> temp0=310. Additional motion at higher temp. More listbuilds. Less
>> efficient (but you are driving the dynamics further in less time).
>> Probably a very small effect.
>> nstlim=1000 - PMEMD is still adjusting the run parameters out to roughly
>> step 4000. So for higher scaling stuff, I typically do about 5000 steps
>> minimum to see what is going on.
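
Two of the figures in the list above can be checked with simple arithmetic, sketched here in Python. The "roughly twice as slow" estimate for cut=10 follows if the number of direct-space pairs grows with the volume of the cutoff sphere (an assumption of uniform density), and the 0.4 psec snapshot interval for ntwx=200 implies a 0.002 ps time step. That time step is inferred from the message, not shown in the quoted mdin, and the function names are illustrative.

```python
def direct_space_cost_ratio(cut_new, cut_ref):
    """Ratio of pair counts for two cutoffs, assuming uniform density.

    The number of neighbours inside a sphere grows with its volume,
    i.e. with the cube of the cutoff radius.
    """
    return (cut_new / cut_ref) ** 3


def snapshot_interval_ps(ntwx, dt_ps=0.002):
    """Simulated time between trajectory frames.

    dt_ps = 0.002 is inferred from "ntwx=200 ... every 0.4 psec";
    the time step itself is not shown in the quoted mdin.
    """
    return ntwx * dt_ps


if __name__ == "__main__":
    print(f"cut 10 vs 8: {direct_space_cost_ratio(10.0, 8.0):.2f}x pairs")
    for ntwx in (200, 500):
        print(f"ntwx={ntwx}: one frame every {snapshot_interval_ps(ntwx):.1f} ps")
```

The cubic estimate gives a factor of about 1.95 for the pair count, consistent with the "roughly twice" quoted for the direct-space work.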
>> - This stuff is at least some of the reason you are not scaling as well
>> as one might hope... The devil is in the details, and he can be a real
>> Best Regards - Bob
>> ----- Original Message ----- From: "Marek Malý" <maly_at_sci.ujep.cz>
>> To: "AMBER Mailing List" <amber_at_ambermd.org>
>> Sent: Friday, May 08, 2009 3:10 PM
>> Subject: Re: [AMBER] Error in PMEMD run
>> Hi Bob,
>> my test system is composed of a generation-4 PPI dendrimer + explicit water,
>> about 60000 atoms in total.
>> Here are the input files for testing:
>> I know it is not a big system, but for a benchmark on 16-32 CPUs it is OK, I
>> think, or am I wrong?
>> For testing I used just 1000 steps from the equilibration phase (NPT
>> simulation, see equil_DEN_PPIp_D.in).
>> Regarding the disk question:
>> Each node has its own local hard drive (SATA 250 GB), so I run my jobs from
>> a node listed in the relevant .mpd.hosts file.
>> Let's say that if I want to run my job on 2 nodes (for example 11 and 12),
>> I go to the local disk of node 11 and run the job from it.
>> These local disks are not shared yet.
>> Regarding MPI, we are using Intel MPI (currently version 3.2.0.011).
>> Here are my config commands for compiling parallel Amber/PMEMD:
>> ./configure_amber -intelmpi ifort (Parallel Amber)
>> ./configure linux_em64t ifort intelmpi (PMEMD)
>> We have 14 nodes in total, each node = 2 x Intel Xeon Quad-core 5365 (3.00
>> GHz) = 8 single CPUs
>> Nodes are connected using "Cisco InfiniBand".
>> So that's all I can say about my test system and our cluster.
>> Thanks for your time !
>> On Fri, 08 May 2009 20:24:35 +0200, Robert Duke <rduke_at_email.unc.edu> wrote:
>>> Yes, Ross makes points I was planning on making next. We need to know
>>> your benchmark. You should be running something like JAC, or even
>>> better yet, factor ix, from the benchmarks suite. Then you should
>>> convert your times to nsec/day and compare to some of the published
>>> values at www.ambermd.org to have a clue as to just how good or bad
>>> you are doing. Once you have a reasonable benchmark (not too small,
>>> balanced i/o, not asking for extra features that are known not to
>>> scale, etc etc), then we can look for other problems. Given a GOOD
>>> infiniband setup (high bandwidth, configured correctly, balance
>>> between pci express and the infiniband hca's, well-scaled infiniband
>>> switch layout, no noise from loose cables, etc etc etc), then the next
>>> likely source of grief is the disk. Are you all perhaps using an
>>> nfs-mounted volume, and even worse, one volume, not a parallel file
>>> system, being written to by multiple running jobs? Bad idea.
>>> Parallel jobs will hang like crazy waiting for the master to do disk
>>> i/o. Is mpi really set up correctly? The only way you know is if the
>>> setup has passed other benchmarks (I typically tell by comparison of
>>> pmemd on the candidate system to other systems, but believe me, mpi
>>> can really be screwed up pretty easily). Which mpi? OpenMPI is known
>>> to be bad with infiniband (I don't know if it is actually "good" with
>>> anything). Intel mpi is supposed to be good, but I have never tried
>>> to jump through all the configuration hoops. MVAPICH is pretty
>>> standard; once again, though, because I don't admin a system of this
>>> type, I have no idea how hard it is to get everything right. I am
>>> really sorry you are having so much "fun" with all this; I know it
>>> must be frustrating, but there is a reason bigger clusters get run by
>>> staff. By the way, how big is the cluster?
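
The conversion to nsec/day suggested above is a one-line calculation; a minimal Python sketch follows. The example call uses the 1000-step, 85 s run reported elsewhere in this thread, together with a 0.002 ps time step. That time step is an assumption inferred from the "ntwx=200 ... every 0.4 psec" remark, so the resulting figure should be read as illustrative only.

```python
SECONDS_PER_DAY = 86_400


def ns_per_day(n_steps, dt_ps, wall_seconds):
    """Simulated nanoseconds per day of wall-clock time."""
    simulated_ns = n_steps * dt_ps / 1000.0  # 1 ns = 1000 ps
    return simulated_ns * SECONDS_PER_DAY / wall_seconds


if __name__ == "__main__":
    # 1000 steps in 85 s is the 16-CPU timing reported in this thread;
    # dt_ps = 0.002 is an assumption (see the paragraph above).
    print(f"{ns_per_day(1000, 0.002, 85.0):.2f} ns/day")
```

A number obtained this way is only comparable with published values when it is measured on the same benchmark system (for example JAC or factor ix), since throughput depends strongly on system size and run parameters.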
>>> Best Regards - Bob
>>> ----- Original Message ----- From: "Ross Walker"
>>> To: "'AMBER Mailing List'" <amber_at_ambermd.org>
>>> Sent: Friday, May 08, 2009 2:11 PM
>>> Subject: RE: [AMBER] Error in PMEMD run
>>> Hi Marek,
>>> I don't think I've seen anywhere what the actual simulation you are
>>> running is. This will have a huge effect on parallel scalability. With
>>> and a 'reasonable' system size you should easily be able to get beyond
>>> nodes. Here are some numbers for the JAC NVE benchmark from the suite
>>> provided on http://ambermd.org/amber10.bench1.html
>>> This is for NCSA Abe which is Dual x Quad core clovertown (E5345
>>> 2.33GHz so
>>> very similar to your setup) and uses SDR infiniband.
>>> Using all 8 processors per node (time for benchmark in seconds):
>>> 8 ppn 8 cpu 364.09
>>> 8 ppn 16 cpu 202.65
>>> 8 ppn 24 cpu 155.12
>>> 8 ppn 32 cpu 123.63
>>> 8 ppn 64 cpu 111.82
>>> 8 ppn 96 cpu 91.87
>>> Using 4 processors per node (2 per socket):
>>> 4 ppn 8 cpu 317.07
>>> 4 ppn 16 cpu 178.95
>>> 4 ppn 24 cpu 134.10
>>> 4 ppn 32 cpu 105.25
>>> 4 ppn 64 cpu 83.28
>>> 4 ppn 96 cpu 67.73
>>> As you can see it is still scaling to 96 cpus (24 nodes at 4 threads per
>>> node). So I think you must either be running an unreasonably small
>>> system to
>>> expect scaling in parallel or there is something very wrong with the
>>> of your computer.
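
The scaling in the two JAC tables above can be summarised as speedup and parallel efficiency relative to the 8-CPU run. The Python sketch below does that with the quoted times; the function name and the choice of the 8-CPU run as the reference are illustrative assumptions, not something taken from the benchmark page.

```python
# JAC NVE wall-clock times in seconds, keyed by CPU count, as quoted above.
TIMES_8PPN = {8: 364.09, 16: 202.65, 24: 155.12, 32: 123.63, 64: 111.82, 96: 91.87}
TIMES_4PPN = {8: 317.07, 16: 178.95, 24: 134.10, 32: 105.25, 64: 83.28, 96: 67.73}


def scaling_table(times):
    """Return (cpus, speedup, efficiency) rows relative to the smallest run."""
    base_cpus = min(times)
    base_time = times[base_cpus]
    rows = []
    for cpus in sorted(times):
        speedup = base_time / times[cpus]
        # Ideal speedup is cpus / base_cpus; efficiency is the fraction achieved.
        efficiency = speedup / (cpus / base_cpus)
        rows.append((cpus, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for name, times in (("8 ppn", TIMES_8PPN), ("4 ppn", TIMES_4PPN)):
        print(name)
        for cpus, speedup, efficiency in scaling_table(times):
            print(f"  {cpus:3d} cpu  speedup {speedup:4.2f}  efficiency {efficiency:4.0%}")
```

The times keep dropping out to 96 CPUs in both series, while the efficiency relative to 8 CPUs falls to roughly 0.33 (8 ppn) and 0.39 (4 ppn), which is the sense in which the benchmark is "still scaling".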
>>> All the best
>>>> -----Original Message-----
>>>> From: amber-bounces_at_ambermd.org [mailto:amber-bounces_at_ambermd.org] On
>>>> Behalf Of Marek Malý
>>>> Sent: Friday, May 08, 2009 10:58 AM
>>>> To: AMBER Mailing List
>>>> Subject: Re: [AMBER] Error in PMEMD run
>>>> Hi Gustavo,
>>>> thanks for your suggestion, but we have only 14 nodes in our cluster (each
>>>> node = 2 x Xeon Quad-core 5365 (3.00 GHz) = 8 single CPUs per node,
>>>> connected with "Cisco InfiniBand").
>>>> If I allocate 8 nodes and use just 2 CPUs per node for one of my jobs, it
>>>> means that 8x6 single CPUs = 48 will be wasted. In that
>>>> case I am sure that my colleagues will kill me :)) Moreover, I do not
>>>> expect that the 8/2CPU combination will have significantly better
>>>> performance than 2/8CPU, at least in the case of PMEMD.
>>>> But anyway, thank you for your opinion/experience !
>>>> On Fri, 08 May 2009 19:28:35 +0200, Gustavo Seabra
>>>> <gustavo.seabra_at_gmail.com> wrote:
>>>> >> the best performance I have obtained in case of using combination
>>>> >> nodes
>>>> >> and 4 CPUs (from 8) per node.
>>>> > I don't know exactly what you have in your system, but I gather you
>>>> > are using 8core-nodes, and from it you got the best performance by
>>>> > leaving 4 cores idle. Is that correct?
>>>> > In this case, I would suggest that you go a bit further, and also try
>>>> > using only 1 or 2 cores per node, i.e., leaving the remaining 6-7
>>>> > cores idle. So, for 16 MPI processes, try allocating 16 or 8 nodes.
>>>> > (I didn't see this case in your tests)
>>>> > AFAIK, the 8-core nodes are arranged in 2 4-core sockets, and the
>>>> > communication between cores, which was already bad within the 4 cores of
>>>> > the same socket, gets even worse when you need to get information
>>>> > between two sockets. Depending on your system, if you send 2 processes
>>>> > to the same node, it may put both in the same socket or automatically
>>>> > split them one for each socket. You may also be able to tell it to make
>>>> > sure that this gets split into 1 process per socket. (Look into the
>>>> > mpirun flags.) From the tests we've run on those kinds of machines, we
>>>> > do get the best performance by leaving ALL BUT ONE core idle in each
>>>> > socket.
>>>> > Gustavo.
>>>> > _______________________________________________
>>>> > AMBER mailing list
>>>> > AMBER_at_ambermd.org
>>>> > http://lists.ambermd.org/mailman/listinfo/amber