AMBER Archive (2007)
Subject: Re: AMBER: pmemd: same rst, different result?
From: Robert Duke (rduke_at_email.unc.edu) 
Date: Mon Dec 17 2007 - 09:29:15 CST
 
 
 
 
I find it sort of interesting that this sort of question keeps coming up.
It worries me, though, that folks have not thought the points through on
this, and we have stated many times that there is an inherent indeterminacy
in longer (> 500 step) parallel MD runs that, while annoying, IS OF NO
SIGNIFICANCE WHATSOEVER.  The archive article linked below does not get the
explanation completely correct, and leads off with a statement that the
Amber MD code is single precision as an explanation - patently incorrect, I
assure you.  The code is ALL double precision - and double precision carries
between 16 and 17 decimal digits (the Intel and IBM chips actually have an
internal 80-bit representation that increases precision, as long as you stay
in fp registers, by something like a factor of a thousand (off the top of my
head)).  So this means that every fp arithmetic operation in Amber is off by
roughly 1 part in 10**16.  This is fairly high precision...
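
If you want to see those limits for yourself, a tiny standalone C snippet
(my own illustration, not anything from the Amber source) will print them:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* DBL_EPSILON is the gap between 1.0 and the next representable
           double; DBL_DIG is the number of decimal digits guaranteed to
           survive a decimal -> double -> decimal round trip. */
        printf("DBL_EPSILON = %g\n", DBL_EPSILON);  /* about 2.2e-16 */
        printf("DBL_DIG     = %d\n", DBL_DIG);      /* 15 */
        printf("LDBL_DIG    = %d\n", LDBL_DIG);     /* 18 on x86 with the 80-bit format */
        return 0;
    }

(DBL_DIG is the conservative round-trip figure; the double mantissa itself
resolves just under 16 decimal digits, and 17 digits are needed to write a
double back out exactly, which is where the "16 to 17" above comes from.)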
 
Now, we do take liberties in the direct force calcs in that we use spline
tables for the calculation of direct forces and energies, and this lowers
the precision of those operations to as low as 11 decimal digits, but that's
as bad as it gets in terms of easily defined losses of precision (there are
lots of other things in the numerical methods that will cause the number to
be "not quite right"; heck, every transcendental function you use has some
of this).  Okay, but back to the parallel problem.
 
A parallel system, correctly implemented, ensures correct results in that
the correct operations will be done regardless of how the problem is
parallelized.  However, a given ordering of equivalent operations is not
guaranteed; e.g., a + b + c may be executed as (a + b) + c one time and as
a + (b + c) another time.  Same result, right?  NO, not in a floating point
number system, where at the limit of precision you may be off by 1 least
significant digit for equivalent operations done in different orders.
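
A two-line C experiment makes the point (my own illustration; the values
are deliberately extreme so the difference shows up in the first digit
rather than the last):

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        /* The same mathematical sum, associated two different ways. */
        printf("(a + b) + c = %.17g\n", (a + b) + c);  /* 1 */
        printf("a + (b + c) = %.17g\n", a + (b + c));  /* 0: the 1.0 is rounded away in b + c */
        return 0;
    }

In an MD force or energy sum the operands are not this pathological, so the
two orderings agree to 15 digits or so - but they do not agree exactly, and
that last digit is where the divergence starts.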
 
So, each step of dynamics in a 100K atom system takes about 1 second on an
Intel 3 GHz chip.  That is roughly 3 billion math operations (maybe really
half that, as there is a tremendous amount of integer bookkeeping and memory
movement, especially in parallel code).  But anyway, there are roughly 3
billion operations per step, and for pmemd results are typically
reproducible in parallel runs for between 300 and 500 steps.  So let's rough
that out: 3*10**9 operations/step * 333 steps is approximately 1*10**12 math
operations before there is a noticeable loss of precision in the 4th decimal
place.
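
If you want to watch the effect of operation count directly, here is a
little standalone C sketch (again mine, not pmemd's summation code) that
adds the same ten million numbers in two different orders:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 10000000;            /* ten million terms */
        double *x = malloc(n * sizeof *x);
        double fwd = 0.0, bwd = 0.0;
        int i;

        if (x == NULL) return 1;

        /* Reproducible pseudo-random values between 1.0 and 2.0. */
        srand(12345);
        for (i = 0; i < n; i++)
            x[i] = 1.0 + (double)rand() / RAND_MAX;

        for (i = 0; i < n; i++)      fwd += x[i];  /* one ordering */
        for (i = n - 1; i >= 0; i--) bwd += x[i];  /* the reverse ordering */

        printf("forward  = %.17g\n", fwd);
        printf("backward = %.17g\n", bwd);
        printf("relative difference = %g\n", (fwd - bwd) / fwd);

        free(x);
        return 0;
    }

The two sums typically disagree somewhere in the last few significant
digits; push the operation count up, and let the differences feed back into
the next step's coordinates the way an MD integrator does, and the
disagreement eventually works its way up into the digits you can see in an
output file.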
 
So it all boils down to 1) the tremendous number of calculations done, 2)
rounding error in the last decimal place (16th-17th) that varies depending
on the ordering of equivalent operations, and 3) network indeterminacy in
the order of addition of forces, energies, and virials (primarily).  What do
I mean by network indeterminacy?  There is no fixed order of i/o completion
in a distributed asynchronous operation - if one node stalls a little bit
due to a nondeterministic event (like some service timer pops and the OS
spends 10,000 cycles checking on something, or the interconnect services a
communications request less rapidly due to loading by jobs on other nodes) -
then the order of equivalent operations may vary, and the result of, say, an
addition of 3 numbers from 3 different nodes may differ by one least
significant digit in different runs.
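
In MPI terms the pattern looks something like the sketch below (this is NOT
pmemd's actual communication code, just the generic shape of a distributed
energy sum).  The standard does not pin down the association order used
inside the reduction, and a hand-rolled asynchronous accumulation - which is
closer to what an MD code really does - adds each node's contribution in
whatever order the messages happen to arrive:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local_energy, total_energy;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Stand-in for this node's partial energy; in a real MD code it
           would come from the force/energy evaluation over this node's
           atoms. */
        local_energy = 1.0 / (rank + 1.0);

        /* The contributions are mathematically order-independent, but the
           floating point result can differ in the last digit depending on
           the order in which they are actually combined. */
        MPI_Reduce(&local_energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total energy = %.17g\n", total_energy);

        MPI_Finalize();
        return 0;
    }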
 
Where else does grief arise?  Well, compile pmemd using FFTW for your FFTs,
and run a little experiment with the uniprocessor code.  Do 10 runs at
different times on your machine - all about 300 steps on something like
factor ix.  Look at the results; they MAY be slightly different.  Why?  FFTW
does short little timing tests to pick the best of several different
algorithms to use on a given run.  The algorithm it chooses may differ from
run to run, because for some there is not much difference, so depending on
what the various services on your workstation are doing, sometimes the
really optimal algorithm loses and another is chosen.  Now, fast Fourier
transforms are numerical methods, and most numerical methods are even less
precise than the number system representation (otherwise you are really
wasting time...), so the different algorithms get slightly different
results.
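
For the curious, that behaviour is tied to the planner flag you hand FFTW
when creating a plan.  A minimal sketch against the FFTW3 interface (the
version a particular Amber build links against may differ in detail):

    #include <fftw3.h>

    int main(void)
    {
        int i, n = 64;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* FFTW_MEASURE runs timing experiments to choose among candidate
           algorithms, so the choice can change from run to run with system
           load; FFTW_ESTIMATE skips the experiments and picks a plan
           heuristically, which is repeatable (and possibly slower). */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

        /* Fill the input AFTER planning - FFTW_MEASURE overwrites the arrays. */
        for (i = 0; i < n; i++) { in[i][0] = (double)i; in[i][1] = 0.0; }

        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }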
 
And I have not mentioned two other factors here: 1) different machines, and
2) my use of at least a dozen different codepaths (basically equivalent
algorithms, but once again the order of equivalent operations may vary)
depending on what is optimal for a given machine.
 
All these errors are considered insignificant because the real accuracy of
the results you are producing is far coarser than the precision errors here.
Absolute reproducibility is essential for developing code reliably, and I
have to be sure that all significant codepaths get exercised within the
"window of reproducibility" I have to work with, but the significance of the
precision-related errors in terms of the results is nil.  The general
handwaving done is to simply say we are "sampling different regions of phase
space", and we are.  Now, if you really want to worry about things that
significantly affect whether your results are "good", then I would recommend
worrying about 1) the quality of parameterization of the force field - all
current force fields are works in progress (and take a gander at the
precision of the empirical constants in the ff files), 2) step integration
issues - like timestep size, shake parameters, etc., and 3) sampling
statistics - did you really run long enough to know that the results you are
seeing are representative of the system?  How do you know?  (I leave it to
the folks who specialize in those areas to answer these sorts of questions
;-)).
 
Best Regards - Bob Duke
 
PS - The best article, in my opinion, on floating point issues is:

David Goldberg, "What Every Computer Scientist Should Know About
Floating-Point Arithmetic", ACM Computing Surveys 23 (1991), pp. 5-48.
 
----- Original Message -----
From: "Benjamin Juhl" <itbbju_at_itb.uni-stuttgart.de>
To: <amber_at_scripps.edu>
Sent: Monday, December 17, 2007 5:25 AM
Subject: Re: AMBER: pmemd: same rst, different result?
 
> Hi Anselm,
>
> we observed the same thing recently and found that the explanations
> from the following mail in the amber mailing list archive cover about
> all your questions:
> http://amber.ch.ic.ac.uk/archive/200401/0103.html
>
> Benjamin
>
>
> Anselm Horn wrote:
 
>> Dear all,
>>
>> we encountered a strange behaviour during a pmemd restart on our cluster
>> (infiniband network, job ran on 8 nodes with 4 processors each):
>>
>> A restart using a certain rst file crashed because there were numbers
>> too large for the output format and thus only asterisks appeared (known
>> problem). When we reran the MD calculation starting from the previous
>> rst file that also served as input for the trajectory that produced the
>> corrupted rst file, pmemd then produced a valid rst file at the end of
>> this new trajectory. Thus, the same rst file on the same (homogeneous)
>> cluster with the same executable yielded a different result. How can
>> this be explained?
>>
>> Is it possible that the parallel implementation is the reason for this
>> non-deterministic behaviour - or does this simply indicate a potential
>> hardware problem on some of the machines?
>>
>> Any hints are welcome.
>>
>> Regards,
>>
>> Anselm Horn
>> Bioinformatik
>> Emil-Fischer-Zentrum
>> Friedrich-Alexander-Universität Erlangen-Nürnberg
 
 
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu
 
 
  