AMBER Archive (2007)
Subject: Re: AMBER: pmemd: same rst, different result?
From: Robert Duke (rduke_at_email.unc.edu) 
Date: Mon Dec 17 2007 - 09:29:15 CST
 
 
 
 
I find it sort of interesting that this sort of question keeps coming up.
It worries me, though, that folks have not thought the points through on
this, and we have stated many times that there is an inherent indeterminacy
in longer (> 500 step) parallel MD runs that, while annoying, IS OF NO
SIGNIFICANCE WHATSOEVER.  The archive article linked below does not get the
explanation completely correct, and leads off with a statement that the
Amber MD code is single precision as an explanation - patently incorrect, I
assure you.  The code is ALL double precision - and double precision carries
between 16 and 17 decimal digits (the Intel and IBM chips actually have an
internal 80-bit representation that increases precision, as long as you stay
in fp registers, by something like a factor of a thousand (off the top of my
head)).  So this means that every fp arithmetic operation in Amber is off by
roughly 1 part in 10**16.  This is fairly high precision...
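
If you want to see those limits for yourself, a tiny standalone C snippet
(my own illustration, not anything from the Amber source) will print them:

    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
        /* DBL_EPSILON is the gap between 1.0 and the next representable
           double; DBL_DIG is the number of decimal digits guaranteed to
           survive a decimal -> double -> decimal round trip. */
        printf("DBL_EPSILON = %g\n", DBL_EPSILON);  /* about 2.2e-16 */
        printf("DBL_DIG     = %d\n", DBL_DIG);      /* 15 */
        printf("LDBL_DIG    = %d\n", LDBL_DIG);     /* 18 on x86 with the 80-bit format */
        return 0;
    }

(DBL_DIG is the conservative round-trip figure; the double mantissa itself
resolves just under 16 decimal digits, and 17 digits are needed to write a
double back out exactly, which is where the "16 to 17" above comes from.)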
 
Now, we do take liberties in the direct force calcs in that we use spline
tables for the calculation of direct forces and energies, and this lowers
the precision of those operations to as low as 11 decimal digits, but that's
as bad as it gets in terms of easily defined losses of precision (there are
lots of other things in the numerical methods that will cause the number to
be "not quite right"; heck, every transcendental function you use has some
of this).  Okay, but back to the parallel problem.
 
A parallel system, correctly implemented, ensures correct results in that
the correct operations will be done regardless of how the problem is
parallelized.  However, a given ordering of equivalent operations is not
guaranteed; e.g., a + b + c may be executed as (a + b) + c one time and as
a + (b + c) another time.  Same result, right?  NO, not in a floating point
number system, where at the limit of precision you may be off by 1 least
significant digit for equivalent operations done in different orders.
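
A two-line C experiment makes the point (my own illustration; the values
are deliberately extreme so the difference shows up in the first digit
rather than the last):

    #include <stdio.h>

    int main(void)
    {
        double a = 1.0e16, b = -1.0e16, c = 1.0;

        /* The same mathematical sum, associated two different ways. */
        printf("(a + b) + c = %.17g\n", (a + b) + c);  /* 1 */
        printf("a + (b + c) = %.17g\n", a + (b + c));  /* 0: the 1.0 is rounded away in b + c */
        return 0;
    }

In an MD force or energy sum the operands are not this pathological, so the
two orderings agree to 15 digits or so - but they do not agree exactly, and
that last digit is where the divergence starts.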
 
So, each step of dynamics in a 100K atom system takes about 1 second on an
Intel 3 GHz chip.  That is roughly 3 billion math operations (maybe really
half that, as there is a tremendous amount of integer bookkeeping and memory
movement, especially in parallel code).  But anyway, there are roughly 3
billion operations per step, and for pmemd results are typically
reproducible in parallel runs for between 300 and 500 steps.  So let's rough
that out: 3*10**9 operations/step * 333 steps is approximately 1*10**12 math
operations before there is a noticeable loss of precision in the 4th decimal
place.
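
If you want to watch the effect of operation count directly, here is a
little standalone C sketch (again mine, not pmemd's summation code) that
adds the same ten million numbers in two different orders:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int n = 10000000;            /* ten million terms */
        double *x = malloc(n * sizeof *x);
        double fwd = 0.0, bwd = 0.0;
        int i;

        if (x == NULL) return 1;

        /* Reproducible pseudo-random values between 1.0 and 2.0. */
        srand(12345);
        for (i = 0; i < n; i++)
            x[i] = 1.0 + (double)rand() / RAND_MAX;

        for (i = 0; i < n; i++)      fwd += x[i];  /* one ordering */
        for (i = n - 1; i >= 0; i--) bwd += x[i];  /* the reverse ordering */

        printf("forward  = %.17g\n", fwd);
        printf("backward = %.17g\n", bwd);
        printf("relative difference = %g\n", (fwd - bwd) / fwd);

        free(x);
        return 0;
    }

The two sums typically disagree somewhere in the last few significant
digits; push the operation count up, and let the differences feed back into
the next step's coordinates the way an MD integrator does, and the
disagreement eventually works its way up into the digits you can see in an
output file.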
 
So it all boils down to 1) the tremendous number of calculations done, 2)
rounding error in the last decimal place (16th-17th) that varies depending
on the ordering of equivalent operations, and 3) network indeterminacy in
the order of addition of forces, energies, and virials (primarily).  What do
I mean by network indeterminacy?  There is no fixed order of i/o completion
in a distributed asynchronous operation - if one node stalls a little bit
due to a nondeterministic event (like some service timer pops and the OS
spends 10,000 cycles checking on something, or the interconnect services a
communications request less rapidly due to loading by jobs on other nodes) -
then the order of equivalent operations may vary, and the result of, say, an
addition of 3 numbers from 3 different nodes may differ by one least
significant digit in different runs.
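
In MPI terms the pattern looks something like the sketch below (this is NOT
pmemd's actual communication code, just the generic shape of a distributed
energy sum).  The standard does not pin down the association order used
inside the reduction, and a hand-rolled asynchronous accumulation - which is
closer to what an MD code really does - adds each node's contribution in
whatever order the messages happen to arrive:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank;
        double local_energy, total_energy;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Stand-in for this node's partial energy; in a real MD code it
           would come from the force/energy evaluation over this node's
           atoms. */
        local_energy = 1.0 / (rank + 1.0);

        /* The contributions are mathematically order-independent, but the
           floating point result can differ in the last digit depending on
           the order in which they are actually combined. */
        MPI_Reduce(&local_energy, &total_energy, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total energy = %.17g\n", total_energy);

        MPI_Finalize();
        return 0;
    }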
 
Where else does grief arise?  Well, compile pmemd using FFTW for your FFTs,
and run a little experiment with the uniprocessor code.  Do 10 runs at
different times on your machine - all about 300 steps on something like
factor ix.  Look at the results; they MAY be slightly different.  Why?  FFTW
does short little timing tests to pick the best of several different
algorithms to use on a given run.  The algorithm it chooses may differ from
run to run, because for some there is not much difference, so depending on
what the various services on your workstation are doing, sometimes the
really optimal algorithm loses and another is chosen.  Now, fast Fourier
transforms are numerical methods, and most numerical methods are even less
precise than the number system representation (otherwise you are really
wasting time...), so the different algorithms get slightly different
results.
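
For the curious, that behaviour is tied to the planner flag you hand FFTW
when creating a plan.  A minimal sketch against the FFTW3 interface (the
version a particular Amber build links against may differ in detail):

    #include <fftw3.h>

    int main(void)
    {
        int i, n = 64;
        fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

        /* FFTW_MEASURE runs timing experiments to choose among candidate
           algorithms, so the choice can change from run to run with system
           load; FFTW_ESTIMATE skips the experiments and picks a plan
           heuristically, which is repeatable (and possibly slower). */
        fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

        /* Fill the input AFTER planning - FFTW_MEASURE overwrites the arrays. */
        for (i = 0; i < n; i++) { in[i][0] = (double)i; in[i][1] = 0.0; }

        fftw_execute(p);

        fftw_destroy_plan(p);
        fftw_free(in);
        fftw_free(out);
        return 0;
    }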
 
And I have not mentioned two other factors here: 1) different machines, and
2) my use of at least a dozen different codepaths (basically equivalent
algorithms, but once again the order of equivalent operations may vary)
depending on what is optimal for a given machine.
 
All these errors are considered insignificant because the real accuracy of
the results you are producing is far coarser than the precision errors here.
Absolute reproducibility is essential for developing code reliably, and I
have to be sure that all significant codepaths get exercised within the
"window of reproducibility" I have to work with, but the significance of the
precision-related errors in terms of the results is nil.  The general
handwaving done is to simply say we are "sampling different regions of phase
space", and we are.  Now, if you really want to worry about things that
significantly affect whether your results are "good", then I would recommend
worrying about 1) the quality of parameterization of the force field - all
current force fields are works in progress (and take a gander at the
precision of the empirical constants in the ff files), 2) step integration
issues - like timestep size, shake parameters, etc., and 3) sampling
statistics - did you really run long enough to know that the results you are
seeing are representative of the system?  How do you know?  (I leave it to
the folks who specialize in those areas to answer these sorts of questions
;-)).
 
Best Regards - Bob Duke
 
PS - The best article, in my opinion, on floating point issues is:

David Goldberg, "What Every Computer Scientist Should Know About
Floating-Point Arithmetic", ACM Computing Surveys 23 (1991), pp. 5-48.
 
----- Original Message -----
From: "Benjamin Juhl" <itbbju_at_itb.uni-stuttgart.de>
To: <amber_at_scripps.edu>
Sent: Monday, December 17, 2007 5:25 AM
Subject: Re: AMBER: pmemd: same rst, different result?
 
> Hi Anselm,
>
> we observed the same thing recently and found that the explanations
> from the following mail in the amber mailing list archive cover about
> all your questions:
> http://amber.ch.ic.ac.uk/archive/200401/0103.html
>
> Benjamin
>
>
> Anselm Horn wrote:
 
>> Dear all,
>>
>> we encountered a strange behaviour during a pmemd restart on our cluster
>> (infiniband network, job ran on 8 nodes with 4 processors each):
>>
>> A restart using a certain rst file crashed because there were numbers
>> too large for the output format and thus only asterisks appeared (known
>> problem). When we reran the MD calculation starting from the previous
>> rst file that also served as input for the trajectory that produced the
>> corrupted rst file, pmemd then produced a valid rst file at the end of
>> this new trajectory. Thus, the same rst file on the same (homogeneous)
>> cluster with the same executable yielded a different result. How can
>> this be explained?
>>
>> Is it possible that the parallel implementation is the reason for this
>> non-deterministic behaviour - or does this simply indicate a potential
>> hardware problem on some of the machines?
>>
>> Any hints are welcome.
>>
>> Regards,
>>
>> Anselm Horn
>> Bioinformatik
>> Emil-Fischer-Zentrum
>> Friedrich-Alexander-Universität Erlangen-Nürnberg
 
 
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu
 
 
  