AMBER Archive (2005)

Subject: Re: AMBER: pmemd - bug report

From: Robert Duke (rduke_at_email.unc.edu)
Date: Mon Jul 04 2005 - 15:30:54 CDT


Petr -
Thanks for the problem report. I have not yet thoroughly looked at
everything yet, but it looks to me like the whole problem hinges on whether
a fractional coordinate calculation can produce a result outside the range
of 0.0 to 0.999... (ie., if this assertion is correct, bad things will
follow elsewhere; no need to even check). Now, the basic fractional
coordinate calculation is of the form:
fract = f1 - dnint(f1) + 0.5d0
Have you actually trapped a value outside the range, or is this a postulate
to explain results? According to the definition of dnint() (f95 reference
standard) dnint(a) has the value of dint(a + 0.5) (actually the ref just
defines anint/aint, but dnint/dint are just the specific double precision
intrinsics), so for instance dnint(1.5) ALWAYS should round up 2.0; given
that 0.5 is precisely representable internally in binary, there should be no
rounding errors that can produce a 1.0 result from this calculation (for
negative values it is the same story, but dnint(a) has the value dint(a -
0.5). I have not thought about this in a while, but remember looking it
over fairly closely when the code was first done; the basic calculation and
assumptions also exist in sander; I think this stuff all derives from Tom
Darden's original pbc code, but maybe I am wrong about that. I'll do some
checks with ifort and your samples in the near future, but unless a
fractional is producing 1.0, this is not the problem. If dnint() does
produce such a result, then it is broken; there have been broken intel
library functions before, but one would hope that the math intrinsics are
right.
Regards - Bob Duke

----- Original Message -----
From: "Petr Kulhanek" <kulhanek_at_chemi.muni.cz>
To: <amber_at_scripps.edu>
Sent: Monday, July 04, 2005 12:33 PM
Subject: AMBER: pmemd - bug report

> Dear AMBER developers,
>
> I would like to report a bug in PMEMD code (AMBER 8.0).
>
> The bug is responsible that some atoms are incorrectly addressed to
> respective subcells. The bug occures only if some atoms are very close to
> box walls (confirmed) or if they are on subcell interfaces (just
> hypothesis). As the result, program can crash or the non-bonded energy is
> not calculated correctly. The bug probably depends on architecture and/or
> used compiler.
>
> Test cases (for IA32, ifort 8.1.024, and 1CPU) can be downloaded from:
>
> segmentation fault
> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/tst1.tar.gz
>
> infinity Eel
> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/tst2.tar.gz
>
> incorrect Eel
> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/tst3.tar.gz
>
> In each archive, six results are present:
>
> mdout_bug_reference - calculated by pmemd on coordinates that leads to
> incorrect results
> mdout_bug_imaged_reference - calculated by pmemd on coordinates, which
> were processed by ptraj (solut is centered and system is imaged)
> mdout_fix_reference - calculated by modified pmemd (see below) on
> coordinates that leads to incorrect results with non-modified pmemd
> mdout_fix_imaged_reference - calculated by modified pmemd (see below) on
> coordinates, which were processed by ptraj (solut is centered and system
> is imaged)
> mdout_sander_reference - calculated by sander on coordinates that leads to
> incorrect results with non-modified pmemd
> mdout_sander_imaged_reference - calculated by sander on coordinates, which
> were processed by ptraj (solut is centered and system is imaged)
>
>
> The bug is located in following files:
>
> ew_direct_cit.f90 -> subroutine get_fract_crds (confirmed)
> fraction could be equal to 1.0 due to round errors, however the code
> suppose that it has to be less than 1.0
>
> ew_direct_cit.f90 -> subroutine setup_crd_idx_lst_tbl (confirmed)
> indexes to crd_idx_lst_tbl array are not checked if they are in legal
> ranges
> if fractions calculated in get_fract_crds are equal to 1.0 then indexes
> overflow crd_idx_lst_tbl boundaries
>
> ew_pairlist.f90 -> subroutine get_nb_list (confirmed) and probably
> get_nb_list_nonorthog (not tested)
> indexes to x_bkts, y_bkts, and especially z_bkts arrays are not checked if
> they are in legal ranges
> array overflows could occur even if fractions are in legal ranges because
> the indexes are calculated directly from imaged coordinates (different
> approach -> different round errors)
>
> ew_recip_cit.f90 -> subroutine claim_recip_imgs (hypothesis) and probably
> claim_recip_imgs_nonorthog (not tested)
> z_bkt_lo and z_bkt_hi, the same as above
>
>
> I prepared two patches for testing purposes. (reference source is amber8
> distro + bugfixes 1-44)
> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/check.patch
> http://www.ncbr.chemi.muni.cz/~kulhanek/downloads/fix.patch
>
> The first one extends code with boundary check in above mentioned
> subroutines. If some overflow occures then the program is immediatelly
> stop with error message. The second one fixes the problem (boundary check
> is still present).
> fix.patch is only intended for testing purposes because it does not solve
> possible problems in simulations using truncated octahedral box and it was
> not tested under parallel execution.
>
> Some comments at the end:
>
> I found this bug when I did postprocessing analysis of trajectory
> calculated by pmemd (MD was run without any problems as I remember). In
> the analysis, the electrostatic energy is calculated for several system
> subparts still under PBC. In the reported cases, only solut and solvent
> are present (ions were stripped away), so the test cases represent rare
> systems. However observed errors depend only on coordinates and I do not
> see any reason why it could not occur during normal MD !
> I think that it is neccessary to carefully check the pmemd code and remove
> possible problems with reported round errors when subcell indexes are
> calculated. Secondly, it is also important to keep the parts, in which
> subcell indexes are calculated, as same as possible to avoid different
> round errors.
>
> For SANDER developers:
>
> Please look to tst1.tar.gz archive. mdout_sander_reference and
> mdout_sander_imaged_reference differ. The difference is about 3 kcal/mol
> in Eel as well as in Evdw. I think that this could be a bug in sander.
>
> Best regards,
> Petr
>
> --
> ##################################################
> Petr Kulhanek
> ------------------------------------------------
> kulhanek_at_chemi.muni.cz
> ------------------------------------------------
> National Centre for Biomolecular Research
> Masaryk University Brno
> Kotlarska 2, CZ-611 37 Brno
> Czech Republic
> ##################################################
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber_at_scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu
>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu