AMBER Archive (2004)

Subject: PMEMD and myrinet trouble

From: Chris Moth (Chris.Moth_at_vanderbilt.edu)
Date: Wed Mar 24 2004 - 15:25:57 CST


Hi Robert Duke:

Apologies in advance for what may be a prematurely posed question - but if
you have insight, it could save us a lot of time hunting down a problem.

I am minimizing a solvated protein/ligand system using PMEMD 3.03.

I am seeing reasonable, near-identical results on the following three
platforms:

A - pmemd run on my dual Xeon desktop (Debian Linux - Intel ifc 7.1)
B - pmemd run on our SGI RS12000 x 8 CPU cluster
C - pmemd run on two CPUs only (one board only) within our 16 CPU (8 dual
boards) PIII myrinet cluster (Linux - Intel ifc 7.1).

D - However, when I run 8 or 16 CPUs on our 16 CPU PIII myrinet cluster
(Linux), I get wildly divergent results - energies off by 10,000 and
100,000 kcal/mol compared to the other 3 platforms. Moreover, the .out
file states that a single solvent atom (which should be free to move in the
minimization) is consistently responsible for the highest positive
energy. So, I'm pretty sure that our multi-board myrinet run with PMEMD is
doing very bad things. But, I'm not getting any error messages from PMEMD
- just the disturbing discrepancies in output.

Everything about the minimizations is identical except the varying mpirun
commands required on the different platforms. Between platforms C and D I
only change the "-np" parameter from 2 to 8.
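
A minimal sketch of how the .out files from two of these runs could be
compared step by step. It assumes each minimization step is reported as a
header line beginning with NSTEP and ENERGY, followed by one line of values
(step, energy, RMS, GMAX, atom name, atom number). That layout, and the
names parse_steps and first_divergence, are assumptions for illustration
rather than anything taken from PMEMD itself, so adjust them to the actual
files.

```python
import re

# Assumed layout: a header line starting with NSTEP and ENERGY, then one
# line holding the values for that minimization step.
HEADER = re.compile(r"^\s*NSTEP\s+ENERGY\b")


def parse_steps(text):
    """Return a list of (nstep, energy, atom_name, atom_number) tuples."""
    steps = []
    lines = text.splitlines()
    for i, line in enumerate(lines[:-1]):
        if not HEADER.match(line):
            continue
        fields = lines[i + 1].split()
        if len(fields) < 6:
            continue  # values line does not match the assumed layout
        try:
            steps.append(
                (int(fields[0]), float(fields[1]), fields[4], int(fields[5]))
            )
        except ValueError:
            continue  # skip lines whose fields are not numeric as expected
    return steps


def first_divergence(run_a, run_b, tol=1.0):
    """Return (nstep, energy_a, energy_b) for the first step present in both
    runs whose energies differ by more than tol, or None if there is none."""
    b_by_step = {step[0]: step[1] for step in run_b}
    for nstep, energy, _name, _number in run_a:
        if nstep in b_by_step and abs(energy - b_by_step[nstep]) > tol:
            return (nstep, energy, b_by_step[nstep])
    return None
```

Reading the 2 CPU and 8 CPU .out files into strings and passing the parsed
steps to first_divergence would show the first reported step at which the
two runs part ways, and the atom name and number in each tuple show which
atom is flagged at that step.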

In short... any suggestions on how we might troubleshoot pmemd on
myrinet/linux would be greatly appreciated. (I don't personally maintain
the hardware here - so I'm looking for concrete ideas to forward to our
staff who do). If you'd like to look at any of the simulation files, I can
email them to you directly - but they are far too large to post to the
mailing list.

If you strongly suspect this is a hardware problem on our end, I suppose
running sander and looking for similar trouble would be a good next step.

Any advice appreciated.

Thanks as always

Chris