AMBER Archive (2004)

Subject: Re: AMBER: PMEMD and myrinet trouble

From: Robert Duke (rduke_at_email.unc.edu)
Date: Wed Mar 24 2004 - 15:56:10 CST


Chris -
I would first suspect the hardware, based on my past experience with
myrinet. I have seen the UNC facility give spurious results when there was
failing myrinet hardware or cable problems (I have seen this a couple of
times, once when the cables were being messed with, and once when there were
failing myrinet components). Not something that makes one happy, by any
means. I would check with your hardware folks, see if the cables maybe got
disturbed, see if any of the myrinet gear fails diagnostics, etc. I presume
since these are PIII's it is older hardware? Past that, I think all you can
do is more runs, looking for repeatability in pmemd, and looking for
problems in sander. I am not aware of any pmemd 3.03 software problems, but
we all know that is not a guarantee. I have generally been somewhat
confident because of extensive testing from low to high processor count
(128+) on ibm sp3's and sp4's. I have not seen any problems there, ever,
but that is pretty solid hardware. Please let me know if you find anything
else out.
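A minimal sketch of the repeatability check, assuming the per-step energies
have already been pulled out of each run's output into plain lists of
numbers. The function name, the tolerance, and the sample values are
hypothetical and are not part of pmemd or sander. Parallel runs will differ
slightly from run to run because of floating point summation order, so the
tolerance would need tuning to what your good platforms show:

```python
import math


def compare_energies(reference, candidate, rel_tol=1e-6):
    """Return (step, ref, cand) for every step where two runs disagree.

    rel_tol is an assumed starting point, not a recommended value.
    """
    if len(reference) != len(candidate):
        raise ValueError("runs have different numbers of steps")
    mismatches = []
    for step, (ref, cand) in enumerate(zip(reference, candidate)):
        # Flag the step if the energies differ by more than the tolerance.
        if not math.isclose(ref, cand, rel_tol=rel_tol):
            mismatches.append((step, ref, cand))
    return mismatches


# Illustrative numbers only, not taken from any real run.
run_2cpu = [-1.0e5, -1.2e5, -1.3e5]
run_8cpu = [-1.0e5, -2.0e4, 9.0e4]

for step, ref, cand in compare_energies(run_2cpu, run_8cpu):
    print(f"step {step}: {ref:.4e} vs {cand:.4e}")
```

Comparing the same input across repeated runs at one processor count, and
then across processor counts, should show whether the divergence is
repeatable or comes and goes.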
Regards - Bob
----- Original Message -----
From: "Chris Moth" <Chris.Moth_at_vanderbilt.edu>
To: <amber_at_scripps.edu>
Sent: Wednesday, March 24, 2004 4:25 PM
Subject: AMBER: PMEMD and myrinet trouble

> Hi Robert Duke:
>
> Apologies in advance for what may be a prematurely posed question - but if
> you have insight, it could save us a lot of time hunting down a problem.
>
> I am minimizing a solvated protein/ligand system using PMEMD 3.03.
>
> I am seeing reasonable, near-identical results on the following three
> platforms:
>
> A - pmemd run on my dual Xeon desktop (Debian Linux - intel ifc 7.1)
> B - pmemd run on our SGI RS12000 x 8cpu cluster
> C - pmemd run on two CPUs only (one board only) within our 16 cpu ( 8 dual
> boards) PIII myrinet cluster (Linux intel ifc 7.1).
>
> D - However, when I run 8 or 16 CPUs on our 16 cpu PIII myrinet cluster
> (Linux), I get wildly divergent results - energies off by 10,000 and
> 100,000 kcal/mol compared to the other 3 platforms. Moreover, the .out
> file states that a single solvent atom (which should be free to move in
> the minimization) is continuously responsible for the highest positive
> energy. So, I'm pretty sure that our multi-board myrinet run with PMEMD
> is doing very bad things. But, I'm not getting any error messages from
> PMEMD - just the disturbing variances in output.
>
> Everything about the minimizations is identical except the varying mpirun
> commands required on the different platforms. Between platforms C and D I
> only change the "-np" parameter from 8 to 2.
>
> In short... any suggestions on how we might troubleshoot pmemd on
> myrinet/linux would be greatly appreciated. (I don't personally maintain
> the hardware here - so I'm looking for concrete ideas to forward to our
> staff who do). If you'd like to look at any of the simulation files, I
> can email them to you directly - but it is far too much to post out on
> the mail list.
>
> If you strongly suspect this is a hardware problem on our end, I suppose
> running sander and looking for similar trouble would be a good next step.
>
> Any advice appreciated.
>
> Thanks as always
>
> Chris
>
> -----------------------------------------------------------------------
> The AMBER Mail Reflector
> To post, send mail to amber_at_scripps.edu
> To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu
>
