AMBER Archive (2002)

Subject: Problems with parallel computing in Amber7

From: Vincent BOSQUIER (vbosquier_at_nimes.syntem.com)
Date: Tue May 14 2002 - 09:28:10 CDT


Hi all,

I have some problems with Amber7 that runs on my IBM/Intel/Linux cluster.
When I run sander on more than 16 processors, it returns "nan" values in my .out and .traj files. During a series of tests, my $DO_PARALLEL environment variable was set successively to:
a) $MPICH_HOME/bin/mpirun -np 16 -machinefile /<my_home_directory>/mpich_machines.16CPU_LINUX0108
b) $MPICH_HOME/bin/mpirun -np 16 -machinefile /<my_home_directory>/mpich_machines.16CPU_LINUX1118
c) $MPICH_HOME/bin/mpirun -np 64 -machinefile /<my_home_directory>/mpich_machines.76CPU_LINUX
d) $MPICH_HOME/bin/mpirun -np 32 -machinefile /<my_home_directory>/mpich_machines.32CPU_LINUX0116
e) $MPICH_HOME/bin/mpirun -np 32 -machinefile /<my_home_directory>/mpich_machines.32CPU_LINUX1736
f) $MPICH_HOME/bin/mpirun -np 8 -machinefile /<my_home_directory>/mpich_machines.8CPU_LINUX3740
g) $MPICH_HOME/bin/mpirun -np 4 -machinefile /<my_home_directory>/mpich_machines.4CPU_LINUX4142
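
Each setting above pairs a -np count with a machine file, so one quick sanity check is to compare the two. Below is a minimal sketch of such a check. It assumes the machine file uses one "host" or "host:ncpus" entry per line and that "#" starts a comment. The host names in the example are hypothetical, and count_slots is an illustrative helper, not part of MPICH or Amber.

```python
def count_slots(machinefile_text):
    """Return the total number of CPU slots listed in a machine file.

    Assumed format: one 'host' or 'host:ncpus' entry per line,
    with '#' starting a comment. A bare host counts as one slot.
    """
    total = 0
    for raw in machinefile_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        _host, _, ncpu = line.partition(":")
        total += int(ncpu) if ncpu else 1
    return total


# Hypothetical machine file contents (host names are made up).
example = "node01:2\nnode02:2\n# spare\nnode03\n"
print(count_slots(example))  # 5
```

Comparing this total with the value passed to -np shows whether a run asks for more or fewer processes than the machine file offers.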

Here is what I obtained:
a) no problem.
b) no problem.
c) Error on processor 29 ("inhomogenous ...") and the run stopped during the 1st picosecond of simulation (note: my machine file listed 76 available CPUs distributed across 38 servers, and I ran sander with "mpirun -np 64"; could this cause problems?).
d) Sander returns "nan" (not-a-number?) values during the 8th picosecond of simulation.
e) Sander returns "nan" values during the 8th picosecond of simulation.
f) No apparent problem after 10 picoseconds of simulation.
g) No apparent problem after 10 picoseconds of simulation.
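
Since the "nan" values appear without any on-screen warning, a small script can locate where they first show up in an output file. This is only a sketch: it assumes the .out file prints energy records introduced by a line containing "NSTEP = <integer>", and first_nan is a hypothetical helper, not an Amber utility. The sample lines are invented for illustration and are not taken from my runs.

```python
import re

NSTEP_RE = re.compile(r"NSTEP\s*=\s*(\d+)")
NAN_RE = re.compile(r"\bnan\b", re.IGNORECASE)


def first_nan(lines):
    """Return (line_number, last_nstep_seen) for the first line that
    contains 'nan', or None if no such line exists.

    line_number is 1-based; last_nstep_seen is None if no NSTEP
    record (assumed format) was seen before the first 'nan'.
    """
    last_step = None
    for lineno, line in enumerate(lines, start=1):
        match = NSTEP_RE.search(line)
        if match:
            last_step = int(match.group(1))
        if NAN_RE.search(line):
            return lineno, last_step
    return None


# Invented sample records, for illustration only.
sample = [
    " NSTEP =      100   TIME(PS) =       0.200  TEMP(K) =   298.50",
    " Etot   =    -1234.5678  EKtot   =      321.0000",
    " NSTEP =      200   TIME(PS) =       0.400  TEMP(K) =      nan",
    " Etot   =           nan  EKtot   =           nan",
]
print(first_nan(sample))  # (3, 200)
```

Running this over the output of each test would show whether the first "nan" always appears at the same step for a given number of CPUs.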

I also get stty warnings, because my users' .cshrc uses stty to bind the interrupt signal to CTRL+C.
Finally, I have noticed that my sander runs are very slow on the cluster: a test on 16 CPUs distributed across 8 Linux machines in my cluster runs almost as slowly as on my 4-CPU SGI Origin200 server.

So here are my questions:
- What could cause this problem, which appears when I increase the number of CPUs?
- What does the following Amber7 error message mean: "Processor 29. System must be very inhomogenous. Readjusting recip sizes. etc..."?
- Why do we get "nan" values? And why are those "nan" values returned without any visible warning or error message on screen?

- What impact could the stty settings in my .cshrc have on Amber7? Is it really important to unset stty before running sander?
- Why is the cluster-parallel version of Amber7 so slow?

Thanks in advance for your help.

Vincent.

---------------------------------------------------------------------
Vincent Bosquier
IT Engineer

Synt:em
Computational Drug Discovery
Parc Scientifique G.Besse
Allee Charles Babbage
30035 Nimes Cedex 1
France

E-mail: vbosquier_at_syntem.com
Direct line: +33 (0)466 042 294
Switchboard: +33 (0)466 048 666
Fax: +33 (0)466 048 667
---------------------------------------------------------------------
Discover New Drugs, Discover Synt:em
        http://www.syntem.com
---------------------------------------------------------------------