AMBER Archive (2005)

Subject: AMBER: PMEMD on Opterons; building/benchmarks/recommendations

From: Robert Duke (rduke_at_email.unc.edu)
Date: Tue Mar 01 2005 - 13:14:56 CST


Folks -
I have just completed a rather extensive study of what works and doesn't
work for pmemd on opterons, and have done a bunch of benchmarks to boot.
All this work was done on a large opteron (1.4 GHz, so not the fastest
currently available) cluster running 64 bit GNU/Linux, using mpich-gm (over
Myrinet) as well as mpich 1.2.6 and mpich2 (both over gigabit ethernet).
Thanks to Tom Cheatham and the folks at U of Utah for access to this nice
resource. As a preamble to the benchmarks, let me say that this stuff is
incredibly fast for a linux cluster despite the slower processor speeds; I
got 1.99 nsec/day for factor ix (~91k atoms) and 5.4 nsec/day for the jac
benchmark (~23.5k atoms). I
will provide here:

1) the recommendation that folks use pathscale compilers until pgi compilers
are proven to actually work.
2) suggested configurations for the pathscale compilers.
3) benchmarking data for pathscale compilers.
4) recommendations for the slower interconnects.

I have not yet tried using intel ifort to build a 64 bit executable for an
opteron system, but will do that soon.

Okay, I started off trying to build pmemd with PGI because there were issues
"out there" about how well PGI + pmemd works on other opteron hardware.
PGI has a web page on amber, and I ultimately followed their suggestions,
building a 64 bit executable (no -tp k8-32, use the right (64 bit) pgf90 and
pgcc executables, and link in all 64 bit libraries). I started off thinking
I was having trouble with mixed 32/64 bit libraries, but this was not the
case (confirmed with the file and ldd commands; a quick check along these
lines is sketched after the list below). The PGI version at U of Utah is
5.2-1. When I ran pmemd built with PGI, I noted 3 things:
1) if you attempt to write coordinates in the jac benchmark (add an ntwx
param to mdin), then pmemd gets stuck, rapidly writing blanks to the mdcrd
output file. This is a serious problem with the file i/o library, and I
found no workarounds. PMEMD may be more susceptible to this problem than
sander because it uses f90-style i/o statements rather than f77 implied
do's, but come on folks, the f90 spec is now 15 years old. As an aside, pgi
reports problems with 5.2-1 and 5.2-2 that deal with file reads for the
cytosine and pure_wat tests in the amber test suite. I think they recommend
getting the latest update, 5.2-4.
2) The cpu_time() f90 library routine seems to have a resolution of about 8
seconds.
3) If you disable all output so you can run, pgi-compiled pmemd (compiled
per their suggestions, which I basically agree with) is about 3-4% slower
than the pathscale f90 build on a single processor.
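
For reference, a quick way to check for the mixed 32/64 bit problem, run
against whatever pmemd binary you just built (the exact wording of the file
output varies a bit with the file utility version):

  file ./pmemd
    - should report an ELF 64-bit executable, AMD x86-64
  ldd ./pmemd
    - every library listed (mpich, gm, libc, etc.) should resolve to a 64 bit
      copy, typically under a .../lib64 directory; a 32 bit entry or "not
      found" means a mixed build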

Okay, so the bottom line, if you must use pgi, is:
1) Get the latest update, at least 5.2-4
2) Use the recommendations on their "building amber" web page
http://www.pgroup.com/resources/amber/amber8_pgi52.htm
3) Be incredibly wary of problems with mixed 64 and 32 bit components, all
the way out to the interconnect.
4) Test for i/o problems using the amber test suite, and jac with ntwx=25
added to the mdin &cntrl section (a minimal sketch of this follows the list).
5) Oh, and by the way you need to pick up amber 8 bugfix.35, which patches a
variable name problem that pgi has (that no other compiler seems to have).
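
For item 4, a minimal sketch of the jac change (keep the benchmark's mdin
exactly as distributed and just add the ntwx line inside the &cntrl
namelist; ntwx=25 writes coordinates to mdcrd every 25 steps, which is
frequent enough to trip the i/o problem quickly if it is there):

 &cntrl
   ... existing jac parameters unchanged ...
   ntwx=25,
 /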

To folks who already have an investment in this compiler, all I can really
say is that it probably will work correctly at some point in time, and PGI
WANTS their stuff to build all of amber correctly; unfortunately 5.2-1
cannot do this yet.

SO, for now I heartily recommend the pathscale compiler (www.pathscale.com -
an evaluation copy is available), which I did not have much trouble with at
all. I have done extensive benchmarking with pmemd built for uniprocessor
use and built against the mpich 1.2.6, mpich2, and mpich-gm interconnects.
I have done regression testing (22 tests) for 1 and
4 processors and all pass. To build pmemd using the pathscale compilers,
one has to do the following:
1) Pick up the patch file for pmemd_clib.c which is out on the
amber.scripps.edu website as bugfix.38 (thanks, Dave!) and use this to patch
amber8/src/pmemd/src/pmemd_clib.c. What you are doing here is providing
another name mangling convention between f90 and c (we have to do it this
way instead of using option switches due to the need to link to mpi
libraries); an illustration of the idea is sketched after these build steps.

2) Pick up one of the new config_data files embedded as text below (be
careful to rejoin any lines the mail wrapped; each setenv must be on one
line), and run the pmemd ./configure (the configure in the pmemd dir) on
this file.

a) for mpich_gm, place the following content in
config_data/linux64_opteron.pathf90.mpich_gm:

setenv MPICH_INCLUDE $MPICH_HOME/include
setenv MPICH_LIBDIR $MPICH_HOME/lib
setenv MPILIB "-L$MPICH_LIBDIR -lmpichf90 -lmpich -L$MPICH_LIBDIR2 -lgm -Wl,-rpath $MPICH_LIBDIR2 -lpthread"
setenv PREPROCFLAGS "-DDBL_C_UNDERSCORE -DMPI -DDIRFRC_BIGCACHE_OPT"
setenv CPP "/lib/cpp -traditional -I$MPICH_INCLUDE"
setenv OPT_LO "pathf90 -c -O1 -mtune=opteron -msse -msse2"
setenv OPT_MED "pathf90 -c -O2 -mtune=opteron -msse -msse2"
setenv OPT_HI "pathf90 -c -O3 -OPT:Ofast -mtune=opteron -msse -msse2"
setenv LOAD "pathf90"
setenv LOADLIB "$MPILIB"
setenv CC "pathcc -c -O2 -mtune=opteron -msse -msse2"

- and run ./configure linux64_opteron pathf90 mpich_gm

b) for mpich (1.2.6), place the following content in
config_data/linux64_opteron.pathf90.mpich:
setenv MPICH_INCLUDE $MPICH_HOME/include
setenv MPICH_LIBDIR $MPICH_HOME/lib
setenv MPILIB "-L$MPICH_LIBDIR -lmpich -lpmpich"
setenv PREPROCFLAGS "-DDBL_C_UNDERSCORE -DMPI -DSLOW_NONBLOCKING_MPI -DDIRFRC_BIGCACHE_OPT"
setenv CPP "/lib/cpp -traditional -I$MPICH_INCLUDE"
setenv OPT_LO "pathf90 -c -O1 -mtune=opteron -msse -msse2"
setenv OPT_MED "pathf90 -c -O2 -mtune=opteron -msse -msse2"
setenv OPT_HI "pathf90 -c -O3 -OPT:Ofast -mtune=opteron -msse -msse2"
setenv LOAD "pathf90"
setenv LOADLIB "$MPILIB"
setenv CC "pathcc -c -O2 -mtune=opteron -msse -msse2"

- and run ./configure linux64_opteron pathf90 mpich

c) for mpich2 (version 1), place the following content in
config_data/linux64_opteron.pathf90.mpich2:

setenv MPICH_INCLUDE $MPICH_HOME/include
setenv MPICH_LIBDIR $MPICH_HOME/lib
setenv MPILIB "-L$MPICH_LIBDIR -lmpichf90 -lmpich -lrt "
setenv PREPROCFLAGS "-DDBL_C_UNDERSCORE -DMPI -DDIRFRC_BIGCACHE_OPT"
setenv CPP "/lib/cpp -traditional -I$MPICH_INCLUDE"
setenv OPT_LO "pathf90 -c -O1 -mtune=opteron -msse -msse2"
setenv OPT_MED "pathf90 -c -O2 -mtune=opteron -msse -msse2"
setenv OPT_HI "pathf90 -c -O3 -OPT:Ofast -mtune=opteron -msse -msse2"
setenv LOAD "pathf90"
setenv LOADLIB "$MPILIB"
setenv CC "pathcc -c -O2 -mtune=opteron -msse -msse2"

- and run ./configure linux64_opteron pathf90 mpich2

Several notes about mpich2:
1) It is significantly faster than mpich if you have more than 2 processors.
2) Jobs must be invoked completely differently (via the mpd daemons and
mpiexec), so check the manual or site guide; a rough example follows below.
3) The configure script shipped with rel 8 for pmemd won't actually work
above without a minor modification. Change line 139 from:
if [ $parallel = mpich ]; then
to:
if [ $parallel = mpich -o $parallel = mpich2 ]; then

Sorry for the kludge. I will probably release a new configure script with a
bunch of updated config_data files in the near future, so I did not issue a
bugfix for this.
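
To expand on note 2: mpich2 (ver 1) uses the mpd daemons and mpiexec rather
than the old mpirun. Roughly (the hostfile name, daemon count, process
count, and amber file names here are just placeholders for whatever your
site uses):

mpdboot -n 8 -f mpd.hosts
mpiexec -n 32 ./pmemd -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt
mpdallexit

versus a classic mpich 1.2.6 run along the lines of

mpirun -np 32 -machinefile hosts ./pmemd -O -i mdin -o mdout -p prmtop -c inpcrd -r restrt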

d) for a uniprocessor version, place the following content in
config_data/linux64_opteron.pathf90.nopar:

setenv PREPROCFLAGS "-DDBL_C_UNDERSCORE -DDIRFRC_BIGCACHE_OPT"
setenv CPP "/lib/cpp -traditional"
setenv OPT_LO "pathf90 -c -O1 -mtune=opteron -msse -msse2"
setenv OPT_MED "pathf90 -c -O2 -mtune=opteron -msse -msse2"
setenv OPT_HI "pathf90 -c -O3 -OPT:Ofast -mtune=opteron -msse -msse2"
setenv LOAD "pathf90"
setenv LOADLIB ""
setenv CC "pathcc -c -O2 -mtune=opteron -msse -msse2"

- and run ./configure linux64_opteron pathf90 nopar

Okay, it's a mess but it will work...
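
As an aside on the pmemd_clib.c patch in step 1: the general issue is that
the C routines in pmemd are called from f90, so they have to be exported
under whatever trailing-underscore name mangling the fortran compiler
expects, and the -DDBL_C_UNDERSCORE preprocessor flag in the configs above
selects a doubled-underscore variant for pathf90. Purely as an illustration
of the idea (this is NOT the contents of bugfix.38, and the routine name and
signature here are made up for the example):

#include <sys/time.h>
#include <stddef.h>

/* Export the routine under the external name the f90 compiler will look for. */
#ifdef DBL_C_UNDERSCORE
void get_wall_time__(int *sec, int *usec)   /* doubled underscore, e.g. pathf90 */
#else
void get_wall_time_(int *sec, int *usec)    /* single underscore, e.g. pgf90 */
#endif
{
  struct timeval wt;

  gettimeofday(&wt, NULL);      /* wall clock time since the epoch */
  *sec  = (int) wt.tv_sec;
  *usec = (int) wt.tv_usec;
}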

Now, why bother? Well, this stuff is really fast. The mpich_gm interface
is recommended, as it produces the best scaling. First I will give a couple
of benchmarks for pmemd using the mpich_gm interconnect, and then show some
benchmarks for mpich and mpich2. You would be nuts not to use mpich_gm on
this hardware, but I present data on the other two interconnects in case that
is all you have available. I would now strongly recommend mpich2
(www-unix.mcs.anl.gov/mpi/mpich2) if you have gigabit ethernet, at least
from a performance perspective.

All tests except as noted are:
1.4 GHz opterons running pmemd rel 8 at U of Utah.

Factor ix benchmark (constant pressure), 90906 atoms pme, explicit solvent
simulation, mpich_gm interconnect (in this and all the tables below, scaling
is relative to the 2-processor run):

#proc psec/day scaling(%)

 1 60 -
 2 111 100
 4 215 96
 8 398 89
16 749 84
24 974 73
32 1336 75
40 1542 69
48 1662 62
56 1851 59
64 1994 54

JAC (joint amber-charmm) benchmark (constant volume), 23558 atoms pme,
explicit solvent simulation, mpich_gm interconnect; here I use default
skinnb values, which is a fair way to run this test (it has no effect on
output, it is a performance optimization, and specifying the wrong value can
de-optimize the code):

#proc pmemd psec/day scaling(%) sander 8 psec/day
  1 124 - 115
  2 235 100 231
  4 461 98 424
  8 873 93 752
 16 1630 87 1170
 32 2979 79 1464
 48 4019 71 nd
 56 4547 69 nd
 64 4800 64 nd
 72 5400 64 nd

These are clearly the best numbers I have seen for a linux cluster with a
(relatively) cheap interconnect.

Okay, what about GB ethernet? Well, mpich2 (ver 1) does fairly well,
considering. These tests are again with factor ix (90906 atoms, const
pressure), which uses large fft's and beats the heck out of the
interconnect. Still, these are the best numbers I have seen for this sort
of interconnect:

Factor ix benchmark (constant pressure), 90906 atoms pme, explicit solvent
simulation, mpich2 (ver 1) interconnect with GB ethernet, DON'T
USE -DSLOW_NONBLOCKING_MPI:

#proc psec/day scaling(%)

  1 60 -
  2 109 100
  4 193 89
  6 273 84
  8 350 81
 12 499 76
 16 642 74
 24 864 66
 32 997 57

NOT using -DSLOW_NONBLOCKING_MPI is optimal over the range of 4-32 procs
with mpich2.

Finally, here are some numbers for mpich 1.2.6:

Factor ix benchmark (constant pressure), 90906 atoms pme, explicit solvent
simulation, mpich (ver 1.2.6) interconnect with GB ethernet, DO
USE -DSLOW_NONBLOCKING_MPI:

#proc psec/day scaling(%)

  1 60 -
  2 109 100
  4 184 85
  6 252 77
  8 322 74
 12 450 69
 16 554 64
 24 720 55
 32 864 50

USING -DSLOW_NONBLOCKING_MPI is optimal over the range of 12-32 processors.
For fewer processors, you will do about as well or slightly better without
using -DSLOW_NONBLOCKING_MPI.
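
For reference, -DSLOW_NONBLOCKING_MPI is controlled entirely by the
PREPROCFLAGS line in the config_data file used to build pmemd (it appears in
the mpich file above and not in the mpich2 or mpich_gm files). To switch,
edit that line, rerun the pmemd ./configure as above, and rebuild (a clean
rebuild is safest since the flag affects preprocessing); i.e. use

setenv PREPROCFLAGS "-DDBL_C_UNDERSCORE -DMPI -DSLOW_NONBLOCKING_MPI -DDIRFRC_BIGCACHE_OPT"

for mpich over gigabit ethernet, and

setenv PREPROCFLAGS "-DDBL_C_UNDERSCORE -DMPI -DDIRFRC_BIGCACHE_OPT"

for mpich2 or mpich_gm.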

I am going to check out ifort on the opteron and also update the various
linux configurations in the near future, and will hopefully be able to make
a tarball of config info available.

Regards - Bob Duke

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu