AMBER Archive (2005)

Subject: RE: AMBER: Interpreting Amber8 Benchmark Results

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Wed Aug 31 2005 - 15:09:20 CDT


Dear Nikhil,

> I am trying to run Amber8 on EM64T with Intel Compilers. I have never
> done this before and needed help to understand how to interpret the
> benchmark data. The primary purpose of the benchmarks is to understand
> how using different interconnects such as GbE, InfiniBand, 10 GigE may
> affect the application performance.
>
> - What number should I be looking at from the benchmark summary report
> (attached)?

Ultimately you should be looking at the total time reported at the end of
the output file. This is actually Wall Clock Time (NOT CPU TIME) so it will
give you a real understanding of 'time to solution' on a specific machine
setup. You can convert this to picoseconds of molecular dynamics per day
by:

ps per day = nstlim * dt * 86400 / total_time

where nstlim is the number of MD steps, dt is the time step in picoseconds
and total_time is the reported wall clock time in seconds.
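
For illustration, a quick Python sketch of this calculation (the values
below are placeholders, not real benchmark numbers - take nstlim and dt
from your mdin file and total_time from the end of the mdout file):

# Hypothetical values, purely for illustration
nstlim = 1000        # number of MD steps in the benchmark (from mdin)
dt = 0.002           # time step in picoseconds (from mdin)
total_time = 350.0   # wall clock seconds reported at the end of the run

ps_simulated = nstlim * dt                        # 2 ps of dynamics
ps_per_day = ps_simulated * 86400.0 / total_time
print("%.1f ps/day" % ps_per_day)                 # ~493.7 ps/day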

If you are interested in parallel performance then things get a little more
complicated. You can still run on, say, 1 cpu, 2 cpus, 4 cpus etc. and the
total time reported will still be the total wall clock time.
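
From those wall clock times it is straightforward to work out speedup and
parallel efficiency relative to the single-cpu run. A small sketch (the
times in the dictionary are made up - substitute your own measurements):

# Hypothetical wall clock times (seconds) for the same benchmark on
# 1, 2, 4 and 8 cpus - replace with the totals from your own runs.
times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0}

t1 = times[1]  # single-cpu time is the baseline
for ncpu in sorted(times):
    speedup = t1 / times[ncpu]
    efficiency = speedup / ncpu
    print("%2d cpus: speedup %.2f, efficiency %.0f%%"
          % (ncpu, speedup, 100 * efficiency))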

Also reported will be certain communication times such as FFT communication
time, CRD distribute time, FRC collect time etc. You will find that these
grow as the number of cpus grows. Ultimately these times dominate at large
cpu counts, leading to simulations actually taking longer on a larger number
of cpus.

The easiest way to compare simulations with different interconnects is, if
you can, to connect up the exact same system (cpu, memory, disks etc.) with
each of the different interconnects and then run the major benchmarks. For
each interconnect you can then plot wall clock time as a function of the
number of cpus. You will then see how each interconnect performs and where
its limit lies. E.g. gigabit ethernet typically chokes beyond 16 cpus while
InfiniBand will perform much better.

Once you have this it is an easy step to produce price/performance ratios
etc.
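
One way to produce such a plot is with matplotlib; a minimal sketch is
below. The timing numbers here are invented purely to show the shape of the
curves - use your own measured wall clock times:

import matplotlib
matplotlib.use("Agg")  # write to file, no display needed
import matplotlib.pyplot as plt

# Hypothetical wall clock times (seconds) per interconnect.
cpus       = [1, 2, 4, 8, 16, 32]
gige       = [1000, 520, 290, 190, 160, 175]   # flattens, then turns up
infiniband = [1000, 515, 265, 140,  80,  50]

plt.plot(cpus, gige, "o-", label="GigE")
plt.plot(cpus, infiniband, "s-", label="InfiniBand")
plt.xlabel("Number of cpus")
plt.ylabel("Wall clock time (s)")
plt.legend()
plt.savefig("interconnect_scaling.png")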

> - What tests are more relevant to a customer who may be looking to
> purchase a cluster to run Amber as a primary application?

This really depends on the type of simulations they will be doing. Typically
the bigger the system is the better it scales in parallel. If they are
largely planning on running simulations using periodic boundaries (PME) then
you should look closely at the DHFR benchmark, which has 22930 atoms, and
also the much larger factor_ix benchmark, which has 90906 atoms. Note that
for these to run as efficiently as possible in parallel you should use the
PMEMD module of Amber, which is designed specifically to run PME
calculations efficiently in parallel. See the Amber website for details on
compiling PMEMD on different systems. (BEWARE: while sander v8's total time
is wall clock time, PMEMD's total time is NOT. This can be misleading since
on big systems the efficiency with which the cpus are used can drop rapidly,
and so the reported cpu time can often be much less than the REAL wall clock
time. Thus, to understand the true throughput of the cluster, you should
calculate the wall clock time by subtracting the reported start time from
the reported end time. This is also true of any program you benchmark: the
real measure of speed is the wall clock time to get the job done. The cpu
time is largely irrelevant to the customer...)
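
If you record the start and end times yourself (for example with `date` in
the submission script, or from the timestamps your output happens to
contain - the exact format will differ), the subtraction is trivial. A
sketch with made-up timestamps in an assumed MM/DD/YYYY format:

from datetime import datetime

# Hypothetical start and end timestamps for one benchmark job.
start = datetime.strptime("08/31/2005 14:02:11", "%m/%d/%Y %H:%M:%S")
end   = datetime.strptime("08/31/2005 15:47:39", "%m/%d/%Y %H:%M:%S")

wallclock = (end - start).total_seconds()
print("Wall clock time: %.0f s" % wallclock)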

If the end user is looking to do mainly implicit solvent simulations then
you should look at the gb_cox2 and gb_mb benchmarks. These should be run
using sander, since PMEMD does not support implicit solvent simulations. The
scaling here to large numbers of cpus may not be as good as for the explicit
solvent (PME) simulations.

There are many other things that should be considered as well. A large
number of GigE and 10GigE clusters are built with 32 port switches that are
then chained together. Note, if the chaining is NOT non-blocking (i.e. the
links between switches have less bandwidth than the sum of all the ports)
then you will see problems with scaling above 32 cpus, and also whenever the
queuing software does not force locality of MPI processes to a single
switch... This needs to be considered carefully: under ideal conditions a
GigE cluster may seem to work well, but once half of it gets loaded up the
scaling of jobs on the other half may suffer due to blocking at the
switches.

Suffice it to say, the speed and bandwidth of the backplane are the most
important things when running Amber simulations in parallel. Disk speed is
generally a secondary issue.

Also note that faster cpus will generally give worse scaling for a given
backplane, since the communication throughput remains fixed but the number
of messages that need to be sent per second goes up as the cpus get the job
done faster.
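
A toy model makes the trend clear. Assume each step costs some computation
(which is divided across the cpus) plus a fixed communication cost set by
the backplane; the numbers below are invented only to illustrate the effect:

# Toy model: per-step time = compute / ncpu + fixed communication cost.
def efficiency(compute, comm, ncpu):
    parallel = compute / ncpu + comm
    return compute / (ncpu * parallel)

comm = 0.02  # seconds of communication per step, fixed by the backplane
for compute in (1.0, 0.5):  # slower cpus vs cpus twice as fast
    eff = 100 * efficiency(compute, comm, 16)
    print("compute = %.1f s/step -> 16-cpu efficiency %.0f%%" % (compute, eff))

With these made-up numbers the faster cpus drop from roughly 76% to 61%
efficiency on 16 cpus, even though the wall clock time is still lower.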

I hope this information helps.

If you want more in-depth advice please email me directly and I can arrange
a time to talk with you.

All the best
Ross

/\
\/
|\oss Walker

| Department of Molecular Biology TPC15 |
| The Scripps Research Institute |
| Tel: +1 858 784 8889 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu