AMBER Archive (2006)

Subject: RE: AMBER: CPU utilization with Parallel Sander

From: Ross Walker
Date: Thu Mar 09 2006 - 10:23:08 CST

Hi Luke,

> I did a lamboot on my cluster on 5 nodes (headnode with
> no-schedule=1,
> each node with 4 CPUs) and decided to run the benchmark with 16 CPUs.
> disturbing thing I saw is that each sander running on each
> CPU was only
> utilizing less than 50% of each CPU cycle. This was not an
> issue when I

A couple of things:

1) You don't say what the specs of the cluster are - most importantly what
the interconnect is. If the interconnect is Ethernet (even 10 Gbps Ethernet)
then this is typical. The remaining 50% of the cpu is used by the system to
do all the TCP/IP packet handling, NFS writes of the mdcrd file, output file

2) Is top reporting the correct cpu%? I assume by 4 cpus per node you mean 4
real cpus and not 2 cpus with hyperthreading. If you mean the latter then
you can pretty much ignore the %'s top gives you for cpu usage as they
really don't make much sense in this situation. If however you have 4 real
cpus then it could be one of two things. It could be that top is reporting
the wrong percentage - some versions of top report only 25% usage for 1
processor on a 4 processor machine. Alternatively it could be a limitation
of your cluster's backplane. If the interconnect between nodes is choked
then 50% may be all that you can utilise between mpi sends.
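
To illustrate the point about top's percentages, here is a minimal Python
sketch. It assumes a top that reports usage as a fraction of the whole
machine rather than of one cpu. The function name and the figures are
hypothetical and not taken from any particular version of top.

```python
def per_cpu_percent(reported_percent, n_cpus):
    """Convert a whole-machine percentage into a per-cpu percentage.

    Assumption: top normalised the figure over all n_cpus, so one fully
    busy process on a 4-cpu node would be shown as 25%.
    """
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    return reported_percent * n_cpus


# One process shown at 25% on a 4-cpu node is really using a whole cpu.
print(per_cpu_percent(25.0, 4))
```

If your top normalises per cpu already, no conversion is needed and the
reported figure can be read directly.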

Have you tried just running a single processor test? Does this use 100% of a
cpu? Similarly, what happens with 4 cpus on a single machine (i.e. smp)? Here
you should get pretty close to 100%. Note that a factor most people don't
consider when buying clusters is the balance of backplane to cpu.
With 4 way smp nodes your effective interconnect bandwidth per cpu is a
quarter of what it would be with single cpu nodes. You need to take this
into account when designing a cluster for MD simulations. Similarly you have
4 cpus all communicating down the same interconnect at the same time.
Depending on the hardware this can cause problems due to collisions.
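
The bandwidth-per-cpu argument above can be written out as a small Python
sketch. It assumes the simplest possible model, where every cpu in a node
shares one link equally and nothing else uses it. The function name and the
unit link bandwidth are made up for illustration.

```python
def bandwidth_per_cpu(link_bandwidth, cpus_per_node):
    """Effective interconnect bandwidth available to each cpu in a node.

    Assumption: all cpus in the node share a single link equally.
    The result is in the same units as link_bandwidth.
    """
    if cpus_per_node < 1:
        raise ValueError("cpus_per_node must be at least 1")
    return link_bandwidth / cpus_per_node


# Compare single-cpu nodes with 4-way smp nodes on the same link.
single = bandwidth_per_cpu(1.0, 1)
quad = bandwidth_per_cpu(1.0, 4)
print(quad / single)
```

This ignores collisions and protocol overhead, so the real share per cpu
is likely to be lower than this estimate.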

Also, you don't mention which benchmark you tried. Typically scaling is
better for bigger systems so try some of the other benchmarks, e.g. FactorIX
instead of JAC.

> had an older cluster with an older Pathscale (version 2) as well as
> older kernel version.

What were the specs of the older cluster? Was it the same interconnect but
slower cpus? If so then this behaviour is typical. As you increase the
individual cpu speed for a given interconnect, the scaling gets worse,
although total throughput in terms of ns/day can get better.

I would also initially ignore cpu utilization and look at the raw
performance numbers, i.e. how many ns/day do you get for 1, 2, 4, 8 and 16
cpus? What is the scaling like compared to the older cluster? Also, did top
on the older cluster report user and system cpu utilisation as a single
number, whereas the new cluster doesn't?

Finally, if you find that the performance is abnormally poor on the new
cluster then it is worth investigating further. Performance issues could be
a function of a number of things besides the compiler, e.g. how mpi was
built and installed. Is it using the correct interface for communication?
For example, you may have infiniband installed but, because MPI was set up
wrongly, it is using the ethernet interconnect instead of the infiniband.
Some linux kernels can also slow things down so updating the kernel may
help. Is the performance poor for other codes? Try some of the benchmarks
that come with the MPI installation.

I hope this helps,
all the best

Ross Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- |
| | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.
