AMBER Archive (2003)

Subject: Re: AMBER: PMEMD: dual XEON vs SGI cluster

From: Robert Duke (rduke_at_email.unc.edu)
Date: Mon Sep 15 2003 - 17:55:28 CDT


Chris -
Some quick comments about timings in sander 6, sander 7, and pmemd.

For sander 6 and pmemd, the times you see at the end of the mdout printout
are mostly CPU times for the MASTER processor (task 0 under MPI), though
there are also setup and nonsetup wallclock times, which for all intents and
purposes apply to all processors in an MPI run. To get individual processor
breakdowns, you really need to look at the logfile (named logfile). It does
not give you the same granularity on the various phases of the PME nonbonded
force calcs, but it is more useful in terms of showing what is really going
on with everything else.

With sander 7, I believe virtually everything is reported in wallclock
times. I presume, but have not looked to check, that the mdout times apply
to the master; because they are wallclock times, though, the times will be
closer to the same across the various processors (and also less informative
in a sense). There is a different file, named something like prof_mpi, that
reports individual processor times. I personally dislike the sander 7
format for two reasons: 1) it is incredibly complicated, giving way more
detail than the average user wants, and 2) I think it is a really bad idea
to get rid of the combination of wallclock and CPU times. The ratio of
nonsetup CPU time to wallclock time is a very useful number; if it drops
below about 0.95 on most hardware, it is an indication of something
pathological in the interconnect, or of a basically poor interconnect (most
of the sorts of machines we would use spinlock while waiting for
interconnect communications to complete; if things switch over to blocking,
then things are bad).

So pmemd mostly looks like sander 6 in terms of the timing outputs, though I
did simplify things a bit, and the PME timing categories reflect changes in
the internal algorithms (e.g., there is a cit setup time listed that is not
present in sander 6 because there is no such thing there; in pmemd this is
an interesting number because it happens to be something that does not
scale, and therefore it defines a performance limit). This is probably more
than you wanted to know about what all the times really are.
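
To make that CPU/wallclock check concrete, here is a rough Python sketch of
the sort of thing I mean (just illustrative - the function name is mine, the
0.95 cutoff is the rule of thumb above, and the numbers are the nonsetup
figures from your SGI mdout quoted below; pmemd does not print anything like
this for you):

    # Rough sanity check on the nonsetup CPU time / wallclock time ratio.
    # A ratio much below ~0.95 usually means tasks are blocking on a poor
    # interconnect instead of spinning.
    def check_cpu_wall_ratio(nonsetup_cpu_s, nonsetup_wall_s, threshold=0.95):
        ratio = nonsetup_cpu_s / nonsetup_wall_s
        if ratio < threshold:
            print("ratio %.3f < %.2f: possible interconnect trouble" % (ratio, threshold))
        else:
            print("ratio %.3f: mostly spinning, not blocking - looks fine" % ratio)
        return ratio

    # Nonsetup CPU and wallclock seconds from the SGI run quoted below:
    check_cpu_wall_ratio(90442.02, 90968.0)   # ~0.994, no sign of trouble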

Anyway, regarding the specifics of your question,
1) The nonbond vs. bond vs. angle vs. dihedral vs. shake, etc. times you are
seeing apply only to the master.
2) The bond, angle, dihedral, and shake workload associated with each
processor in pmemd 3.00 - 3.02 (soon to be released) is fixed, but it may
differ from processor to processor in constant pressure simulations, because
workload division is molecule-based under CP, and the processors that get
your solute will have more to do. This is typically, but not always, the
master, since with the current pmemd releases the master gets the first
chunk of atoms, which is typically where the solute is. Also, you may have
more than one big molecule, which will affect the sorts of numbers you see.
3) The uneven distribution of workload for bonds, angles, dihedrals, and
shake is not a problem, because it is compensated for by dynamic load
balancing of the nonbonded force calculations.
4) So look at the logfile instead of mdout to get a better idea of what is
going on; for most machines you will see an even total workload, unevenly
distributed among the subcategories.
5) For MPICH or LAM/MPI running on Linux clusters with slow interconnects,
things may not look that even. Basically, the interconnect is not very
good, and by the time you get above 4 processors there is blocking (and CPU
and wallclock times start diverging). However, dual processors generally do
pretty well unless your problem is just huge, in which case there may be a
bit of cache contention on a dual processor.
6) You are seeing close to four times the per-processor performance from
your Linux P4s compared with the SGI, which is probably about right, given
that you have a big problem and the dual P4s probably have less total cache
(there is a back-of-the-envelope sketch of that comparison after this list).
Feel free to send me logfiles and more details about your hardware and runs,
and I will take a look and see if anything seems unusual to me. As you
scale up the Linux P4s, you should expect them not to scale as well as an
SGI Origin, especially if you use something like gigabit Ethernet.
Interconnect performance is more critical at larger atom counts; the
distributed FFT overhead is awful in any PME implementation in that regime,
though the RC FFTs used in pmemd and sander 7 help.
7) I am working on pmemd 3.1, which will be a "superscaling" version - it is
already running 50-100% faster on really big hardware. The workload
division will again look different, so some of the particulars here will
change, but it will still be best to look at the logfile to get an idea of
what is happening. I have changed the comments in mdout to indicate that
you are looking at master timings; I would consider printing average timings
if people really care, but my first implementations tried to be comparable
to sander 6 in a lot of ways.
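
To make the four-times figure in point 6 concrete, here is the
back-of-the-envelope arithmetic (a sketch only; it assumes both runs covered
the same 50 ps of the same system and ignores differences in parallel
efficiency at these low processor counts):

    # Per-processor comparison using the total times from the two mdout
    # excerpts quoted below. Purely illustrative arithmetic.
    sgi_procs, sgi_hours = 8, 25.12      # eight R12000/300 MHz, SGI Origin
    xeon_procs, xeon_hours = 2, 30.65    # dual 2.4 GHz Xeon box

    sgi_proc_hours = sgi_procs * sgi_hours       # ~201 processor-hours
    xeon_proc_hours = xeon_procs * xeon_hours    # ~61 processor-hours

    # How much faster each Xeon is than each R12000 for this job:
    print("per-processor ratio: %.1f" % (sgi_proc_hours / xeon_proc_hours))
    # -> about 3.3, i.e. "close to four times the processor performance"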

Regards - Bob Duke

----- Original Message -----
From: "Chris Moth" <Chris.Moth_at_vanderbilt.edu>
To: <amber_at_scripps.edu>
Sent: Tuesday, September 16, 2003 5:46 PM
Subject: AMBER: PMEMD: dual XEON vs SGI cluster

> We routinely perform MD calculations on large unrestrained complexes -
> well over 100,000 atoms. PME/MD has given us roughly a 25% improvement in
> computation time vs sander 7.
>
> I am writing because I believe we should see more dramatic improvements
> on our new dual Xeon hardware vs. our 8-way SGI R12000/300 MHz cluster.
> I'd appreciate advice on attacking the problem, which is as follows:
>
> Using eight R12000/300 MHz nodes of an SGI Origin system, PME/MD
> gives me 50 picoseconds in 25 hours:
>
> | Routine Sec %
> | ----------------------------
> | Nonbond 73130.29 80.85
> | Bond 186.86 0.21
> | Angle 1961.39 2.17
> | Dihedral 7660.73 8.47
> | Shake 466.78 0.52
> | F,Xdist 6102.12 6.75
> | Other 333.14 0.37
> | ----------------------------
> | Total 90448.01 25.12 Hours
>
> | Nonsetup 90442.02 99.99%
>
> | Setup wallclock 6 seconds
> | Nonsetup wallclock 90968 seconds
>
> On my new Dell dual Xeon (2.4 GHz, 1 GB RAM) desktop workstation
> running Debian Linux, I get a comparable overall time with PME/MD -
> but with intriguing differences in the Nonbond, Bond, Angle, and Dihedral
> contributions to the overall compute time:
>
> | Routine Sec %
> | ----------------------------
> | Nonbond 104180.85 94.41
> | Bond 49.40 0.04
> | Angle 426.57 0.39
> | Dihedral 2417.54 2.19
> | Shake 462.49 0.42
> | F,Xdist 1560.82 1.41
> | Other 115.01 0.10
> | ----------------------------
> | Total 110349.19 30.65 Hours
>
> | Nonsetup 110347.87 100.00%
>
> | Setup wallclock 2 seconds
> | Nonsetup wallclock 113849 seconds
>
> While Bond, Angle, and Dihedral computation times take 1/4 as long on the
> dual Xeon/Linux configuration (wow!), the Nonbond component is 40%
> _slower_.
>
> My first hunch is that mpirun for Linux may not be exchanging the
> large nonbond calculation result sets efficiently between the two
> pmemd processes.
>
> Is this hunch something that we could verify quickly with some additional
> compile-time or run-time options? Or, has anyone else had to work through
> performance penalties between PME/MD implementations on SGI vs. Linux?
>
> I do not personally build our software here at Vanderbilt - but I'd
> welcome any suggestions that I could pass along to our admin and support
> team.
>
> Thanks very much.
>
> Chris Moth
>
> chris.moth_at_vanderbilt.edu
>
> http://www.structbio.vanderbilt.edu/~cmoth/mddisplay
>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu