AMBER Archive (2005)
Subject: Re: AMBER: amber8 parallel sander
From: Robert Duke (rduke_at_email.unc.edu)
Date: Mon Jan 17 2005 - 12:13:10 CST
Yen -
Some wild guesses here. Less than 100% cpu utilization indicates that you DO
have network problems (you are blocked waiting on the network instead of doing
calcs). Seeing high utilization in poll() probably means you are wasting a lot
of your cpu time spin-locking in poll(), waiting for network communications to
complete. A good diagnostic is the time data in the logfile or mpi_profile
file; it will give you some indication of how much time is being spent in
communications. I would GUESS that you have one or all of the following going
on: 1) your system tcp/ip buffer size is small, 2) the network cards are slow,
and possibly not on a fast system bus, 3) the ethernet switch is not operating
at full duplex, and/or is basically slow.
Just because you don't appear to be pushing the net to a full Gbit/sec
(actually, with full duplex you push it this fast both ways) does not mean
that you don't have a net problem. The problem may well be that your hardware,
as configured, is not truly capable of attaining 2 Gbit/sec aggregate
throughput (i.e., both ways), or it may just need a different configuration.
IF you are seeing 100% cpu utilization, it is not in poll() and other network
calls, and the FRC_collect time in profile_mpi (or FXdist in the logfile of
pmemd) is small, then your network is doing okay.
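As a quick check of the first guess: the buffer sizes a TCP socket actually
gets can be read back with getsockopt(). The sketch below is a generic
diagnostic, not anything from sander, pmemd, or MPICH; on Solaris, link with
-lsocket -lnsl.

    /* bufcheck.c - print the default TCP socket buffer sizes the kernel
     * hands out. Generic diagnostic sketch; compile with
     *     cc -o bufcheck bufcheck.c        (Solaris: add -lsocket -lnsl)
     * If SO_SNDBUF/SO_RCVBUF come back small (tens of KB), the MPI traffic
     * has little room to stream.
     */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int sndbuf = 0, rcvbuf = 0;
        socklen_t len = sizeof(int);

        if (fd < 0) {
            perror("socket");
            return 1;
        }
        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len) == 0)
            printf("default SO_SNDBUF: %d bytes\n", sndbuf);
        len = sizeof(int);
        if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &len) == 0)
            printf("default SO_RCVBUF: %d bytes\n", rcvbuf);
        close(fd);
        return 0;
    }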
Regards - Bob
----- Original Message -----
From: yen li
To: amber_at_scripps.edu
Sent: Monday, January 17, 2005 12:55 PM
Subject: Re: AMBER: amber8 parallel sander
Hi Robert & Carlos,
Thanks for the elaborate replies.
It's true that we are sharing the switched GB ethernet for other
purposes as well. To check whether the network is the choke point, we
collected network statistics while the 16-processor benchmark was running.
We find that the network utilization never exceeds 10% on any of the
hosts, and the link is up at 1 Gbit/s. While doing this we noticed an
unusually high system-time percentage, above 50%. To find the cause, we
collected the system calls being generated by one of the 16 processes
("$> truss -p pid"). The results show that it is mostly the system call
"poll", which returns 0 (roughly 80% of the time) and the error EAGAIN
(roughly 15% of the time).
For Linux the equivalent command to truss would be "$> strace -p pid".
Can someone please suggest ways to improve the performance?
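For what it is worth, that truss output is the classic signature of a library
spin-waiting for data: poll() with a zero or near-zero timeout returns 0 when
nothing has arrived, and a non-blocking read returns EAGAIN, so the process
burns cpu in system calls while no bytes move. A stripped-down caricature of
such a loop (not MPICH's or Sun MPI's actual code) looks like this:

    /* spinwait.c - caricature of a spin-wait on a non-blocking socket,
     * to show why truss/strace reports poll() returning 0 and reads
     * failing with EAGAIN over and over while cpu usage stays high.
     * The descriptor fd is assumed to already be set O_NONBLOCK.
     */
    #include <errno.h>
    #include <poll.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Spin until data arrives on fd; returns bytes read, or -1 on error. */
    ssize_t spin_read(int fd, char *buf, size_t n)
    {
        struct pollfd pfd;
        pfd.fd = fd;
        pfd.events = POLLIN;

        for (;;) {
            int ready = poll(&pfd, 1, 0); /* 0 ms timeout: returns 0 if idle */
            if (ready < 0 && errno != EINTR)
                return -1;                /* real poll failure */
            if (ready > 0) {
                ssize_t got = read(fd, buf, n);
                if (got >= 0)
                    return got;           /* data finally arrived */
                if (errno != EAGAIN && errno != EWOULDBLOCK)
                    return -1;            /* real read failure */
                /* EAGAIN: nothing there after all; keep spinning */
            }
            /* no data yet: loop again immediately, burning cpu in syscalls */
        }
    }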
Best Regards,
Yen
> Robert Duke <rduke_at_email.unc.edu> wrote:
> Yen -
> Once again I will chime in after Carlos :-) Especially with GB ethernet,
> "your mileage may vary" as you attempt to scale up to higher processor
> count. The performance gain can quickly be inconsequential, and you can
> even lose ground due to all the parallel overhead issues. Are you running
> sander or pmemd? PMEMD is faster in these scenarios and takes less memory,
> but it only does a subset of the things that sander does, and still cannot
> overcome all the inadequacies of GB ethernet. So what are the major factors
> that determine how well things work? In a nutshell, they are:
> 1) Problem size. As you have more atoms, it takes more memory on each node
> for the program AND for MPI communications. I cannot get sander 8 to run at
> all on 4 pentium nodes/GB ethernet/mpich for the rt polymerase problem,
> which is ~140K atoms. The atoms are constrained, which actually makes things
> worse in some regards (extra data). It runs fine for sander 8 on 1 and 2
> processors. It runs fine for pmemd on 1, 2, and 4 pentium/GB nodes. The
> really important problem in all this is the distributed fft transpose for
> reciprocal space calcs. That quickly swamps a GB interconnect for large
> problem size.
> 2) SMP contribution. An aspect of pentium GB ethernet setups that is often
> not emphasized is that dual pentium cpu's (they share memory) are often
> used. Now this is bad from a standpoint of memory cache utilization,
> because each processor has less cache to work with, and depending on the
> hardware there are cache coherency issues. But from an interconnect
> standpoint it is great, because mpi running over shared memory is
> significantly faster than mpi over GB ethernet. So if you have a collection
> of uniprocessors connected via GB ethernet, I would not expect much.
> 3) Network hardware configuration. Are the network interface cards on the
> ultrasparc's server-grade, i.e., capable of running full-bandwidth,
> full-duplex GB ethernet without requiring significant cycles from the cpu?
> If not, then things won't go as well. Server NICs typically cost more than
> twice as much as workstation-grade NICs. How about the ethernet switch?
> Cheap ones DO NOT work well at all, and you will have to pay big bucks for
> a good one (folks can chime in with suggestions here; I run two dual
> pentiums connected with a XO cable so there is no switch overhead). Cables?
> Well, think CAT 5e or 6, I believe. This is getting to be more commonly
> available; just be sure your cables are rated for 1 Gbit/sec and not
> 100 Mbit/sec.
> 4) MPI configuration. I don't mess with suns, so I have no idea if they
> have their own mpi, or if you are running mpich or whatever. If you are
> running mpich, there are ways to screw up the s/w configuration, not
> allowing for shared memory use, not giving mpich enough memory to work
> with, etc. There are some notes on the amber web site, probably by me,
> Victor Hornak, et al.
> 5) System configuration. For linux, there are networking config issues
> controlling how much buffer space is set aside for tcp/ip communications.
> This can have a big effect on how fast communications actually are, and
> typically you sacrifice system memory for speed. See the amber web page for
> notes from me relevant to linux (see also the sketch after this list for
> checking the relevant kernel limits); once again I have no idea what is
> required for sun hw/sw.
> 6) Other network loading issues. Is the GB ethernet you use a dedicated GB
> ethernet, with no traffic for internet, other machines, NFS, etc.? Is
> anyone else using other nodes at the same time (i.e., perhaps several mpi
> jobs running over the same GB ethernet)? If there is any other network load
> whatsoever, your performance will be worse, and it may be substantially
> worse.
>
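On the linux side of point 5, the kernel limits in question live under
/proc/sys. The little sketch below just prints the standard 2.4/2.6-era
tunables so you can see whether they are still at their small defaults; it is
linux-only and generic, not part of amber or mpich. On Solaris 8 the rough
equivalents are the ndd /dev/tcp tunables.

    /* tcpbuf.c - print the Linux kernel's TCP buffer limits from /proc/sys.
     * Linux-only diagnostic sketch; the paths are the usual 2.4/2.6 ones.
     */
    #include <stdio.h>

    int main(void)
    {
        const char *files[] = {
            "/proc/sys/net/core/rmem_max",  /* largest receive buffer a socket may ask for */
            "/proc/sys/net/core/wmem_max",  /* largest send buffer a socket may ask for */
            "/proc/sys/net/ipv4/tcp_rmem",  /* min / default / max TCP receive buffer */
            "/proc/sys/net/ipv4/tcp_wmem",  /* min / default / max TCP send buffer */
        };
        char line[256];
        size_t i;

        for (i = 0; i < sizeof(files) / sizeof(files[0]); i++) {
            FILE *fp = fopen(files[i], "r");
            if (fp == NULL) {
                printf("%s: not present on this kernel\n", files[i]);
                continue;
            }
            if (fgets(line, sizeof(line), fp) != NULL)
                printf("%-32s %s", files[i], line);
            fclose(fp);
        }
        return 0;
    }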
> What is the best you can expect? Well, my latest performance work for pmemd
> (not yet released version) yields the following throughput for my two dual
> pentiums (3.2 GHz), XO GB-connected, running 90,906 atoms, constant
> pressure, particle mesh ewald:
>
> # proc   psec/day
>
>    1        95
>    2       155
>    4       259
>
> Now, you will note that the scaling is not great, and this is about as good
> as it gets for this kind of hardware. This IS a large problem (91K atoms),
> and you should do significantly better on scaling if your problem is
> smaller. By the way, comparison numbers for 1 and 2 procs, this same
> hardware, pmemd 8 and sander 8, are:
>
> # proc   psec/day, pmemd 8   psec/day, sander 8
>
>    1           76                  54.5
>    2          121                  88
>
> Now I don't have any data for 8 and 16 processors here, simply because I no
> longer have access to that type of hardware in reasonable condition. A
> while back I did runs on 2.4 GHz blades at UNC for pmemd 3.0-3.1, and was
> seeing numbers like this (abstracted from the pmemd 3.0 release notes; there
> is lots of good info about performance on various systems in the various
> pmemd release notes, available on the amber web site). NOTE that I had
> exclusive access to the blade cluster (no other jobs running on it), and the
> GB ethernet was dedicated to mpi, not shared for internet, etc.:
>
> *******************************************************************************
>  LINUX CLUSTER PERFORMANCE, IBM BLADE XEON 2.4 GHZ, GIGABIT ETHERNET
> *******************************************************************************
> The mpi version was MPICH-1.2.5. Both PMEMD and
> Sander 7 were built using the Intel Fortran Compiler.
>
> 90906 Atoms, Constant Pressure Molecular Dynamics (Factor IX)
>
> #procs   PMEMD      Sander 6   Sander 7
>          psec/day   psec/day   psec/day
>    2        85         46         59
>    4       153         71         99
>    6       215         ND         ND
>    8       272        114        154
>   10       297         ND         ND
>   12       326        122         ND
>   14       338         ND         ND
>   16       379        127        183
>
> There are reasons that people spend those large piles of money for
> supercomputers. PMEMD runs with pretty close to linear scaling on hardware
> with a fast interconnect out to somewhere in the range of 32-64 processes,
> and is usable (greater than 50% scaling) at 128 processors and beyond. I
> can get close to 7 nsec/day for the problem above on that sort of hardware
> (once again, an unreleased version of pmemd, but you will see it in the
> future).
>
> Regards - Bob Duke
>
>
> ----- Original Message -----
> From: yen li
> To: amber_at_scripps.edu
> Sent: Wednesday, January 12, 2005 8:54 AM
> Subject: Re: AMBER: amber8 parallel sander
>
>
> Hi,
> Thanks Robert & Carlos for the clarifications.
>
> I have one more related doubt. I also timed the same simulations for the 4
> cases, namely 1, 4, 8 & 16 processors. I find that it is fastest for 4 and
> slower for 1, 8 & 16. I can understand the single-processor case, but cannot
> understand it getting slower for an increased number of processors.
>
> All the processors are of the same make (Sun UltraSparc III+), same
> OS (Solaris 8), same amount of RAM (1 GB each), and connected over a
> 1 Gbit/s network.
>
> Thanks
>
>
> Robert Duke wrote:
> Yen -
> As Carlos says, this is expected. The reason is that when you parallelize
> the job, the billions of calculations done occur in different orders, and
> this introduces different rounding errors. With pmemd, you will even see
> differences with the same number of processors, and this is because there
> is dynamic load balancing of the direct force workload (i.e., if one
> processor is taking more time to do the direct force calcs, it will be
> assigned fewer atoms to work on). You have to remember that the internal
> numbers used have way more precision than is justified by our knowledge of
> the parameters, or for that matter by how well the method represents
> reality, and that any one run represents just one of many possible
> trajectories.
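The rounding argument is easy to demonstrate: floating-point addition is not
associative, so accumulating the same forces in a different order (which is
exactly what a different processor count or a rebalanced workload produces)
gives a slightly different total, and MD then amplifies that tiny difference
over the course of the trajectory. A trivial illustration, not code from
sander or pmemd:

    /* fpsum.c - summation order changes a floating-point result. */
    #include <stdio.h>

    int main(void)
    {
        double big = 1.0e16, small = 0.7;

        /* Add the small terms one at a time: each 0.7 is below half a unit
         * in the last place of 1e16 and is rounded away. */
        double forward = ((big + small) + small) - big;

        /* Combine the small terms first: their sum (about 1.4) is large
         * enough to round 1e16 up to the next representable double. */
        double reverse = (big + (small + small)) - big;

        printf("one at a time:  %.1f\n", forward);  /* prints 0.0 */
        printf("combined first: %.1f\n", reverse);  /* prints 2.0 */
        return 0;
    }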
> Regards - Bob Duke
> ----- Original Message -----
> From: yen li
> To: amber_at_scripps.edu
> Sent: Wednesday, January 12, 2005 5:57 AM
> Subject: AMBER: amber8 parallel sander
>
>
> Hello amber,
> I am testing the parallel version of amber8. I am running an md simulation
> on a small protein.
> I am testing the calculations on four, eight and sixteen processors. My
> problem is that the energy values in the output files are initially the
> same, but as the simulation proceeds the values start to diverge, and the
> differences become large. Is this kind of behaviour OK, or do I need to
> take care of some parameters?
> Thanks
>
>
>
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu