AMBER Archive (2007)

Subject: Re: AMBER: Small scale compute environments

From: Robert Duke (rduke_at_email.unc.edu)
Date: Mon May 14 2007 - 10:59:50 CDT


Hi Folks,
Bud invited me to comment on Ross' comments, so I have embedded a few below.
Ross and I are pretty much simpatico on most computer issues; Ross probably
has a better idea about what is coming down the pipe in terms of new
technology because of his position at a supercomputer center and a broader
base of folks to share opinions/experiences with. I just have a lot of
experience trying to get things to work. Anyway, comments embedded below
between ==> and <==.
Regards - Bob Duke

----- Original Message -----
From: "M. L. Dodson" <mldodson_at_houston.rr.com>
To: <amber_at_scripps.edu>
Sent: Saturday, May 12, 2007 3:13 PM
Subject: AMBER: Small scale compute environments

> Hello Ambers,
>
> I am about to purchase two new compute nodes for my organization.
> I asked Ross Walker to comment on compute nodes for small scale
> Amber simulation environments. In particular, I asked him to
> comment on two Intel quadcore nodes connected by GB ethernet in a
> crossover cable configuration (8 cores total). This is a summary
> of his response. I hope it will be useful for other individual
> investigator Amber environments. I will, in some cases, be leaving
> out the technical rationale he gave for his positions. Contact me
> directly by email, and I will forward his whole response email.
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Hi Bud,
>
>> I'm thinking of two 2.4GHz Intel Core 2 Quad motherboards with
>> GB ethernet talking to each other via a crossover cable. What
>> is your feeling about Core 2 Duo CPUs, Core 2 Quad CPUs, etc.
>
> I have not looked directly into the quad core chips yet. My
> initial instincts are that they will be sorely lacking in memory
> bandwidth. So for reasonably small systems, say 30K atoms or less,
> they will likely scream, but for > 30K atoms the performance, while
> better than 2 cores, will not be as great as expected. That said,
> for a 4-way calculation you should get pretty good scaling.
==>
The whole multicore phenomenon is a solution to problems in the commodity
market, with heavy emphasis on things like gaming and reducing energy
costs/heating problems, and with no real intention to seriously consider
what happens when you string a bunch of these things together. I have not
looked hard at the numbers, but I believe there is more shared cache
available, which is a good thing. Ross has emphasized that nothing has been
done to increase bandwidth to main memory while the core count has gone up.
I would presume Intel will have to respond to this architectural imbalance
at some point (which may not help us at all for a couple of years or more).
Because there is more cache (my interpretation), these new chips actually
do better than one would expect just looking at the relative clock rates,
but the net effect of moving to them is to slow down unless you are dealing
with small pme problems at lower total processor counts, or with
generalized Born (I don't know whether faster transcendentals help
generalized Born; it could be a simple caching effect on a smaller problem
too). So one thing to remember is that more cache DOES help with bandwidth
problems to main memory, but at some point limited memory access will kill
you. From my perspective, the real disaster comes in the form of slowing
the chips down.
The new quad-core, dual-cpu nodes (at least the ones we have at UNC) run at
~2.4 GHz; the old dual-cpu nodes ran at 3.6 GHz. So if throughput related
directly to clock speed, you would expect 2/3 of the previous throughput
for a given cpu count, and you would not expect to be able to run more
processors than you previously could, because the interconnect has not been
substantially upgraded (some of the interconnect communication is now
intranode, which could help a little, but the reality seems to be that it
does not help much compared to infiniband). Well, at UNC on our "upgraded"
machine, what I see before and after the upgrade is this:

Factor ix - ~91K atoms, pme (my npt version with trajectory writing)

 #proc   % of previous performance

     8      90%
    16      82%
   128      79%

JAC - ~23K atoms, pme, nve, no trajectory

 #proc   % of previous performance

     8     105%
    16     106%
    32     105%
    96      91%

generalized Born, COX2 (not my data)

 #proc   % of previous performance

     8     163%
    16     161%
   128     159%

So the generalized Born situation looks very good, but generalized Born is
not interconnect constrained, and is probably managing to run more
effectively "in cache" because the cache is bigger. JAC does okay at low
processor count, but it is a small pme problem. A pme problem of any real
size suffers almost immediately from the "upgrade", and you bottleneck at
about the same number of cpu's, so you can't just run more of them.

This is all very preliminary data, but for me the biggest surprise is that
generalized Born does so well. But there are questions about the quality of
generalized Born results, and we all want to run larger and larger pme
problems, so I see the above benchmarks as basically "not good news".
<==
>
> As for a crossover cable, Bob Duke, who writes PMEMD, swears by
> it, although I don't think he has tried it with dual quad cores.
> That said, he has certainly done it with 2 machines, 2x dual core
> each, so 4 to 4, and it works okay, so 2x quad core shouldn't be
> bad. I don't have explicit numbers for a crossover cable, but from
> some recent things I have been finding out, it may be pretty good
> even with 4 cpus, as long as it is definitely a crossover cable.
> It seems that the diabolical performance these days with gigabit
> ethernet may not simply be because we have maxed out the
> bandwidth, but that the cpus are now fast enough to overload the
> switch... [rationale elided] The issue is one of fundamental
> design. The reason infiniband is so much better is not so much
> that it has a higher bandwidth, but that it does a
> transaction-based flow control mechanism that guarantees a packet
> can never be lost due to lack of buffer space at the receiver or
> through collision. Plus, if a packet is lost due to signal
> degradation, with infiniband the retransmit time is on the order
> of ms, as opposed to 2 seconds with ethernet....

==>
Okay, I used to know a lot about things like ethernet and tcp/ip. For
various reasons it is really not the interconnect of choice, even at 1
gbit/sec, full duplex (effectively 2 gbit/sec). All I have really tried is
2 dual-cpu (NOT dual-core cpu) 3.2 GHz systems interconnected with an XO
(crossover) cable, with server net cards on a pci bus (reasonable bus
speeds, I don't remember exactly what right now). The reasons I liked the
XO were 1) it was cheap for this simple setup, 2) cheap ethernet hubs
clearly did a really lousy job (my first setup used one - you actually lost
ground as you parallelized), and 3) it was neither practical nor necessary
for me to reach for higher cpu counts - I mostly use this stuff for
preliminary dev and test. I have published results for this system for both
amber 8 and amber 9. Here they are:

Intel Xeon gigabit ethernet cluster - FACTOR IX - NPT ensemble, PME,
90,906 atoms

 #procs   nsec/day   scaling, %

      1      0.116       --
      2      0.182      100
      4      0.293       80

*******************************************************************************

Intel Xeon gigabit ethernet cluster - JAC - NVE ensemble, PME, 23,558 atoms

 #procs   nsec/day   scaling, %

      1      0.254       --
      2      0.432      100
      4      0.702       81
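
A side note on reading these tables: the scaling column appears to be
normalized at the 2-proc run, since 2 procs is listed as 100%. A minimal
sketch of the check, in Python, using the nsec/day figures above:

# Parallel efficiency relative to the 2-proc run (2 procs = 100%):
# (rate_n / rate_2) / (n / 2).
def efficiency(rate_n, n, rate_2):
    return 100.0 * (rate_n / rate_2) / (n / 2.0)

factor_ix = {1: 0.116, 2: 0.182, 4: 0.293}    # nsec/day, from above
jac       = {1: 0.254, 2: 0.432, 4: 0.702}

for name, data in (("FACTOR IX", factor_ix), ("JAC", jac)):
    for n in (2, 4):
        print("%s, %d procs: %.0f%%"
              % (name, n, efficiency(data[n], n, data[2])))
# Prints 100/80 for FACTOR IX and 100/81 for JAC, matching the tables.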

I would expect you may still do okay at 8 cpu's, especially since they are
slower cpu's, but in general I don't consider gigabit ethernet to be all
that practical beyond 8-12 cpu's for larger pme problems. I think that on
slower setups without switches, and with proper OS setup, the performance
limitations are more related to simple bandwidth issues than anything else,
but I don't have my numbers to support this right at hand (I have done a
bunch of calcs of what to expect with large pme problems, but it has all
been back-of-envelope stuff).

The basic ethernet protocol is csma/cd (carrier sense multiple access with
collision detection). What this means is that everyone listens for traffic
all the time and everybody picks up all the incoming packets, decoding the
header to determine the destination (this could actually be exploited for
broadcast, but it isn't). If two or more nodes attempt to send at the same
time, there is a random exponential backoff and retry protocol - this
really does not take much time as long as we are not talking really high
node counts on an ethernet segment (the collision probability is low in the
first place). IF the OS tcp/ip buffer sizes have been pushed pretty high,
then there will be buffer space for incoming packets, and stuff should
rarely get dropped on the floor.

Put a cheap switch in the mix, though, and all bets are off on propagation
delays and the frequency of resends required because the switch buffers
overflowed. So if you want to use gigabit ethernet and want more than 8
cpu's, then you had better spend some money. I don't know what infiniband
costs, but it does so much better that I cannot imagine anyone with the
sort of problem we have not seeing the wisdom in having it (it all depends:
do you need hundreds of poorly communicating nodes, or can you get more
done with 128 nodes that can run pme with better than 50% scaling?).
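
Coming back to the buffer-size point: if you want to see where your OS
tcp/ip limits currently sit, a minimal sketch (Linux-specific; these /proc
paths are the standard ones, but treat the whole thing as illustrative
rather than a tuning recipe):

# Read the kernel's socket buffer limits from /proc (Linux).
# net.core.{r,w}mem_max cap what a socket may request; tcp_{r,w}mem
# hold the min/default/max used by TCP autotuning.
paths = [
    "/proc/sys/net/core/rmem_max",
    "/proc/sys/net/core/wmem_max",
    "/proc/sys/net/ipv4/tcp_rmem",
    "/proc/sys/net/ipv4/tcp_wmem",
]
for p in paths:
    try:
        with open(p) as f:
            print("%-32s %s" % (p, f.read().strip()))
    except IOError:
        print("%-32s (not readable here)" % p)

If the maxima are down at the default couple hundred KB, pushing them up
(via sysctl) is the cheap way to make sure bursts of incoming packets get
buffered instead of dropped on the floor.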
<==
>
> [summary elided]
>
> You may have to tweak the software network settings a bit to get
> the best performance, but it should also be pretty good out of the
> box - although I can't guarantee this since I haven't tried it in
> about 4 years... The advantage of working at a supercomputer
> center is that I don't have to worry about cheap interconnects
> anymore ;-)...
>
>> I don't want to step up to really high speed interconnects
>> because of the cost considerations. Ditto going through a
>> switch. That is why I am thinking about limiting myself to two
>> physical boxes.
>
> Yes, this will save you quite a bit. One thing you might want to
> consider as well is seeing how much a dual quad-core box would
> cost you - that way you could get 8 procs in one box, and the
> performance would be better than two 4-proc boxes.

==>
Based on no direct experience whatsoever, I agree (in other words, I have
not configured such a box myself). But if you look at the 8 cpu numbers on
the UNC machine above, what you basically have is quad-core, dual-cpu boxes
interconnected via infiniband. So the performance you see there for 8 cpu's
is what you should get from a standalone 4x2-core box. It is not that these
quad-core dual cpu's are so great; it's that gigabit ethernet is so bad.
Being momentarily optimistic, I think this is one potential gain we will
see in the next couple of years - 8 cpu configs doing mpi via shared memory
- not blazing fast, but faster than the ethernet alternative, with a
smaller footprint and less power drain. So if a lab really has problems
that can be handled by 8 cpu's, I think it is a good solution.
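
If you want to measure the shared-memory-vs-ethernet gap directly, a
ping-pong test between two ranks is the standard probe. A minimal sketch
using mpi4py (an assumption on my part that you have it plus an MPI stack
installed; run it once with both ranks on one box and once across the XO
cable):

# Ping-pong between ranks 0 and 1; the median round trip is a direct
# measure of interconnect latency (shared memory vs. TCP over ethernet).
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
msg = bytearray(8 * 1024)          # 8 KB message
times = []

for _ in range(1000):
    comm.Barrier()
    t0 = time.time()
    if rank == 0:
        comm.Send(msg, dest=1)
        comm.Recv(msg, source=1)
        times.append(time.time() - t0)
    elif rank == 1:
        comm.Recv(msg, source=0)
        comm.Send(msg, dest=0)

if rank == 0:
    times.sort()
    print("median round trip: %.1f usec" % (1e6 * times[len(times) // 2]))

Launch with something like "mpirun -np 2 python pingpong.py"; intranode the
MPI library should route it through shared memory, while across the cable
it goes through the full tcp/ip stack.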
<==
>
>> 8GB of memory, BTW.
>
> That's good - although the thing to remember here is memory per
> core, so 8 GB is really only 2 GB per core. That said, that is a
> ton for MD - sander typically never needs more than about 2 GB
> total, even for the largest of systems.
==>
Because you have to run a bunch of cpu's anyway to get much speed with MD,
and because one of the main memory costs is the pairlist, which divides
nicely as you add cpu's, I have always found 1 GB/core adequate for pretty
much anything with pmemd. For sander you may need more; I defer to Ross on
QM/MM, NEB, and whatever else - these things could eat some memory, I
suppose. I would get good memory chips at a good price point, 1-2 GB per
core max, but that view is influenced by what I do.
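
The arithmetic behind that is simple enough to sketch (back-of-envelope;
the per-atom pairlist figures below are illustrative guesses, not measured
pmemd numbers):

# Back-of-envelope: does 1 GB/core cover the pairlist for a big pme run?
total_mem_gb = 8.0
cores = 8
atoms = 91000                 # a Factor ix-sized problem
pairs_per_atom = 400          # rough guess for a typical cutoff
bytes_per_pair = 4            # one 32-bit index per pair

pairlist_gb = atoms * pairs_per_atom * bytes_per_pair / 1024.0 ** 3
print("whole pairlist:  %.2f GB" % pairlist_gb)              # ~0.14 GB
print("per core (of 8): %.3f GB" % (pairlist_gb / cores))
print("available/core:  %.1f GB" % (total_mem_gb / cores))

Even with generous guesses, the divided pairlist is tiny next to 1 GB/core,
which is why 1-2 GB/core covers pretty much anything pmemd will throw at
it.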
<==
>
>> My reading of the benchmarking posts indicates I should scale
>> reasonably well over the quad CPU cores, with quite a bit of
>> falloff when going over the GB EN.
>
> Yes, but that is for a switch - which almost certainly (in
> hindsight) had flow control turned off. I would expect at least a
> 5 to 6 times speedup for 8 cpus with a crossover cable. Plus, for
> regular MD you should use PMEMD, which will both run faster and
> scale better.
>
>> Job mix would be PMEMD classical MD, steered MD (classical and
>> QMMM) and NEB (classical and QMMM), mostly. I understand QMMM
>> will not be very accelerated.
>
> That depends on the ratio of QM to MM. If you are doing pure QM,
> at present you'll see pretty much no speedup. However, everything
> in the QM code is parallel in Amber 9, with the exception of the
> matrix diagonalization and the density build. So you can get an
> idea of how well it will scale by looking at the percentage of
> time that these two operations take... [rationale elided]
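==>
Ross's reasoning here is just Amdahl's law: if a fraction s of the runtime
is serial (here, the diagonalization plus the density build), the best
possible speedup on n cpus is 1/(s + (1-s)/n). A quick illustration (the
20% serial fraction is made up, purely to show the shape of the curve):

# Amdahl's law: best-case speedup on n cpus with serial fraction s.
def amdahl(s, n):
    return 1.0 / (s + (1.0 - s) / n)

s = 0.20                      # e.g. diag + density build at 20% of runtime
for n in (2, 4, 8, 16):
    print("%2d cpus: %.2fx (ceiling %.1fx)" % (n, amdahl(s, n), 1.0 / s))
# With a 20% serial fraction you top out at 5x, no matter the cpu count.
<==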
>
> Note NEB should scale really well as replicas get shared out
> across processors. So even a pure QM NEB calculation with 32
> replicas should show close to 8 times speedup on 8 cpus. Amber 10
> will have an even better parallel NEB implementation - I am
> currently testing this and it will likely scale to 2048+ cpus :-)
>
> [elided]
>
> All the best
> Ross
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> Hope this is as helpful to others as it was for me. Thanks to Ross
> for sharing his insights with me and with the list.
>
> Bud Dodson
> --
> M. L. Dodson
> Email: mldodson-at-houston-dot-rr-dot-com
> Phone: eight_three_two-five_63-386_one
>

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu