AMBER Archive (2007)

Subject: Re: AMBER: Small scale compute environments

From: M. L. Dodson (mldodson_at_houston.rr.com)
Date: Mon May 14 2007 - 13:32:36 CDT


Robert Duke wrote:
> Hi Folks,
> Bud invited me to comment on Ross' comments, so I have embedded a few
> below. Ross and I are pretty much sympatico on most computer issues;
> Ross probably has a better idea about what is coming down the pipe in
> terms of new technology because of his position at a supercomputer
> center and a broader base of folks to share opinions/experiences with.
> I just have a lot of experience trying to get things to work. Anyway,
> comments embedded below between ==> and <==.
> Regards - Bob Duke
>
> ----- Original Message ----- From: "M. L. Dodson" <mldodson_at_houston.rr.com>
> To: <amber_at_scripps.edu>
> Sent: Saturday, May 12, 2007 3:13 PM
> Subject: AMBER: Small scale compute environments
>
>
>> Hello Ambers,
>>
>> I am about to purchase two new compute nodes for my organization.
>> I asked Ross Walker to comment on compute nodes for small scale
>> Amber simulation environments. In particular, I asked him to
>> comment on two Intel quadcore nodes connected by GB ethernet in a
>> crossover cable configuration (8 cores total). This is a summary
>> of his response. I hope it will be useful for other individual
>> investigator Amber environments. I will, in some cases, be leaving
>> out the technical rationale he gave for his positions. Contact me
>> directly by email, and I will forward his whole response email.
>>
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Hi Bud,
>>
>>> I'm thinking of two 2.4GHz Intel Core 2 Quad motherboards with
>>> GB ethernet talking to each other via a crossover cable. What
>>> is your feeling about Core 2 Duo CPUs, Core 2 Quad CPUs, etc.
>>
>> I have not looked directly into the quad core chips yet. My
>> initial instincts are that they will be sorely lacking in memory
>> bandwidth. So for reasonably small systems, say 30K atoms or less,
>> they will likely scream, but for > 30K atoms the performance,
>> while better than 2 cores, will not be as great as expected. That
>> said, for a 4-way calculation you should get pretty good scaling.
> ==>
> The whole multicore phenomenon is a solution to problems in the
> commodity market, with heavy emphasis on things like gaming and reducing
> energy costs/heating problems, and with no real intention to seriously
> consider what happens when you string a bunch of these things together.
> I have not looked hard at the numbers, but I believe there is more
> shared cache available, which is a good thing. Ross has emphasized that
> nothing has been done to increase bandwidth to main memory, while the
> core count has gone up. I would not be greatly surprised, and indeed
> would presume, that Intel will have to respond to this architectural
> imbalance at some point (which may not help us at all for a couple of years
> or more). Because there is more cache (my interpretation), these new
> chips actually do better than one would expect just looking at the
> relative clock rates, but the net effect of moving to these new chips is
> to slow down unless you are dealing with small pme problems on lower
> total processor counts or generalized Born (I don't know if the
> transcendentals are faster and that helps generalized Born; it could be
> a simple caching effect on a smaller problem too). So one thing to
> remember is that more cache DOES help with bandwidth problems to main
> memory, but at some point limited memory access will kill you. From my
> perspective, the real disaster comes in the form of slowing the chips
> down. The new quad core dual cpu chips (at least the ones we have at
> UNC) are running at ~2.4 GHz. The old dual cpu chips were running at
> 3.6 GHz. So if throughput relates directly to clock speed you would
> expect 2/3 of the previous throughput for a given cpu count, and you
> would not expect to be able to run more processors than you previously
> could because the interconnect has not been substantially upgraded (some
> of the interconnect communication is now intranode, so that could help
> a little, but the reality seems to be that it does not help much
> compared to infiniband). Well, at UNC, on our "upgraded" machine, what I
> see before and after the upgrade is this:
>
> Factor IX - ~91K atoms, pme (my npt version with trajectory writing)
>
>  #proc   % of previous performance
>      8        90%
>     16        82%
>    128        79%
>
> JAC - ~23K atoms, pme, nve, no trajectory
>
>  #proc   % of previous performance
>      8       105%
>     16       106%
>     32       105%
>     96        91%
>
> generalized Born, COX2 (not my data)
>
>  #proc   % of previous performance
>      8       163%
>     16       161%
>    128       159%
>
> So the generalized Born situation looks very good, but generalized Born
> is not interconnect constrained, and is probably managing to run more
> effectively "in cache" because the cache is bigger. JAC does okay at
> low processor count, but it is a small pme problem. A pme problem of
> any size suffers almost immediately from the "upgrade", and you
> bottleneck at about the same number of cpu's, so you can't just run more
> cpu's.
>
> This is all very preliminary data, but for me the biggest surprise is
> that generalized Born does so well. But there are questions about
> the quality of results for generalized Born, and we all want to run
> larger and larger pme problems, so I see the above benchmarks as
> basically "not good news".
> <==
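
To put some numbers on Bob's clock-rate argument above, here is a quick
back-of-envelope check of my own (only the 2.4 and 3.6 GHz figures and the
Factor IX percentages are taken from his message; the rest is illustration,
written as a small Python script):

    # Expected throughput after the "upgrade" if performance tracked
    # clock rate alone (Bob's premise), ignoring cache and memory effects.
    old_clock = 3.6   # GHz, old dual-cpu nodes
    new_clock = 2.4   # GHz, new quad-core nodes
    expected = new_clock / old_clock * 100.0   # ~67% of previous throughput

    # Measured "% of previous performance" for Factor IX (table above).
    measured = {8: 90.0, 16: 82.0, 128: 79.0}

    print("expected from clock rate alone: %.0f%%" % expected)
    for nproc in sorted(measured):
        # Anything above 'expected' is presumably the larger cache helping.
        print("%4d cpus: measured %.0f%% (about %.0f points above clock-only)"
              % (nproc, measured[nproc], measured[nproc] - expected))

The measured numbers sit 12-23 points above the clock-only estimate, which is
consistent with Bob's reading that the larger cache helps but does not rescue
the larger pme runs.
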
>>
>> As for a crossover cable, Bob Duke, who writes PMEMD, swears by it,
>> although I don't think he has tried it with dual quad cores. That
>> said, he has certainly done it with 2 machines, each dual core, so
>> 4 to 4, and it works okay, so 2x quad core shouldn't be bad. I don't
>> have explicit numbers for a crossover cable, but from some recent
>> things I have been finding out it may be pretty good even with 4
>> cpus as long as it is definitely a crossover cable. It seems that
>> the diabolical performance these days with gigabit ethernet may
>> not simply be because we have maxed out the bandwidth but that the
>> cpus are now fast enough to overload the switch... [rationale
>> elided] The issue is one of fundamental design. The reason
>> infiniband is so much better is not so much that it has higher
>> bandwidth but that it uses a transaction-based flow control
>> mechanism that guarantees a packet can never be lost due to lack
>> of buffer space at the receiver or through collision. Plus, if a
>> packet is lost due to signal degradation, with infiniband the
>> retransmit time is on the order of ms as opposed to 2 seconds with
>> ethernet....
>
> ==>
> Okay, I used to know a lot about things like ethernet and tcp/ip. For
> various reasons it is really not the interconnect of choice, even at 1
> gbit/sec, full duplex (effectively 2 gbit/sec). All I have tried,
> really, is 2 dual cpu (NOT dual core cpu) 3.2 GHz systems interconnected
> with an XO cable with server net cards on the pci bus (reasonable bus
> speeds, I don't remember exactly what right now). The reasons I liked the
> XO were 1) it was cheap for this simple setup, 2) cheap ethernet hubs
> clearly did a really lousy job (my first setup used one - you actually
> lost ground as you parallelized), and 3) it was neither practical nor
> necessary for me to reach for a higher cpu count - I mostly use this stuff
> for preliminary dev and test. So I have published results for this
> system in both amber 8 and amber 9. Here they are:
>
> Intel Xeon gigabit ethernet cluster - FACTOR IX - NPT ensemble, PME,
> 90,906 atoms
>
>  #procs   nsec/day   scaling, %
>
>     1       0.116        --
>     2       0.182       100
>     4       0.293        80
>
> *******************************************************************************
>
>
> Intel Xeon gigabit ethernet cluster - JAC - NVE ensemble, PME, 23,558 atoms
>
>  #procs   nsec/day   scaling, %
>
>     1       0.254        --
>     2       0.432       100
>     4       0.702        81
>
> I would expect you may still do okay at 8 cpu's, especially since they
> are slower cpu's, but in general I don't consider gigabit ethernet to be
> all that practical beyond 8-12 cpu's for larger pme problems. I think
> that on slower setups without switches and with proper OS setup, the
> performance limitations are more related to simple bandwidth issues than
> anything, but I don't have my numbers to support this right at hand (I
> have done a bunch of calcs of what to expect with large pme problems,
> but it has all been back-of-envelope stuff). The basic ethernet
> protocol is csma/cd (carrier sense multiple access with collision
> detection). What this means is that everyone listens for traffic all
> the time and everybody picks up all the incoming packets, decoding the
> header to determine destination (this could actually be exploited for
> broadcast, but it isn't). If two or more nodes attempt to send at the
> same time, there is a random exponential backoff and retry protocol -
> this really does not take much time as long as we are not talking really
> high node count on an ethernet segment (collision probability is low in
> the first place). IF the OS tcp/ip buffer sizes have been pushed pretty
> high, then there will be buffer space for incoming packets, and stuff
> should rarely get dropped on the floor. Put a cheap switch in the mix,
> though, and all bets are off on propagation delays and the frequency of
> resends required because the switch buffers overflowed. So if you want
> to use gigabit ethernet and want more than 8 cpu's, then you had better
> spend some money. I don't know what infiniband costs, but it does so
> much better that I cannot imagine anyone with the sort of problem we
> have failing to see the wisdom in having it (it all depends: do you need
> hundreds of poorly communicating nodes, or can you get more done with
> 128 nodes that can run pme with better than 50% scaling?).
> <==
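
On Bob's point about pushing the OS tcp/ip buffer sizes up: for my own
reference, this is roughly what the per-socket knob looks like. It is only a
minimal sketch, assuming Linux and plain TCP; the 4 MB request is arbitrary,
and the kernel silently caps it at its configured maximum (e.g.
net.core.rmem_max), so the system-wide limits are usually what need raising.

    import socket

    # Ask for larger per-socket send/receive buffers (illustrative value).
    WANTED = 4 * 1024 * 1024

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, WANTED)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, WANTED)

    # Read back what the kernel actually granted (Linux typically reports
    # double the request to account for bookkeeping overhead).
    print("receive buffer granted: %d bytes"
          % s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
    print("send buffer granted:    %d bytes"
          % s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
    s.close()

None of this is specific to PMEMD or MPI; it is just a way to see whether the
buffer limits on a given box are anywhere near what a gigabit link can keep
in flight.
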
>>
>> [summary elided]
>>
>> You may have to tweak the software network settings a bit to get
>> the best performance, but it should also be pretty good out of the
>> box - although I can't guarantee this since I haven't tried it in
>> about 4 years... The advantage of working at a supercomputer center
>> is that I don't have to worry about cheap interconnects anymore
>> ;-)...
>>
>>> I don't want to step up to really high speed interconnects
>>> because of the cost considerations. Ditto going through a
>>> switch. That is why I am thinking about limiting myself to two
>>> physical boxes.
>>
>> Yes, this will save you quite a bit. One thing you might want to
>> consider as well is seeing how much a dual quad-core box would cost
>> you - that way you could get 8 procs in one box, and the
>> performance would be better than two 4-proc boxes.
>
> ==>
> Based on no direct experience whatsoever, I agree (in other words, I
> have not configured such a box myself). But if you look at the 8 cpu
> numbers on the UNC machine above, what you basically have is quad core
> dual cpu boxes interconnected via infiniband. So the performance you
> see here for 8 cpu is what you should get from a standalone 4*2 core
> box. It is not that these quad core dual cpu's are so great; it's that
> gigabit ethernet is so bad. Being momentarily optimistic, I think this
> is one potential gain we will see in the next couple of years - 8 cpu
> configs doing mpi via shared memory - not blazing fast, but faster than
> the ethernet alternative, with less footprint and power drain. So if a
> lab really has problems that can be handled by 8 cpu's, it is a good
> solution, I think.
> <==
>>
>>> 8GB of memory, BTW.
>>
>> That's good - although the thing to remember here is memory per
>> core, so 8 GB is really only 2 GB per core. That said, that is
>> still a ton for MD - sander typically never needs more than about
>> 2 GB total for even the largest of systems.
> ==>
> Because you have to run a bunch of cpu's anyway to get much speed with
> md, and because one of the main memory costs is pairlist size and the
> pairlist divides nicely as you add cpu's, I have always found 1 GB/core
> adequate for pretty much anything with pmemd. For sander, you may need
> more; I defer to Ross on QM/MM, NEB, whatever else - these things could
> eat some memory I suppose. I would get good memory chips at a good
> price point, 1-2 GB per core max, but that view is influenced by what I do.
> <==
>>
>>> My reading of the benchmarking posts indicates I should scale
>>> reasonably well over the quad CPU cores, with quite a bit of
>>> falloff when going over the GB EN.
>>
>> Yes, but that is for a switch - which almost certainly (in
>> hindsight) had flow control turned off. I would expect at least a 5
>> to 6 times speedup for 8 cpus with a crossover cable. Plus, for
>> regular MD you should use PMEMD, which will both perform faster and
>> scale better.
>>
>>> Job mix would be PMEMD classical MD, steered MD (classical and
>>> QMMM) and NEB (classical and QMMM), mostly. I understand QMMM
>>> will not be very accelerated.
>>
>> That depends on the ratio of QM to MM. If you are doing pure QM at
>> present you'll see pretty much no speedup. However, everything in
>> the QM code is parallel in Amber 9 with the exception of the
>> matrix diagonalization and the density build. So you can get an
>> idea of how well it will scale by looking at the percentage of time
>> that these two operations take... [rationale elided]
>>
>> Note NEB should scale really well as replicas get shared out
>> across processors. So even a pure QM NEB calculation with 32
>> replicas should show close to an 8 times speedup on 8 cpus. Amber 10
>> will have an even better parallel NEB implementation - I am
>> currently testing this, and it will likely scale to 2048+ cpus :-)
>>
>> [elided]
>>
>> All the best
>> Ross
>> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> Hope this is as helpful to others as it was for me. Thanks to Ross
>> for sharing his insights with me and with the list.
>>

I, for one, would like to thank Drs. Walker and Duke for their insights.
This discussion has certainly focused my thoughts as I move toward
this purchase. I hope it has helped others as well.
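
A note to myself on Ross's point about QM/MM scaling: the time spent in the
matrix diagonalization and density build behaves like a serial fraction in
Amdahl's law, so a rough ceiling on the speedup is easy to estimate. A
minimal sketch (the 20% serial fraction is a made-up example, not a measured
number):

    # Rough Amdahl's-law ceiling for a QM/MM run whose serial part is the
    # matrix diagonalization + density build (fraction of total wall time).
    def qm_speedup(serial_fraction, ncpus):
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / ncpus)

    for ncpus in (2, 4, 8):
        print("%d cpus: %.1fx speedup" % (ncpus, qm_speedup(0.20, ncpus)))

With a 20% serial fraction, 8 cpus give only about a 3.3x speedup, which is
presumably why Ross says to look at the percentage of time those two
operations take before expecting much from more cores.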

One other input came via an email exchange with a straight computer
science person, knowledgeable about the technical issues but without
experience in numerical computation. I'll just quote him:

> Mm, well I hope you don't need a huge memory bandwidth.. All those
> processors share 1 memory controller.
>
> IMO you'd probably be better off with a 2x dual core Opteron system.

and in another email:

> Seems to me that 2 x dual core Opterons would probably be better since
> the interconnect is Hypertransport which would have much lower
> latency.

These comments were made in the context of 4 CPU cores per physical box.

So, my take on these issues, given the discussions above, is to try to
avoid GB ethernet whenever possible, consistent with the number of CPU
cores desired. I also need to look for numerical benchmarks comparing
current AMD and Intel processors - preferably Amber benchmarks on
systems of various sizes.

Thanks again to all who responded.

Bud Dodson

-- 
M. L. Dodson
Email:	mldodson-at-houston-dot-rr-dot-com
Phone:	eight_three_two-56_three-386_one
-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu