AMBER Archive (2008)
Subject: Re: AMBER: massively parallel computation
From: Robert Duke (rduke_at_email.unc.edu)
Okay, several points. First of all, we (meaning mostly me - we have a lean and mean staffing profile for pmemd, and have been using one guy to come pretty close to keeping up with efforts by other groups using between 10 and 30 folks and many, many $$$) have had an aggressive parallel performance effort in amber/pmemd for the last several years. We have greatly increased the capabilities of amber not in terms of its ability to eat up time on large piles of processors, but in terms of its ability to produce maximal nsec/day of simulated time with minimal resources - we emphasize THROUGHPUT, not how many thousands of processors we can tie up (which, by the way, most folks can't lay their hands on anyway). We have done this so far without making any compromises whatsoever in the accuracy/precision of results using the amber forcefields. Currently, there are some programs/systems running faster than we are. I have not studied this issue extensively yet, but I do know that in at least some instances, compromises have been made in arithmetic precision and energy conservation to meet the goal of higher performance. I am interested in these tradeoffs, but completely unwilling to make them without a reasonable degree of certainty that the actual quality of results is not being sacrificed (I regard it as an open research question). So if you look at our benchmarks pages, we are actually doing quite well against things like namd, though we don't do direct comparisons (and some benchmark comparisons are apples/oranges). I think namd is finally a bit ahead of us due to its ability to do very fine-grained workload distribution through its charm++ (or whatever it is) parallelization layer.
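To make the throughput framing concrete, here is a minimal sketch (in Python) of the two quantities in play: nsec/day of simulated time, and parallel efficiency, which falls off as processors are added. The per-step timings below are illustrative made-up numbers, not AMBER benchmarks:

```python
def ns_per_day(dt_fs, steps_per_second):
    """Simulated nanoseconds per wall-clock day: fs/step * steps/s * s/day."""
    return dt_fs * steps_per_second * 86400 * 1e-6

def parallel_efficiency(t1, tn, n):
    """Serial step time divided by (n processors * parallel step time)."""
    return t1 / (n * tn)

# Illustrative, assumed wall-clock seconds per MD step for a fixed-size
# system with a typical 2 fs timestep (numbers invented for the example):
timings = {1: 1.0, 64: 0.02, 256: 0.008, 1024: 0.005}
for n in sorted(timings):
    eff = parallel_efficiency(timings[1], timings[n], n)
    rate = ns_per_day(2.0, 1.0 / timings[n])
    print(f"{n:5d} procs: {rate:7.2f} ns/day, efficiency {eff:.2f}")
```

In this toy model, going from 256 to 1024 processors quadruples the resources consumed for only about a 1.6x gain in ns/day - exactly the throughput-versus-scaling tradeoff being emphasized above.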
This is a great idea in that it allows better overlap between computation and communication than we will ever achieve using Fortran 90 plus MPI, but the difference is less than a factor of two (I have not posted amber 10 benchmarks yet; they are better than 9, but not a huge jump up - we are hitting some limitations with PME that are going to be hard to get around). Okay, so we can now get something like 26 nsec/day for ~23K atom PME simulations (JAC) on good hardware (SP5's, for instance; I think I get close to that on Lonestar too, but would have to look it up). We do this typically in the range of low 100's of processors, which is pretty good throughput, and performance per atom typically improves as you increase the size of the system, at least into the low 100's of thousands of atoms. Beyond that we run into a fundamental tradeoff we made to get really good performance in the 100's-of-processors range, and don't scale much further. I'll continue thinking about the problem, but right now the machines being built are not really addressing our problem space - we need more interconnect bandwidth and lower interconnect latency, and we are not getting it, because it is much cheaper to slap together a large pile of multicore chips and count the flops. Okay, finally, to the paper. The paper you are referencing here is about Shaw's NT (neutral territory) method. It is probably a pretty good idea in the limit where the number of atoms each atom interacts with is the limiting factor on performance, because it cuts that number. Mind you, that number remains large even with NT, but NT does cut it. Well, here's the deal. I can do things in pmemd that will cut the atom interaction number in half and lose performance (really, I have done a couple of different things that reduced interactions in the 20-50% range, I believe, and I lost ground). Why?
Because performance is the sum of many, many things, and each decision you make has costs, often hidden. So NT undoubtedly has costs, and implementing it in pmemd would require heaven only knows how many months of completely tearing up and reworking the fundamental architecture, and in the end it probably would not help. I recently attended a talk given by Shaw. He is really pushing the limits of performance on the MD problem, but he is doing it by 1) lowering precision, and 2) more importantly, practically moving the entire problem into hardware, where he can parallelize everything at the hardware level. So he projects (but does not yet have running, as far as I know) something like a 100x speedup over current common codes (I am pulling this number from memory rather than digging out the Anton paper at the moment - it is something like that, maybe more). Now here's the point - I heard him say in the talk that in pushing into the "Anton range of performance", he is finally seeing an NT benefit. So that is where NT can help you: when you have used hardware to slay all the other dragons that kill you. Getting MD to run really, really fast is a very icky problem...
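For a rough sense of the interaction counts under discussion, the number of neighbors inside a cutoff sphere can be estimated from density times sphere volume. A minimal sketch, assuming a water-like density of ~0.1 atoms/Å³ and a 9 Å cutoff (both round numbers I am supplying for illustration, not figures from the post):

```python
import math

def atoms_in_cutoff_sphere(cutoff_A, density_per_A3=0.1):
    """Estimate neighbor count per atom: density * volume of cutoff sphere.
    The 0.1 atoms/Angstrom^3 default is an assumed water-like density."""
    return density_per_A3 * (4.0 / 3.0) * math.pi * cutoff_A ** 3

full = atoms_in_cutoff_sphere(9.0)  # roughly 300 neighbors at a 9 A cutoff
halved = full / 2                   # roughly 150 even after halving
```

Even halving the per-atom interaction count leaves on the order of 150 neighbors per atom, and those pair evaluations are only one term in total runtime - consistent with the point above that cutting interactions alone can still lose ground to hidden costs elsewhere.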
Recently, a few algorithms have been developed to enable massively parallel computation, which can efficiently use hundreds of CPUs simultaneously for MD simulation; see, for example, J Comput Chem 26: 1318-1328, 2005.
Is there a plan to implement such an algorithm in Amber/PMEMD? As computer clusters get cheaper and cheaper, cluster sizes keep expanding quickly as well. Such algorithms should be very helpful, and indeed indispensable, for reaching >ms-scale simulation.