AMBER Archive (2007)

Subject: Re: AMBER: Sander slower on 16 processors than 8

From: Robert Duke (rduke_at_email.unc.edu)
Date: Thu Feb 22 2007 - 21:27:12 CST


Okay, I am finally getting a moment to look at this. Note that your runtime DOES decrease in the series 1 --> 2 --> 4 --> 8 processors. It is only at 16 processors that you are "in trouble".

Further note that the factor_ix benchmark puts a heck of a load on a 1 Gb ethernet connection. There is a ton of data flying around to support a pme calculation on 91,000 atoms, and it is not all just coordinates and forces. This benchmark uses a 144 x 90 x 80 fft grid, and that represents something like 8 bytes * 144 * 90 * 80 ~= 8.3 x 10**6 bytes, moved twice per md step for the distributed fft transpose (not strictly true, but approximately true as the processor count goes up - there is a portion of the grid that can stay local, but it gets smaller as the processor count goes up). So that is ~16.6 megabytes per step flying over a link that can carry at most about 125 megabytes/sec in each direction, and real tcp throughput is lower than that (see the back-of-envelope sketch just after the numbered list below).

Okay, the factor ix benchmark clocks away on a single processor for pmemd 9 at about 1 step/sec. Put it on 10 processors, and it should hit 1 step per 0.1 sec. But look: your interconnect is saturated with fft traffic alone, eating up on the order of 0.1 sec for just the distributed transpose.

Now, pmemd does a better job than sander at all this, but at lower processor counts sander is not bad at all, and it has been steadily improving. I am a computer geek, and I work nearly full time on making pmemd fast (so maybe the comments by the other fellow about the amber dev team were a little off target - I am not the only computer geek on the team, and our performance is looking rather good compared to other software packages these days). We emphasize throughput, not how many processors we can tie up. On really good hardware, pmemd will currently run factor ix reasonably on close to 400 processors and churn out something like 16 nsec/day. For smaller problems, performance is actually better as long as you don't reach for too many processors: the jac benchmark, ~23K atoms, will run at something like 22 nsec/day on 224 processors on the best hardware.

I cannot judge whether your times are good because you don't say how many steps you ran, but they don't look unreasonable for sander in amber 8 if I assume you ran the shipped benchmarks with the shipped mdin files (500 steps for factor ix, 1000 steps for jac). Now go out to the amber web site, amber.scripps.edu, and look at the amber 8 and amber 9 benchmarks there for pmemd. Then ask yourself why you are not running pmemd 9, if you are doing straightforward performance-constrained md. While you are out there, you may also want to read the notes I have put up on amber 8 performance issues on clusters (they mostly apply to amber 9 as well). It is actually sort of hard to get a commodity cluster running well. Ross pointed to some of the problems. Others include:
1) bad switches (Ross mentioned this). I use a crossover (XO) cable to get around this in my small test setup; otherwise, you had better buy a good (expensive) switch.
2) bad net cards. I use pricey dedicated server nics, with 1 Gb ethernet (a raw throughput check, like the second sketch after this list, will tell you quickly whether a link is healthy).
3) bad configuration of the s/w - you must use both the shmem and p4 devices for mpich (for mpich 1 that means configuring with something like --with-device=ch_p4 --with-comm=shared, so that ranks on the same node talk through shared memory), etc. There are lots of ways to screw this up. You must also configure tcp/ip properly in the OS for optimum performance. I can get mpich, mpich2, and lam all to run very well. I never figured out how to get intel mpi to do well, which was pretty annoying, seeing as it is supposed to be highly configurable (I had grief with multiple netcards or something; if you are going to charge me money for mpi, it should be easy to use).
4) undedicated net cards - just run your nfs traffic over the same net card and you have totally hosed yourself.
5) gigabit ethernet - well, we said it was not really adequate, period... But it will work "okay" for 4 processors, or maybe 8, if by "okay" you mean that the time to run keeps dropping as you add processors, though not near linearly.
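
To make the fft-traffic arithmetic above concrete, here is a back-of-envelope sketch in Python. The grid dimensions, the two-transposes-per-step approximation, and the ~1 step/sec single-processor rate come straight from the discussion above; the 125 MB/sec figure is the theoretical 1 Gb ethernet payload ceiling, so real numbers will be somewhat worse:

# Back-of-envelope: distributed-fft transpose traffic for factor_ix.
GRID = (144, 90, 80)        # pme fft grid dimensions
BYTES_PER_POINT = 8         # double precision
TRANSPOSES_PER_STEP = 2     # full grid crosses the wire ~twice per md step
LINK_BYTES_PER_SEC = 125e6  # 1 Gbit/s ceiling; real tcp throughput is lower

grid_bytes = BYTES_PER_POINT * GRID[0] * GRID[1] * GRID[2]
per_step = TRANSPOSES_PER_STEP * grid_bytes
wire_time = per_step / LINK_BYTES_PER_SEC

print(f"grid:        {grid_bytes / 1e6:5.1f} MB")   # ~8.3 MB
print(f"per md step: {per_step / 1e6:5.1f} MB")     # ~16.6 MB
print(f"wire time:   {wire_time:5.2f} s/step")      # ~0.13 s

# At ~1 step/sec on one processor, 10 processors would ideally need 1 step
# per 0.1 s - but the transpose alone needs ~0.13 s of wire time, so the
# network, not the cpus, sets the floor.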
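
Since bad switches, bad nics, and shared nics (items 1, 2, and 4) all show up first as poor raw link throughput, it is worth measuring that directly before blaming the md code. Below is a minimal, hypothetical checker script - not anything that ships with amber, and the port number is an arbitrary choice. Run "python tcptest.py server" on one node and "python tcptest.py client <server-host>" on another, then compare the reported rate against the ~125 MB/sec gigabit ceiling:

import socket
import sys
import time

PORT = 5001          # arbitrary unprivileged port (assumption - pick any free one)
CHUNK = 1 << 20      # move data in 1 MB chunks
TOTAL = 256 * CHUNK  # push 256 MB per measurement

def server():
    # Accept one connection and drain everything the client sends.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind(("", PORT))
    s.listen(1)
    conn, _ = s.accept()
    received = 0
    while received < TOTAL:
        data = conn.recv(CHUNK)
        if not data:
            break
        received += len(data)
    conn.close()
    s.close()

def client(host):
    # Push TOTAL bytes at the server and report the achieved rate.
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, PORT))
    buf = b"x" * CHUNK
    start = time.time()
    sent = 0
    while sent < TOTAL:
        s.sendall(buf)
        sent += len(buf)
    s.close()
    # Kernel buffering makes this a slight overestimate; fine for a sanity check.
    print(f"{sent / (time.time() - start) / 1e6:.1f} MB/s")

if __name__ == "__main__":
    if len(sys.argv) >= 2 and sys.argv[1] == "server":
        server()
    elif len(sys.argv) >= 3 and sys.argv[1] == "client":
        client(sys.argv[2])
    else:
        print("usage: tcptest.py server | tcptest.py client <host>")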

Please read the wealth of info on the amber.scripps.edu website, look at pmemd, and read the READMEs that ship with pmemd. Consider amber 9 if you care about performance, because it is significantly better than 8, which is better than 7, which is better than 6... There are folks out there producing other packages who compare themselves to "amber" by running amber 6 sander and then proclaim that amber cannot do parallel...

Best Regards - Bob Duke

----- Original Message -----
  From: Sontum, Steve
  To: amber_at_scripps.edu
  Sent: Thursday, February 22, 2007 3:32 PM
  Subject: AMBER: Sander slower on 16 processors than 8

  I have been trying to get decent scaling for amber calculations on our cluster and keep running into bottlenecks. Any suggestions would be appreciated. The following are benchmarks for factor_ix and jac on 1-16 processors, using amber8 compiled with pgi 6.0 (except for the lam runs, which used pgi 6.2).


  BENCHMARKS

  MPI             benchmark    1p    2p    4p    8p   16p
  mpich1 (1.2.7)  factor_ix   928   518   318   240   442
  mpich2 (1.0.5)  factor_ix   938   506   262     *
  mpich1 (1.2.7)  jac         560   302   161   121   193
  mpich2 (1.0.5)  jac         554   294   151   111   181
  lam (7.1.2)     jac         516   264   142   118   259

  * timed out after 3 hours
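
  A quick way to read these numbers is as parallel speedup (time on 1 processor divided by time on N) and efficiency (speedup divided by N). A short Python sketch over the times from the table above - the definitions are the standard ones, nothing amber-specific:

# Speedup (t1/tN) and efficiency (speedup/N) for the benchmark times above.
times = {
    "mpich1 factor_ix": {1: 928, 2: 518, 4: 318, 8: 240, 16: 442},
    "mpich1 jac":       {1: 560, 2: 302, 4: 161, 8: 121, 16: 193},
    "mpich2 jac":       {1: 554, 2: 294, 4: 151, 8: 111, 16: 181},
    "lam jac":          {1: 516, 2: 264, 4: 142, 8: 118, 16: 259},
}
for name, t in times.items():
    cols = "  ".join(
        f"{n:2d}p {t[1] / tn:4.2f}x ({t[1] / tn / n:3.0%})"
        for n, tn in sorted(t.items())
    )
    print(f"{name:18s} {cols}")
# mpich1 jac, for example: 8 processes give ~4.6x (58% efficiency), but 16
# processes fall back to ~2.9x - slower in absolute time than 8, so the
# extra processors are a net loss.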

  QUESTIONS

  First off, is it unusual for the calculation to get slower as the number of processes increases?

  Does anyone have benchmarks for a similar cluster, so I can tell if there is a problem with the configuration of our cluster? I would like to be able to run on more than one or two nodes.


  SYSTEM CONFIGURATION

  The 10 compute nodes use 2.0 GHz dual-core Opteron 270 chips with 4 GB memory and 1 MB cache, Tyan 2881 motherboards, an HP ProCurve 2848 switch, and a single 1 Gb/sec ethernet connection to each motherboard. The master node is configured similarly but also has 2 TB of RAID storage that is automounted by the compute nodes. We are running SuSE with the 2.6.5-7-276-smp kernel. Amber8 and mpich were compiled with pgi 6.0.


  I have used ganglia to look at the nodes while a 16-process job is running. The nodes are fully consumed by system CPU time: user CPU time is only 5%, and the node is pushing only 1.4 kBytes/sec out over the network.

  Steve

  ------------------------------

  Stephen F. Sontum
  Professor of Chemistry and Biochemistry
  email: sontum_at_middlebury.edu
  phone: 802-443-5445

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu