AMBER Archive (2008)

Subject: Re: AMBER: amber 10: sander and pmemd performance

From: Vlad Cojocaru (Vlad.Cojocaru_at_eml-r.villa-bosch.de)
Date: Fri Jul 25 2008 - 10:14:08 CDT


Dear Bob,

Thanks a lot for your email. For someone who has just realized how
important it is to compile and use a customized build of a piece of
software such as AMBER, the details you give here are very valuable. In
the last week or so I have learned a lot about compilation, libraries,
compilers and so on... and it all started with that "output problem" I
reported some weeks ago, when I realized that I am better off doing the
compilation myself.

So, for now I am far from wanting to run pmemd at the highest possible
performance level. I managed to compile the versions I mentioned
previously and they all run fine. For my system, the speed achieved on
4 cores (1 AMD Opteron node with 2 dual-core CPUs), about 0.27 ns/day
for a 64K-atom system (NVE, 1 fs time step), is fine for now. We have
small gigabit ethernet clusters, so I did not put much effort into
testing at higher processor counts across different nodes for the same
run, because the scaling of pmemd and sander is very poor on our
cluster. We also had some nasty I/O problems when running jobs across
different nodes.
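With a 1 fs time step the conversion from ns/day to time per step is
just arithmetic: 0.27 ns/day is 270,000 steps/day, i.e. about 3.1 steps
per second, or roughly 0.32 s per step across the 4 cores. A one-liner
for the conversion, with the ns/day figure and the time step as the
only inputs:

    # convert ns/day at a given time step (fs) into steps/s and s/step
    awk 'BEGIN { nsday = 0.27; dt_fs = 1.0;
                 steps = nsday * 1.0e6 / dt_fs;
                 printf "%.0f steps/day  %.2f steps/s  %.3f s/step\n",
                        steps, steps / 86400, 86400 / steps }'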

I was just surprised that compiling with ifort did not improve the
performance of pmemd compared to gfortran. Also, I was a bit surprised
to see that pmemd compiled with ifort+mpich2 is about 5% slower than
pmemd compiled with ifort+openmpi. I thought maybe there was some
obvious option for the ifort compiler that I hadn't considered, and
that was the reason I asked the question.
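One mundane thing worth ruling out before blaming compiler options is
that the MPI wrappers used for the build really invoke ifort and that
the intended libraries were linked; an MPI library built around
gfortran will quietly build pmemd with gfortran even if ifort is on the
PATH. A quick sanity check (the executable path is simply whatever the
local build produced):

    mpif90 -show        # MPICH2: show the backend compiler and flags
    mpif90 --showme     # Open MPI: same information
    ifort -V            # confirm the ifort version actually picked up
    # check which MPI and math libraries the binary is linked against
    ldd $AMBERHOME/exe/pmemd | grep -i -e mpi -e mkl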

However, we will soon get an InfiniBand cluster of AMD Opterons (I do
not know the exact configuration yet), and I will certainly be using
the information you sent here to build AMBER 10 for that cluster. So
I'm glad that you took the time to write down all these details.

Thanks again

Cheers
vlad

Robert Duke wrote:
> There are probably at least 10, if not 20, different things going on
> here, some of which you are talking about, some of which you are not.
> I have no idea how many processors you are using. I don't know your
> interconnect. This stuff can be impacted by 1) compiler choice, 2)
> compiler options choice, 3) mpi choice, 4) how mpi was built, 5) how
> mpi was configured, 5a) how the system communications stacks are
> configured, 6) how pmemd was configured to be optimized given the
> hardware and software in play, 7) the hardware that is being used, in
> terms of specifics about a) cpu speed, b) cpu cache size, c) multicore
> impacts on memory and other communications bandwidths, d) the system
> buses in use, e) the net cards in use, and 8) the actual benchmarks in
> use - the size of the benchmark can make a big difference in
> performance, depending on how the modeled system size matches the
> cache size, and the processor count (as the processor count comes
> down, more is done in each individual processor in terms of total
> memory requirements, and at some point you run out of cache, which
> can really make a difference in performance, for example). A wide
> range of options chosen in mdin can also totally whack performance.
>
> So what I did in the amber 8 and 9 timeframes was cook up a bunch of
> specific configurations with known characteristics, and I carefully
> optimized the software and provided configuration options to target
> those machines. It is not a simple matter to then move to any new
> machine, new compiler, or new implementation of any supporting
> library and see performance STAY THE SAME, LET ALONE GET BETTER. It
> is really, really easy to dink up the performance of this sort of
> code; sad but true. It is basically optimized to sit on the edge of a
> bunch of interlocking bottlenecks; push it a little in any direction,
> and you start running slower for a different reason. So I am sorry
> about that, but the community needs to realize that this is not a
> simple matter. I have spent a couple of decades working, to one
> extent or another, on performance and reliability issues on computers
> in the computer industry, and I have to approach each new
> configuration carefully, or I won't get particularly good performance
> (I actually build about two dozen different versions of pmemd and
> just run them at present). I choose not to spend the rest of my life
> doing this for every combination of hardware and software folks can
> dream up; I would really recommend that if you are serious about
> running amber fast, you take a look at what we support well
> currently, and consider making purchases in that direction.
>
> Right now, that probably means that the best choice in mpi (for
> cluster builders) is good infiniband hardware + mvapich and the ifort
> compiler. I would choose faster cpu's and lower core count, hang the
> additional cost, if molecular dynamics is your thing. If I had or
> really wanted amd processors, I would choose pathscale compilers -
> they are fast and work well. PGI is my third choice in compilers, but
> this may be because of past issues that are not that big a deal
> anymore - they have made an honest effort to respond to past problems
> and should be given credit. For me, intel compilers have always been
> pretty darn fast, but a bit of a pain to use in terms of them
> changing things; still, they really know how to write a code
> optimizer. If you are stuck with ethernet, well, don't expect much,
> but we support lam, mpich 1, and mpich 2; they all work well, and
> they are really pretty easy to install (I think lam in particular may
> be pretty easy; for historical reasons I mostly settled on mpich
> 1-vintage stuff myself; I found that all the additional features of
> mpich 2 were mostly a hassle for my small clusters).
>
> If you want to run a totally new configuration and see what it will
> do with pmemd, then you need to 1) optimize the mpi configuration,
> carefully, on your machine, 2) optimize the compiler configuration,
> carefully, on your new machine, with pretty aggressive compiler
> options, and 3) go through building pmemd every way possible (all the
> various optimization options) to see what you get. Then run different
> size benchmarks, different size runs, and so on.
>
> And another big point: if you are not willing to spend money on the
> rest of your configuration, to get stuff that is recommended and
> known to work, and to spend the effort to set up the recommended
> configurations, then maybe you shouldn't be so surprised that you
> don't get the best performance. I wouldn't be...
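For step 3, the brute-force comparison can be scripted. The sketch
below simply times the same short benchmark with each build and reports
wall-clock seconds; the executable names, process count, and input
files are placeholders for whatever the local builds and test system
actually are:

    #!/bin/sh
    # Time an identical short benchmark (same mdin/prmtop/inpcrd, a few
    # hundred steps) with each pmemd build and report wall-clock seconds.
    for exe in pmemd.ifort.openmpi pmemd.ifort.mpich2 pmemd.gfortran.openmpi
    do
        start=$(date +%s)
        mpirun -np 4 ./$exe -O -i mdin -p prmtop -c inpcrd \
               -o mdout.$exe -r restrt.$exe
        end=$(date +%s)
        echo "$exe: $((end - start)) s wall time"
    done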
>
> Two potentially interesting notes:
> 1) You should not really expect pmemd 10 to be much faster than pmemd
> 9 on a small cpu configuration unless you are running NVE or NVT with
> the default value for ene_avg_sampling; in the development work for
> 10, I found very little, aside from things associated with this
> option, that would improve performance at low processor count (and as
> I have said elsewhere, for the right nve benchmarks on the right
> machines, single-processor nve performance can be as much as 30%
> better, but this is sort of a best case). (A minimal NVE input of
> this kind is sketched after these notes.)
> 2) We will reasonably soon be supporting configurations using Intel
> MPI on Infiniband. This stuff has better performance than anything I
> have seen for commodity clusters - SUBSTANTIALLY better, and looks to
> me to be worth the money.
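A minimal sketch of the kind of NVE input referred to in note 1,
written as a shell here-document; the step count, cutoff, and output
intervals are illustrative placeholders, and ene_avg_sampling is simply
omitted so that it keeps its default value:

    # minimal NVE benchmark input (values are illustrative placeholders)
    cat > mdin.nve << 'EOF'
    NVE benchmark, 1 fs time step
     &cntrl
       imin = 0, irest = 1, ntx = 5,
       nstlim = 1000, dt = 0.001,
       ntt = 0, ntb = 1, ntp = 0,
       cut = 8.0,
       ntpr = 500, ntwx = 0, ntwr = 1000,
     /
    EOF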
>
> Regards - Bob Duke
>
> ----- Original Message ----- From: "Vlad Cojocaru"
> <Vlad.Cojocaru_at_eml-r.villa-bosch.de>
> To: "AMBER list" <amber_at_scripps.edu>
> Sent: Friday, July 25, 2008 5:57 AM
> Subject: AMBER: amber 10: sander and pmemd performance
>
>
>> Dear ambers,
>>
>> I have compiled AMBER 10 with the intel compilers ifort and icc
>> (10.1) and 2 different mpi libraries: v1 = mpich2 1.0.7; v2 =
>> openmpi 1.2.6 (MKL was used in both). Both MPI libraries were
>> compiled with the same compilers as the AMBER package. NetCDF
>> support was included in all builds.
>>
>> I tested these builds on a 60K-atom system on 4 cores of an AMD
>> Opteron machine with 2 dual-core CPUs (OS: Debian Linux). I then
>> compared them with an older build of AMBER 9 with gcc 4.1 (gfortran)
>> and openmpi 1.2.5 (v3). I used both the NPT and NVE ensembles for
>> testing.
>>
>> To my surprise, there is little difference in performance between
>> the v1 and v2 builds of AMBER 10 and the old build of AMBER 9 with
>> gcc. The new sander.MPI is only about 6-8% faster, while the new
>> pmemd is about the same speed. Interestingly, the v2 build
>> (intel+openmpi) was slightly faster than v1 (compiled with mpich2).
>> Also, a different build using pgi 7.1 and openmpi 1.2.5 is very
>> similar in performance to the intel ones.
>>
>> In general, sander.MPI runs 0.155 to 0.165 ns/day of NPT simulation,
>> while pmemd runs 0.24 to 0.25 ns/day of NPT simulation and 0.245 to
>> 0.267 ns/day of NVE simulation. All simulations use a time step of
>> 1 fs.
>>
>> I would like to ask whether, in your experience, these performance
>> figures are what one would expect on such a machine. I was hoping
>> that the intel compilers would produce significantly faster
>> executables (around 15-20%) compared to gfortran, but this is not
>> the case (or maybe the increase in performance only shows up at
>> higher CPU counts?). Is there something one can play with when
>> compiling with the intel compilers to increase performance? There
>> are several messages in the AMBER archives suggesting that the intel
>> compilers produce faster executables...
>>
>> Best wishes
>> vlad
>>
>

-- 
----------------------------------------------------------------------------
Dr. Vlad Cojocaru

EML Research gGmbH
Schloss-Wolfsbrunnenweg 33
69118 Heidelberg

Tel: ++49-6221-533266
Fax: ++49-6221-533298

e-mail: Vlad.Cojocaru[at]eml-r.villa-bosch.de

http://projects.villa-bosch.de/mcm/people/cojocaru/

----------------------------------------------------------------------------
EML Research gGmbH
Amtgericht Mannheim / HRB 337446
Managing Partner: Dr. h.c. Klaus Tschira
Scientific and Managing Director: Prof. Dr.-Ing. Andreas Reuter
http://www.eml-r.org
----------------------------------------------------------------------------

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
to majordomo_at_scripps.edu