AMBER Archive (2007)

Subject: RE: AMBER: about parallelization in QM-MM

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Fri Aug 17 2007 - 11:58:54 CDT


> USER (fr), PR(25), NI(0), VIRT(251m), RES(28m), SHR(from 10m to 7000), S(R),
> %CPU(100), %MEM(0.2), TIME(...), COMMND(sander.MPI)
>
> about the same for the other three nodes. The small %MEM for each node
> reflects the availability of 4 GB per node. As Ross warned, drastically
> reducing the available memory per node (the nodes are mainly used for
> "ab initio" calculations, so things are set up so that all available memory
> can be used) might accelerate the procedure.

No... What I said was that for 'other codes', such as Gaussian, it may improve
performance to tell them NOT to use as much memory. For AMBER QM/MM it will
make no difference: you cannot tell it how much memory to use. Sander is
smart enough (since we changed over to Fortran 95 ages ago) to work out the
optimum amount of memory itself. So you really, really don't need to concern
yourself with anything regarding memory, except to make sure you have enough
that the code is not swapping to disk!

> One has to know how the program works to judge about that.

You have the source code, go take a look... That's how I learnt it.

> Increase in speed with respect to previous quite similar runs, where I had
> forgotten to state "mpirun -nc 4", is about 10-20%.

Speed-up for pure QM runs will not be good right now, since the matrix
diagonalization step dominates and it is not parallelized. If you are feeling
bored (or masochistic) then by all means please parallelize this step for
me.

I am working to improve the parallel performance (as well as the serial
performance) of this section of the code in AMBER 10 and beyond, but how much
gets done really depends on my funding situation over the next year. Right
now it looks like certain people at NSF, who shall for the moment remain
nameless, are intent on destroying SDSC and with it my ability to
independently determine which projects I work on. Instead I am at the mercy
of "what I can get funded," and thus can really only work on this stuff as a
hobby.

Note that for QM/MM runs with explicit water, periodic boundaries and PME you
should see a reasonable speed-up of maybe 2 to 2.5x on 4 processors. It will
depend on the size of the QM system and the size of the MM system. The
matrix diagonalization scales as N^3 and so quickly dominates as the QM atom
count increases.
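
If you want a feel for why the speed-up tails off, here is a rough
Amdahl's-law style sketch in Python. The cost constants in it are invented
purely for illustration; they are not AMBER's actual timings:

    # Rough sketch of how a serial O(N^3) diagonalization limits QM/MM
    # parallel speed-up. All cost constants are made up for illustration.

    def estimated_speedup(n_qm_atoms, n_procs,
                          mm_cost=1000.0, diag_cost=1.0e-4):
        diag = diag_cost * n_qm_atoms ** 3     # serial part, scales as N^3
        rest = mm_cost + 10.0 * n_qm_atoms     # assumed perfectly parallel part
        serial_time = diag + rest
        parallel_time = diag + rest / n_procs  # diagonalization stays serial
        return serial_time / parallel_time

    for n in (10, 50, 100, 200):
        print(n, "QM atoms ->", round(estimated_speedup(n, 4), 2), "x on 4 procs")

With these made-up constants the 4-processor speed-up drops from essentially
4x at 10 QM atoms to roughly 2.5x at 200, which is the qualitative behaviour
described above.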

You might ask why codes such as Gaussian and NWChem scale reasonably well in
parallel for pure QM calculations, so I shall jump the gun on this and
attempt to explain. The key difference here is the time taken to do an SCF
step. For a reasonably large ab initio calculation the time per SCF step can
be seconds to hours. Here it is the calculation of the two-electron integrals
that typically dominates, and NOT the matrix diagonalization. This is easily
parallelized, which is why they see a speed-up. In fact, last time I checked,
Gaussian 98 just did all its diagonalization in serial. G03 might now do it
in parallel but I haven't looked. The matrices for an ab initio QM
calculation are also orders of magnitude larger, which means they take much,
much longer to diagonalize, and so it is easier to do in parallel with
something like a block Jacobi method, since the communication latency doesn't
kill you.

The key point is that for an optimization using ab initio QM you might be
looking at, say, a day's computation time, during which the code might do say
300 to 400 individual SCF steps and thus 300 to 400 matrix diagonalizations.
So even if you make everything else go to zero time, you still have about
4 minutes available per matrix diagonalization. Thus there is a lot of scope
for improving the efficiency in parallel, since there is plenty of work to be
done for a given amount of communication.
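
To put numbers on that (plain arithmetic, nothing AMBER-specific):

    # Back-of-the-envelope for the ab initio optimization case above.
    day_seconds = 24 * 3600               # one day of wall-clock time
    diagonalizations = 350                # ~300-400 SCF steps, one diag each
    budget = day_seconds / diagonalizations
    print("%.1f minutes per diagonalization" % (budget / 60.0))   # ~4.1 minutes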

On the other hand, with semi-empirical methods, especially when you want to
run MD, you do many orders of magnitude more SCF steps. E.g. assume you want
to do 1 ns of MD at a 1 fs time step. This is 10^6 iterations. Then assume
you need 10 SCF steps per MD step, so 10^7 SCF steps. Say you want to do that
1 ns in 24 hours. This equates to a rate of 86.4 ms per MD step, or about
8.6 ms per SCF step. Hence, even if you again make everything else go to
zero, you have at most 8.6 ms to do a matrix diagonalization. And this then
gets very, very hard to do in parallel. E.g. communication latency is
typically around 4 microseconds or so and bandwidth is maybe 2 GB per sec
sustained if we are just talking shared memory (note for comparison that
gigabit ethernet has an achievable bandwidth of around 100 MB per sec at the
top end). So if your matrix is, say, 1000 x 1000 (7.6 MB in double
precision), just distributing it to the other processors (or reading it into
their cache) requires at a minimum 8 microseconds of latency + 3.7 ms of
transport time. This leaves less than 5 ms for the computation, and even then
you have to store the result somewhere. Hence you should be able to see the
underlying problem here and why doing QM runs in parallel on MD-type
timescales is very hard. Not to mention all the issues concerning cache
coherency, race conditions etc. that accompany running calculations in
parallel.
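
Again, as plain arithmetic (the bandwidth and latency figures are just the
rough numbers quoted above):

    # Back-of-the-envelope for the semi-empirical MD case above.
    md_steps   = 10**6                   # 1 ns of MD at a 1 fs time step
    scf_per_md = 10                      # assumed SCF iterations per MD step
    wall       = 24 * 3600.0             # target: 1 ns in 24 hours
    per_scf    = wall / (md_steps * scf_per_md)       # ~8.6 ms per SCF step

    n = 1000                             # matrix dimension
    matrix_bytes = n * n * 8             # double precision, ~7.6 MB
    bandwidth = 2.0 * 2**30              # ~2 GB per sec sustained (shared memory)
    latency   = 4.0e-6                   # ~4 microseconds per message
    transfer  = 2 * latency + matrix_bytes / bandwidth

    print("per-SCF budget : %.1f ms" % (per_scf * 1e3))              # ~8.6 ms
    print("moving matrix  : %.1f ms" % (transfer * 1e3))             # ~3.7 ms
    print("left to compute: %.1f ms" % ((per_scf - transfer) * 1e3)) # ~4.9 ms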

Note that above about 90 QM atoms or so you may see better performance,
assuming you have MKL installed and have built against it, if you edit the
config.h file and add -DUSE_LAPACK to the FPP_FLAGS line. Then make clean and
build sander again. This uses the LAPACK diagonalization routine in place of
the built-in routine. It can be quicker, depending on the machine and the
LAPACK installation, for upwards of around 90 QM atoms or so. For less than
this it is likely to be notably slower. Note this is UNDOCUMENTED,
UNSUPPORTED and EXPERIMENTAL, so use it at your own risk, make sure you run
all the test cases, and I suggest you keep two executables, one for small QM
atom counts and one for large. I am hoping to make this all automatic by
Amber 10 so that the code will pick what it believes will be the faster
routine. Again, though, whether or not this gets done really depends on the
stability of my funding over the coming months.
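
For reference, the change is just something along these lines (the other
flags on your FPP_FLAGS line depend on your platform and configure options,
so treat this as schematic only):

    # In config.h (typically $AMBERHOME/src/config.h), append -DUSE_LAPACK
    # to the existing FPP_FLAGS line, e.g.:
    FPP_FLAGS= <your existing preprocessor flags> -DUSE_LAPACK

    # then "make clean" and rebuild sander exactly the way you built it before.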

> From the DivCon manual one learns that parallelized and non-parallelized
> versions are released. Which one is used here I was unable to find out.

My understanding is that all parallel DivCon actually does is run each of the
fragments individually on different processors. Hence only the
divide-and-conquer algorithm runs in parallel, and the scaling is then
determined by how many fragments you have. The parallel divide-and-conquer
algorithm does not work with the divcon interface to sander, and it would
likely take a lot of effort to get it to work. That is separate from the
question of whether divide-and-conquer approaches are even appropriate for MD
simulations. I see no problem with them for minimization, but it seems to me
that there may be certain discontinuities in the gradients that would prevent
you from running an accurate MD simulation with them. In addition, to make
use of this you would also have to forego periodic boundaries and, more
importantly, PME electrostatics. It is not clear to me if or how a
divide-and-conquer algorithm could be made to work with the concept of doing
PME for QM/MM calculations. This would take significant theoretical work to
determine the appropriate mathematics before one could even start to
implement it.

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu