AMBER Archive (2009)

Subject: RE: [AMBER] MKL error ?

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Thu Feb 05 2009 - 12:29:45 CST


Hi Marek,

> #1
> your solution "OMP_NUM_THREADS=1" is working !
>
> When I wrote "OMP_NUM_THREADS=1" and of course "export OMP_NUM_THREADS" on
> the command line before starting the amber tests, all four tests
> (test.serial, test.serial.QMMM, test.parallel, test.parallel.QMMM)
> passed without any problems!
>
> It seems to me that the MKL problem is probably mainly connected with
> sander when igb=1, because I have compiled NAB with the -openmp flag and
> I can use it without any problems, for example with OMP_NUM_THREADS=8.
> I just tested it on the common normal mode analysis program:

The issue comes about because MKL versions 10.0 and onwards contain internal
parallelization using openMP threads. Not all of the MKL routines include
this, so not everything is affected. GB in sander is affected because it
makes extensive use of calls to MKL (mainly the vector functions). The same
is true for QMMM, which uses the matrix diagonalizers in MKL. This can of
course cause problems, since the sander code uses MPI for its
parallelization. For example, consider a job running with 4 MPI tasks on a
quad-core machine. The code internally might call a vector exponential
routine, so the MPI code issues 4 calls to vexp, each with a quarter of the
array; hence each processor works on a quarter of the array. But then the
Intel MKL openMP kicks in and fires up 4 threads for each call to vexp,
because it thinks you have 4 cpus available (it knows nothing about the
other MPI tasks). The net result is that you get 16 threads running on 4
processors that all thrash like crazy, and your net performance goes down.
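
To make the arithmetic concrete, here is a minimal stand-alone sketch
(purely illustrative, not AMBER code) that you could build with something
like "mpicc -fopenmp". Each MPI task asks openMP how many threads a
threaded library such as MKL would be allowed to spawn inside it; with
OMP_NUM_THREADS unset on a quad-core node, each of the 4 tasks typically
reports 4, i.e. 16 threads in total:

#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int rank, ntasks, nthreads;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);

    /* omp_get_max_threads() is the thread count an openMP-threaded
       library call (e.g. MKL's vector exponential) would use inside
       this MPI task. */
    nthreads = omp_get_max_threads();

    printf("MPI task %d of %d: openMP would use %d threads (total %d)\n",
           rank, ntasks, nthreads, ntasks * nthreads);

    MPI_Finalize();
    return 0;
}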

Some argue that what you should do is run 1 MPI process per node and then
ncores openMP threads within each node - so-called hybrid programming, which
was supposed to be the holy grail for multicore chips and save us all, but
in practice it doesn't really work.

However, it is useful to have the openMP MKL available, since some routines,
such as the dsyevr diagonalizer, are openMP parallel in MKL but are not MPI
parallel in the code. Only the master task calls the diagonalization and
all other tasks block at that point. Hence, with OMP_NUM_THREADS set to 4,
the code would idle 3 of the MPI processes at the diagonalization and in
their place a total of 4 openMP threads get spawned by the master. This
works well in some cases and badly in others, depending mainly on how the
MPI implementation does the blocking. If it just spins the processor,
constantly checking for interrupts, then sitting at the barrier takes 100%
of that processor's time and it cannot run an openMP thread instead.
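
As a rough sketch of that pattern (illustrative only - sander itself is
Fortran, the routine name here is made up, and I am using MKL's C LAPACKE
interface, available in newer MKL versions, just for clarity): the master
task calls the threaded diagonalizer while the other MPI tasks sit at a
barrier and then receive the results:

#include <stdlib.h>
#include <mpi.h>
#include <mkl.h>   /* MKL_INT, LAPACKE_dsyevr */

/* Diagonalize the n x n symmetric matrix a on the master task only;
   eigval (length n) and eigvec (n x n) are broadcast to all tasks. */
void diagonalize_on_master(double *a, int n, double *eigval, double *eigvec)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* A threaded MKL may spawn OMP_NUM_THREADS openMP threads here. */
        MKL_INT m;
        MKL_INT *isuppz = malloc(2 * n * sizeof(MKL_INT));
        LAPACKE_dsyevr(LAPACK_ROW_MAJOR, 'V', 'A', 'U', n, a, n,
                       0.0, 0.0, 0, 0, 0.0, &m, eigval, eigvec, n, isuppz);
        free(isuppz);
    }

    /* The other tasks idle here; whether their cores are really free for
       the master's openMP threads depends on whether the MPI library
       spins or sleeps while it waits. */
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Bcast(eigval, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(eigvec, n * n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
}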

This essentially explains why only specific test cases fail. The question,
though, is why they fail at all. My understanding was that if
OMP_NUM_THREADS was unset it would default to 1 in the MKL code, and indeed
I think this is what happens if you link statically - it can equally well
default to the number of cpus you have, which would be bad (I think we
should probably update the AMBER code so that you specify the thread count
in the mdin file and it overrides any environment variable). However, Intel
has this terrible habit of continually changing the interfaces to their
compilers and MKL libraries, so every new version behaves in a different
way. I should probably read through the massive MKL manual for 10.1 at some
point, and I assume some explanation of how openMP threading is handled is
in there, but then come 10.2 it will all have changed again :-(. The
simplest approach for the moment is to force OMP_NUM_THREADS to 1. However,
I really think it is a bug in MKL, because if you compile statically there
is no problem, if you set OMP_NUM_THREADS to 1 there is no problem, and if
you set it to 2 there is no problem, but if it is unset and you linked
dynamically it crashes - and only in specific MKL routines. Hence it has to
be an issue within the MKL code doing the wrong thing when OMP_NUM_THREADS
is not set.
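
If we did add an mdin-style override, a minimal sketch of the idea
(hypothetical - this is not code that exists in AMBER, and the function
name is made up) would be: honour an explicit request first, and otherwise
pin the thread count to 1 whenever OMP_NUM_THREADS is unset, rather than
letting MKL pick its own default:

#include <stdlib.h>
#include <omp.h>
#include <mkl.h>   /* mkl_set_num_threads() */

/* requested > 0 would come from an input-file option; 0 means "not given" */
void pin_openmp_threads(int requested)
{
    if (requested > 0) {
        /* an explicit input-file setting overrides the environment */
        omp_set_num_threads(requested);
        mkl_set_num_threads(requested);
    } else if (getenv("OMP_NUM_THREADS") == NULL) {
        /* OMP_NUM_THREADS unset: force the safe single-threaded default
           instead of letting MKL decide for itself */
        omp_set_num_threads(1);
        mkl_set_num_threads(1);
    }
    /* otherwise respect whatever the user exported */
}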

> ---------------
> conjgrad()
> newton()
> nmode()
> --------------
>
> And I saw clearly that in the case of conjgrad() the application really
> used 8 CPUs.
> No problems appeared even when I turned on gb=1!

Yes, I believe the openMP in here is all hand coded in NAB and it doesn't
use MKL with its 'implicit' openMP, so you don't see the problem there. In
sander there is no openMP parallelization (only MPI), so the only openMP
threads that get created are spawned internally by MKL, and since the issue
is within MKL itself the problem is seen only in sander.
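
To illustrate the distinction (a sketch only - neither function is the
actual NAB or sander source, and both assume the code is built with the
compiler's openMP flag, e.g. -openmp with the Intel compilers): hand-coded
openMP is an explicit pragma written by the programmer, whereas MKL's
openMP is implicit, hidden inside the library call, which is exactly where
the unset-OMP_NUM_THREADS problem lives:

#include <math.h>
#include <mkl_vml.h>   /* vdExp() from MKL's vector math library */

/* NAB-style "hand coded" openMP: the threading is explicit and under the
   programmer's control, independent of which math library is linked in. */
void vexp_handcoded(int n, const double *x, double *y)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        y[i] = exp(x[i]);
}

/* sander-style "implicit" openMP: the call site looks serial, but a
   threaded MKL may fork its own openMP threads inside vdExp(). */
void vexp_mkl(int n, const double *x, double *y)
{
    vdExp(n, x, y);
}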

> Can you say very briefly why your solution works and prevents
> Amber/Sander from the error?

Okay, in very short - "Because Intel is for some strange reason not dealing
with the situation where OMP_NUM_THREADS is not set, so 'perhaps' setting it
to 1 forces it to skip the call to the non-existent routine in MKL."

My advice, though, would be that on any machine you use these days you
should ALWAYS hard wire OMP_NUM_THREADS to 1. That way you always know what
it is doing, since lots of libraries are starting to include openMP and it
can cause all sorts of problems if you are running MPI jobs and don't
expect it. Then, if you specifically want to run multiple openMP threads,
you manually set OMP_NUM_THREADS in the script that runs that job. The
default behavior when OMP_NUM_THREADS is unset is not consistent, hence the
problems.

> MLK func load error: /opt/intel/mkl/10.0.011/lib/em64t/libmkl_vml_mc.so:
> undefined symbol: vmlGetErrorCallBack

I will probably pass this along to some friends I have at Intel when I get a
chance and see if they can confirm the problem / comment on it.

> #2
>
> Of course I also tried the recommended static compilation, since it seems
> to me a slightly "cleaner" solution than #1. But unfortunately, after I
> successfully compiled AmberTools and Amber in serial with the -static
> flag, I finally got into trouble during static compilation of the
> PARALLEL version of Amber - please see the errors below:
>
>
> /opt/intel/impi/3.1/lib64/libmpiif.a(allgathervf.o): In function
> `MPI_ALLGATHERV':
> allgathervf.c:(.text+0x63): undefined reference to `PMPI_Allgatherv'
 
> of course I tried "make clean" before this compilation, but it simply
> doesn't work :((

This is because your MPI implementation was not built statically as well,
so only shared objects are available, and when the link step runs it cannot
find static versions of the MPI libraries. The solution is to recompile the
MPI and have it link statically. How to do this varies by MPI
implementation, but it is often an argument you give to configure. For
example, with mpich2 you set the environment variables LDFLAGS=-static and
CXXLDFLAGS=-static before you run the configure script. Then you should be
able to build AMBER in parallel statically and link it against the static
MPI.
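
For mpich2 that would look something along these lines (the install prefix
is just a placeholder, and the exact configure options vary between
versions):

  export LDFLAGS=-static
  export CXXLDFLAGS=-static
  ./configure --prefix=/opt/mpich2-static
  make
  make install

and then rebuild the parallel AMBER pointing it at that MPI installation.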
 
Sorry if this email was a bit long and rambling, but hopefully it explains
what is going on. I haven't fully characterized the problem myself yet (or
even worked out how to deal with the openMP within MKL correctly - i.e. how
to have it turned off in some places, such as the vexp calls, but on in
others, such as the dsyevr calls).

All the best
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER_at_ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber