AMBER Archive (2006)

Subject: RE: AMBER: problems for running sander.MPI

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Mon Oct 30 2006 - 12:40:06 CST


Hi Christophe,
 
I think you may have to turn off Mosix altogether if you want to run MPI
jobs on the cluster. The problem is that Mosix can migrate threads between
machines in your cluster, and it could well do this to MPI threads as well.
 
E.g. you start a 4-thread MPI job on nodes 3, 4, 7 and 8. This is fine, and the
MPI implementation knows where to send the messages for MPI communications.
It is then possible that Mosix migrates the thread on node 8 to node 9. When
one of the other threads next tries to communicate with node 8 it finds that
the thread is no longer there and the socket has been closed, so you get some
kind of MPI write error (errno 104 is ECONNRESET, i.e. the connection was
reset by the peer, which matches the writev failure you reported).
 
It may be possible to set up Mosix in such a way that it maintains affinity
for MPI threads. Perhaps there is an initial command you can put in front of
mpirun. However, I think such an approach is likely to be really unstable
and has the potential for all kinds of problems.
 
I would look to disable Mosix (at least on the nodes allocated to the MPI
job) before running the MPI job, and then re-enable it once the mpirun is
complete.
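 
For example, from the head node something along these lines might do it (a
rough sketch only: the init-script path, node names and the hosts.txt machine
file are just placeholders, and the exact commands depend on how openMosix is
installed on your cluster):
 
    # stop openMosix on the nodes allocated to the job so nothing can migrate
    # (needs root on the nodes)
    for node in node3 node4 node7 node8; do
        ssh $node '/etc/init.d/openmosix stop'
    done
 
    # run the parallel job as usual (the machine-file flag name varies
    # between MPI implementations)
    mpirun -np 4 -machinefile hosts.txt $AMBERHOME/exe/sander.MPI \
        -O -i md.in -o md.out -p prmtop -c inpcrd -r restrt
 
    # re-enable openMosix once the run has finished
    for node in node3 node4 node7 node8; do
        ssh $node '/etc/init.d/openmosix start'
    done
 
Alternatively, running 'mosctl stay' on each allocated node should block
automatic migration away from that node without stopping openMosix entirely,
but I haven't tested this.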
 
All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

 

  _____

From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu] On Behalf Of
Christophe Deprez
Sent: Monday, October 30, 2006 10:27
To: amber_at_scripps.edu
Subject: Re: AMBER: problems for running sander.MPI

Dear Ross,

Thanks for your suggestions.

I initially didn't mention that our Fedora Core nodes were running OpenMosix
(2.4.24), which we found out was certainly part of the problem! We
eventually switched from OpenMPI to MPICH2 (compiled with
--enable-threads=single) and stopped OpenMosix on our nodes. This has been
our most stable configuration so far.
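
For reference, the MPICH2 build was essentially of the following form (the
version and install prefix below are placeholders; adjust for your own setup):

    cd mpich2-<version>
    ./configure --prefix=/usr/local/mpich2 --enable-threads=single
    make
    make install

Amber's parallel executables were then rebuilt against this MPICH2 before
re-running the tests.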

I am now testing another configuration using the latest OpenMosix (2.4.26)
under CentOS 3.8, which looks fine too!

Christophe

Ross Walker wrote:

Dear Christophe,
 
> This is my first experience with openmpi. Which openmpi test suite are you
> referring to? Where is it documented?

I have never used OpenMPI myself either; I tend to use MPICH2. There should
be some kind of test suite distributed with the source code though. Check
the install docs. Typically you do something like: ./configure; make; make
test; make install
 
It is the 'make test' bit that you need to look up.
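 
For example, with Open MPI the build sequence is usually something like the
following (the test target name differs between MPI distributions, so check
the INSTALL or README file that ships with the source):
 
    ./configure --prefix=/usr/local/openmpi
    make all
    make check       # runs whatever tests ship with the source, if any
    make install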
 
> Unfortunately, the error is not always from the same node!
 
Hmmm, then it could be the switch, but it could also be an issue with the
OpenMPI installation. Try downloading MPICH2, try that out instead, and see
if it works.
 
You could also try building pmemd in $AMBERHOME/src/pmemd and then testing
this. If you see similar problems then it is definitely an issue with the
openmpi installation or the hardware.
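 
Building and testing pmemd would look roughly like this (the configure
arguments and test target here are illustrative only; they depend on your
Amber version, platform, compilers and MPI, so check the pmemd build notes
and the test Makefile):
 
    cd $AMBERHOME/src/pmemd
    ./configure linux_em64t ifort mpich2    # example arguments only
    make install
 
    # then exercise the parallel build with the bundled tests
    cd $AMBERHOME/test
    export DO_PARALLEL='mpirun -np 4'
    make test.pmemd                         # or the equivalent parallel test target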
 
All the best
Ross

/\
\/
|\oss Walker

  _____

From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu] On Behalf Of
Christophe Deprez
Sent: Thursday, October 12, 2006 06:55
To: amber_at_scripps.edu
Subject: Re: AMBER: problems for running sander.MPI

Ross Walker wrote:

> Hi Qizhi
>
> > [enode05:03662] mca_btl_tcp_frag_send: writev failed with errno=104
> >
> > (enode05 is one of the node names of the cluster.)
> >
> > Normally, there is no problem for minimization and constant NVT steps.
> > The problems often occur during constant NPT and production runs.

Hi Ross, and thanks for your reply.
I'm working as sysadmin with Qizhi to troubleshoot this issue.

> This looks like a hardware problem to me. Unfortunately a Google search
> sheds little light. E.g.:
>
> http://www.open-mpi.org/community/lists/users/2006/02/0684.php
>
> Have you seen this with any other codes? Can you run the openmpi test suite
> successfully?

This is my first experience with openmpi. Which openmpi test suite are you
referring to? Where is it documented?

> I would check to see if the error is always from the same node. If you
> unplug that node and use the remaining nodes, do you see the problem?

Unfortunately, the error is not always from the same node!

> I would also try compiling with g95 instead of gfortran. While it appears
> that gfortran is now mature enough to compile Amber, I don't know if it has
> been thoroughly tested. You will probably have to recompile openmpi with
> g95 as well.

I'll give this a try.

Thanks for your suggestions

-- 

Christophe Deprez christophe.deprez_at_bri.nrc.ca

----------------------------------------------------------------------
Institut de Recherche en Biotechnologies / Biotech. Research Institute
6100 Royalmount, Montréal (QC) H4P 2R2, Canada     Tel: (514) 496-6164
-----------------------------------------------------------------------

The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu