AMBER Archive (2002)

Subject: Re: Sander, Setup for Parallel SMP Linux Cluster

From: Tru (tru_at_pasteur.fr)
Date: Wed May 15 2002 - 14:40:45 CDT


On Wed, May 15, 2002 at 03:02:27PM -0400, Jianhui Wu wrote:
> Dear Amber Linux Cluster users,
>
<...>
>
> (1) Sander of amber7 was compiled using pgf77, machine file:
> Machine.pgf77_mpich (download from amber webpage), mpich-1.2.4 installed.
>
> (3) I define the DO_PARALLEL variable as follows.
>
> setenv DO_PARALLEL "$MPICH_HOME/bin/mpirun -np 2 -machinefile
> $MPICH_HOME/util/machines/machines.LINUX"
>
> (4) The files are shared by all nodes and I can rlogin to each node
> without problem.
>
> (5) Problems:
> If mpirun -np 2 or above, the sander job aborted with error message.
>
> For example, if I submit the job with mpirun -np 2 at apple.x.y.ca,
> after I define the machine file machines.LINUX as follow,
>
> "apple.x.y.ca" 2
> "cherry.x.y.ca" 2
> ......
>
>
> (a) I got the error message
> ****************************************************************************
> p0_20194: p4_error: Could not gethostbyname for host "apple.x.y.ca"; may
> be invalid name : 61
> **************************************************************************
>
Can you rlogin to apple.x.y.ca from cherry.x.y.ca, and the reverse?
The nodes may know each other by internal network names (node0...node15)
rather than by fully qualified names (apple.x.y.ca, ...).

Try with machines.LINUX containing only:
localhost
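
One way to test the name-resolution theory is to run a lookup for every
host name on every node. The sketch below is only an illustration in
Python, not part of AMBER or MPICH; the function name unresolved_hosts and
its resolver parameter are made up for this example.

```python
import socket

def unresolved_hosts(hosts, resolver=socket.gethostbyname):
    """Return the host names that the resolver cannot look up."""
    failed = []
    for host in hosts:
        try:
            resolver(host)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError
            failed.append(host)
    return failed
```

Calling unresolved_hosts(["apple.x.y.ca", "cherry.x.y.ca"]) on each node
lists the names that node cannot resolve; an empty list means the lookups
succeeded there.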
>
> (b) There is a file PI20114 exist after I submit the job. This file
> contain
> --------------------------------------------------
> apple.x.y.ca 0 /home/....../amber7/exe/sander
> "apple.x.y.ca" 1 /home/....../amber7/exe/sander
> -------------------------------------------------
Why do you have quotes for machine names?

What does your $MPICH_HOME/util/machines/machines.LINUX
look like?
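
To spot stray quotes or spaces in that file, a small checker can help.
This is a rough sketch in Python, assuming the machines file holds one
entry per line, either "host" or "host:ncpus", with "#" starting a
comment; that format is my assumption about what mpirun expects, and
parse_machines is a hypothetical helper, not an MPICH tool.

```python
def parse_machines(text):
    """Split machines-file text into (entries, problems).

    entries  -- list of (host, ncpus) for lines that look well formed
    problems -- list of (line_number, line) for lines that do not
    """
    entries, problems = [], []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Quotes or embedded whitespace are flagged rather than guessed at.
        if '"' in line or len(line.split()) != 1:
            problems.append((lineno, raw.strip()))
            continue
        host, sep, count = line.partition(":")
        if not host or (sep and not count.isdigit()):
            problems.append((lineno, raw.strip()))
            continue
        entries.append((host, int(count) if sep else 1))
    return entries, problems
```

Reading the file with open(path).read() and passing the text to
parse_machines returns the accepted entries and the suspicious lines, so
a line such as "apple.x.y.ca" 2 (with quotes) ends up in the second list.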

>
> (c) If I change the machine file into
>
> apple.x.y.ca:2
> cherry.x.y.ca:2
> .....
>
>
> I got message:
> **************************************
> Host key not found from the list of known hosts.
> Are you sure you want to continue connecting (yes/no)?
> ****************************************************

Are you using ssh or rsh?

> I also try to run lamboot at node1-3, define -np 2 and
> run sander again. Similar problem.

lamboot is for LAM/MPI (another implementation
of MPI), not for MPICH!
Do not mix them unless you know what you are doing.

> It seems I don't even get the two processors in the
> same box to work for a single Sander job. As I am new to
> parallel computing, could someone give me some tips as to
> what should I do (install what libray, which software....)
> in order to run Sander job with multiple processors (I have
> 15 dual-processor nodes).
>
> Thanks a lot for your help,
>
> Jian Hui Wu
>
> Lady Davis Insitute

-- 
Dr Tru Huynh          | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:tru_at_pasteur.fr | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France