AMBER Archive (2009)

Subject: RE: [AMBER] reg.error in parallel run

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Fri Jan 23 2009 - 09:52:20 CST


Hi Balaji,

> I am trying parallel using LAM
>
> I have installed it in two nodes with different IP # .
> and they are having 2cores each
>
> I have created .rhosts file for remote login with out password
> when i give rsh with IP# it works
>
> when I give lamboot
> like
>
> lamboot -v nodes
> it gives the error
> --------------------------------------------------------------------------
> ------
> [balaji_at_ACTINIDE ~/mddna]$ lamboot -v nodes
>
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<12069> ssi:boot:base:linear: booting n0 (192.1.1.53)
> n-1<12069> ssi:boot:base:linear: booting n1 (192.1.1.54)
> ERROR: LAM/MPI unexpectedly received the following on stderr:
> connect to address 192.1.1.54 port 544: Connection refused
> connect to address 192.1.1.54 port 544: Connection refused
> trying normal rsh (/usr/bin/rsh)

I don't use LAM, but I am guessing that it is probably trying to use ssh by
default - at least if it isn't, you should probably configure it to use ssh.
Create a passphrase-less ssh key and add its public half to the
authorized_keys file on every node so that every node can ssh to every other
node without a password. I would also check the configuration of LAM to make
sure it uses ssh.
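
As a rough sketch of the key setup described above (untested on your cluster):
the key path is an assumption, the node addresses are taken from your lamboot
output, and the helper function names are made up for illustration. LAMRSH is
the environment variable LAM's rsh boot module reads to pick its remote shell
agent.

```shell
#!/bin/sh
# Sketch only: KEY and NODES are illustrative assumptions, adjust to your site.
KEY="${KEY:-$HOME/.ssh/id_rsa}"
NODES="${NODES:-192.1.1.53 192.1.1.54}"

# Create a key with an empty passphrase unless one already exists.
make_key() {
    [ -f "$KEY" ] || ssh-keygen -q -t rsa -N "" -f "$KEY"
}

# Append the public key to authorized_keys on each node
# (this asks for the account password once per node).
push_key() {
    for node in $NODES; do
        ssh-copy-id -i "$KEY.pub" "$node"
    done
}

# Verify non-interactive login; BatchMode makes ssh fail instead of prompting.
check_nodes() {
    for node in $NODES; do
        ssh -o BatchMode=yes "$node" true ||
            echo "cannot reach $node without a password"
    done
}

# Ask LAM to use ssh (without X forwarding) as its remote shell agent.
export LAMRSH="ssh -x"
```

After sourcing this you would run make_key, push_key and check_nodes in that
order, and only then retry "lamboot -v nodes". If check_nodes prints nothing,
passwordless login is working from this node.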

Also make sure you don't have anything like a Linux firewall turned on. Most
distributions install this sort of thing by default these days and it breaks
just about anything decent you want to do. You can stop it with
"service iptables stop", although you should actually check the configuration
and make sure it is properly off by default on all nodes.
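
Something along these lines should do it, assuming Red Hat-style nodes that
have the service and chkconfig tools (other distributions manage this
differently). The function names are just for illustration, and everything
here needs root on each node.

```shell
#!/bin/sh
# Sketch for Red Hat-style nodes; run as root on every node.

# Report whether iptables currently has rules loaded.
firewall_status() {
    service iptables status
}

# Stop the firewall now and keep it from starting at boot.
firewall_off() {
    service iptables stop
    chkconfig iptables off
}

# Show the runlevels in which iptables would start (all should read "off").
firewall_boot_config() {
    chkconfig --list iptables
}
```

Running firewall_off followed by firewall_boot_config on each node covers both
the running firewall and the "off by default" part.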

> II) when i give
> lamboot -v -ssi boot_rsh_ignore_stderr nodes
>
> it works
>
> lamboot -v -ssi boot_rsh_ignore_stderr nodes
>
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<12113> ssi:boot:base:linear: booting n0 (localhost)
> n-1<12113> ssi:boot:base:linear: finished
>
> --------------------------------------------------------------------------
> -----------------
>
> but when i give the run from the directory
> by
> mpirun -np 2 $AMBERHOME/amber9/exe/sander -O -i polyAT_wat_min1.in -o
> polyAT_wat_min1.out -p polyAT_wat.prmtop -c polyAT_wat.inpcrd -r
> polyAT_wat_min1.rst -ref polyAT_wat.inpcrd
>
> its not picking up the second machine
> but in the first machine it takes the two processor

Again, I haven't tried this, but my reading is that you told LAM to ignore
any errors, which is exactly what it did. The errors are still there when it
tries to run lamboot on the remote node, only this time instead of quitting
it just gave up on that node and left the daemon running on the host node.
Hence when you run mpirun only the host node is running the LAM daemon, so it
runs all the MPI processes on that node.
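
One way to confirm this before calling mpirun is to compare the number of
entries in your boot schema with what lamnodes reports. This is only a sketch:
the function names are made up, and it assumes each line of lamnodes output
starts with a node id such as n0 or n1.

```shell
#!/bin/sh
# Count non-blank, non-comment lines in a LAM boot schema (hostfile).
expected_nodes() {
    grep -v '^[[:space:]]*#' "$1" | grep -c '[^[:space:]]'
}

# Count nodes in the running LAM universe.
# Assumption: each lamnodes line begins with n<number>.
booted_nodes() {
    lamnodes 2>/dev/null | grep -c '^n[0-9]'
}

# Compare the two counts and complain if any node failed to boot.
check_universe() {
    want=$(expected_nodes "$1")
    have=$(booted_nodes)
    if [ "$have" -eq "$want" ]; then
        echo "all $want nodes booted"
    else
        echo "only $have of $want nodes booted" >&2
        return 1
    fi
}
```

With your two-machine file, "check_universe nodes" should report 2 nodes after
a successful lamboot. If it reports 1, mpirun has only the local machine to
work with, which would match what you are seeing.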

Good luck,
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not
be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER_at_ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber