AMBER Archive (2002)

Subject: RE: sander mpirun hangs with 4 CPU, but not with 2

From: lamon_at_lav.Boehringer-Ingelheim.com
Date: Tue Jul 30 2002 - 12:06:05 CDT


        This sounds similar to my problem (see previous post: "parallel jobs
die with no error message from sander"). Do you have a copy of Amber6
available? I have found that everything runs fine with Amber6/sander. It
seems to me there must have been a change in the parallelization of
Amber7/sander that causes the program to stop unexpectedly. Several people
have reported similar errors.

        Lynn

Dr. Lynn Amon
Research Scientist
Boehringer-Ingelheim (Canada) Ltd.
2100 Cunard Street
Laval (Quebec) Canada H7S 2G5
(450) 682-4640
lamon_at_lav.boehringer-ingelheim.com

> -----Original Message-----
> From: Thomas Steinbrecher
> [SMTP:thomas.steinbrecher_at_physchem.uni-freiburg.de]
> Sent: Tuesday, July 30, 2002 11:09 AM
> To: amber_at_heimdal.compchem.ucsf.edu
> Subject: sander mpirun hangs with 4 CPU, but not with 2
>
> Dear AMBER users,
>
> I have compiled AMBER 7 and MPICH 1.2.4 on a Suse-Linux
> Cluster (1 single CPU master and 2 dual SMP nodes so far)
> connected with a 100Mbit switch.
> The /home /amber and /mpich directories are present on the
> master only and NFS-mounted to the nodes. I can rsh (and
> run rsh true) to and from all of my nodes.
>
> After minor problems (see my previous posts :-) I got
> Sander running. However, my system shows strange
> behaviour when I increase the number of processors the
> calculation runs on:
>
> My test case is the DNA_invacuo tutorial from the
> AMBER-homepage, a short Sander MD-Run.
>
> It takes about 120 sec. when run on the master alone or on
> one of the nodes alone (using mpirun -np 1 -nolocal)
>
> It takes about 70 sec when run on both CPUs of one of the
> dual nodes (using mpirun -np 2 -nolocal)
>
> And it takes about 110 sec. when run on two CPUs from
> different computers.
>
> But when I try to use 4 CPUs, the calculation takes about
> 100 sec when started with the -nolocal option, and it
> never finishes at all when run without -nolocal.
>
> When I stop the calculation, the error message:
>
> rm_l_?_???: (???.?????) net_send: could not write to fd=5
> errno: 104
>
> appears.
>
> The ? characters stand for varying numbers.
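[For reference: errno 104 on Linux is ECONNRESET ("Connection reset by peer"), which usually means the process at the other end of the socket died and the kernel reset the connection; that would be consistent with one sander process crashing and the p4 listener losing its peer. A quick way to look up an errno value, assuming a Python interpreter is available on the cluster:]

```shell
# Print the symbolic name and message for errno 104.
# (Any errno lookup tool would do; Python is just a convenient one.)
python3 -c 'import errno, os; print(errno.errorcode[104], "-", os.strerror(104))'
```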
>
> My machines.LINUX file contains:
> master
> node1:2
> node2:2
>
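[As far as I understand, MPICH's p4 device fills process slots in the order the machinefile lists them: without -nolocal a 4-process run would place ranks on master, node1 (x2), and node2, so the single-CPU master participates; with -nolocal only the dual nodes are used. A small illustrative sketch (not part of MPICH) that counts the slots declared in a machines.LINUX-style file:]

```shell
# Count the total CPU slots declared in a machines.LINUX-style file
# (one "hostname" or "hostname:ncpus" entry per line). Illustrative only.
cat > machines.LINUX <<'EOF'
master
node1:2
node2:2
EOF
total=0
while IFS= read -r entry; do
  cpus=${entry#*:}                    # text after the colon, if any
  [ "$cpus" = "$entry" ] && cpus=1    # no colon -> one CPU
  total=$((total + cpus))
done < machines.LINUX
echo "total slots: $total"
```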
> and the PI files created by mpirun look as expected.
>
> The mpi cpi test program runs fine with 1-5 processors.
>
> I have tried to increase $P4_GLOBMEMSIZE to 10000000 as
> mentioned in older posts, but nothing changed.
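[One thing worth double-checking: P4_GLOBMEMSIZE is read from the environment when the processes start, so it has to be exported in the shell that invokes mpirun; if I remember correctly, slave processes started remotely via rsh get a fresh login environment, so the variable may also need to be set in the remote shell's rc file. A minimal sketch, using the value from the older posts:]

```shell
# Export the p4 shared-memory segment size before launching mpirun.
# The value is the one mentioned in the older posts; illustrative only.
export P4_GLOBMEMSIZE=10000000
echo "P4_GLOBMEMSIZE=$P4_GLOBMEMSIZE"
# mpirun -np 4 sander ...   (the actual launch line depends on your setup)
```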
>
> I suspect something is wrong with my network (because the
> run on two CPUs in separate machines takes so much longer
> than on a dual machine), but I have no clue what the error
> message means.
>
> Have any of you experienced similar problems, or do you
> have hints on what I should look for?
>
> Sorry for the long posting; I'm never sure what
> information might be important and what not.
>
> Kind Regards,
>
> Thomas