AMBER Archive (2002)
Subject: parallel jobs die with no error message from sander
From: lamon_at_lav.Boehringer-Ingelheim.com
Date: Mon Jul 29 2002 - 12:48:20 CDT
> I am having the same problem with Amber7/sander as Mark Hsieh (see below)
> in which my jobs stop at random points for no obvious reason.  My error
> output for a job running on four processors looks like the following:
>
> net_recv failed for fd = 8
> p0_25797:  p4_error: net_recv read, errno = : 104
> bm_list_25798:  p4_error: interrupt SIGINT: 2
> rm_l_3_31636:  p4_error: interrupt SIGINT: 2
> p1_19028:  p4_error: interrupt SIGINT: 2
> Broken pipe
> rm_l_2_7717:  p4_error: interrupt SIGINT: 2
> p3_31635:  p4_error: interrupt SIGINT: 2
> Broken pipe
> rm_l_1_19029:  p4_error: interrupt SIGINT: 2
> Broken pipe
> /software/mpich-1.2.1/bin/mpirun: line 1: 25797 Broken pipe
> /software/amber7/exe_lnx_pll/sander "-i" "md.in" "-o" "md2.out" "-p"
> "md.parm" "-c" "md1.rst" "-r" "md2.rst" "-x" "md2.crd" "-O" -p4pg PI25665
> -p4wd .
>
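[Archive note: the bare errno values in logs like the above can be decoded with Python's standard errno module; on Linux, 104 is ECONNRESET ("Connection reset by peer") and 110 is ETIMEDOUT ("Connection timed out"), which is why these reports point at a dropped network connection between MPI processes.]

```python
import errno
import os

# Decode the errno values that appear in the p4_error messages above.
# On Linux, 104 = ECONNRESET and 110 = ETIMEDOUT; other platforms may
# assign different numbers, so we look them up rather than hard-code.
for code in (104, 110):
    name = errno.errorcode.get(code, "unknown")
    print(code, name, os.strerror(code))
```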
 
> Here are my observations:
>
> 1.  My jobs end unexpectedly with no error message from sander.
> 2.  The problem is intermittent.  If I run it one time, it might die after
> 2 ps and another time it will die after 50 ps.
> 	I have some systems in which it does not happen at all.
> 3.  Output from sander stops several minutes before the job exits.
> 4.  There are no system error messages indicating a hardware failure.
> 5.  It only happens with jobs running on more than one processor.
> 6.  I have tried it on two different Linux clusters (compiled with g77)
> running two different versions of mpich (1.2.1 and 1.2.4).
> 7.  A job run with the same input parameters and restart file but using
> Amber6/sander will run with no problems.
>
> Any suggestions?
>
> Thanks,
> Lynn
>
> Dr. Lynn Amon
> Research Scientist
> Boehringer-Ingelheim (Canada) Ltd.
> 2100 Cunard Street
> Laval (Quebec) Canada H7S 2G5
> (450) 682-4640
> lamon_at_lav.boehringer-ingelheim.com
>
 
> From: "Mark Hsieh" <hsieh_at_abmaxis.com>
> Subject: mpirun/sander problem
> Date: Mon, 6 May 2002 13:00:32 -0700
> Message-ID: <MPEKLFFCDNOPEMCNBJCGMEPFCAAA.hsieh_at_abmaxis.com>
>
 
> Hi,
>
> For some reason, my mpirun/sander molecular dynamics simulations are
> stopping at random time points with the following errors:
>
> p0_4155: (11784.142298) net_recv failed for fd = 6
> p0_4155: p4_error: net_recv read, errno = : 110
> p3_2900: (11783.828019) net_recv failed for fd = 6
> /disk1/app/mpich-1.2.3/bin/mpirun: line 1: 4155 Broken pipe
> /disk1/app/amber7/exe/sander "-O" "-i" "md7.in" "-o" "md7.out" "-p" "prmtop"
> "-c" "min7.rst" "-r" "md7.rst" "-x" "md7.mdcrd" "-ref" "min7.rst" "-inf"
> "md7.mdinfo" -p4pg pgfile4 -p4wd /disk1/hsieh.tmp/amber.runs/
> P4 procgroup file is pgfile4.
> p3_2900: p4_error: net_recv read, errno = : 104
>
> pgfile4 calls up two dual PIII workstations:
>
> tiger 0 /disk1/app/amber7/exe/sander
> tiger 1 /disk1/app/amber7/exe/sander
> cow2 1 /disk1/app/amber7/exe/sander
> cow2 1 /disk1/app/amber7/exe/sander
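
[Archive note: for readers unfamiliar with the MPICH p4 device, each procgroup line has the form `hostname nprocs path-to-executable`, and the first line names the local host, whose count is the number of *additional* processes beyond the one mpirun itself starts. Below is an annotated copy of the file quoted above; the comments are added here for illustration only and would not appear in a real procgroup file.]

```
# MPICH 1.2.x p4 procgroup format: <hostname> <nprocs> <full path>
tiger 0 /disk1/app/amber7/exe/sander    # local host; 0 extra processes
tiger 1 /disk1/app/amber7/exe/sander    # one more process on tiger
cow2  1 /disk1/app/amber7/exe/sander    # one process on cow2
cow2  1 /disk1/app/amber7/exe/sander    # a second process on cow2
```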
 
>
> Three identical runs produced mdinfo files that indicated 4.1, 19.1 and 7.1
> ps as their last update.
>
> Thank you,
> Mark
 
 
  