AMBER Archive (2002)

Subject: parallel jobs die with no error message from sander

From: lamon_at_lav.Boehringer-Ingelheim.com
Date: Mon Jul 29 2002 - 12:48:20 CDT


> I am having the same problem with Amber7/sander as Mark Hsieh (see below)
> in which my jobs stop at random points for no obvious reason. My error
> output for a job running on four processors looks like the following:
>
> net_recv failed for fd = 8
> p0_25797: p4_error: net_recv read, errno = : 104
> bm_list_25798: p4_error: interrupt SIGINT: 2
> rm_l_3_31636: p4_error: interrupt SIGINT: 2
> p1_19028: p4_error: interrupt SIGINT: 2
> Broken pipe
> rm_l_2_7717: p4_error: interrupt SIGINT: 2
> p3_31635: p4_error: interrupt SIGINT: 2
> Broken pipe
> rm_l_1_19029: p4_error: interrupt SIGINT: 2
> Broken pipe
> /software/mpich-1.2.1/bin/mpirun: line 1: 25797 Broken pipe
> /software/amber7/exe_lnx_pll/sander "-i" "md.in" "-o" "md2.out" "-p"
> "md.parm" "-c" "md1.rst" "-r" "md2.rst" "-x" "md2.crd" "-O" -p4pg PI25665
> -p4wd .
>
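For readers comparing the two reports, the numbers after "errno = :" in the net_recv lines can be decoded by hand. The sketch below is only an illustration, not part of sander or MPICH. It assumes the values are ordinary Linux errno codes, where 104 is ECONNRESET and 110 is ETIMEDOUT. The table is hardcoded so the result does not depend on the machine running the script. The names `LINUX_ERRNO`, `P4_ERROR` and `decode_p4_errors` are hypothetical.

```python
import re

# Linux errno numbering (assumption: the logs come from Linux nodes).
# Hardcoded so the lookup does not depend on the local platform.
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches lines such as "p0_25797: p4_error: net_recv read, errno = : 104",
# with or without a leading "> " quote marker.
P4_ERROR = re.compile(
    r"^(?:>\s*)?(\S+): p4_error: net_recv read, errno = : (\d+)"
)


def decode_p4_errors(log_text):
    """Return (process_tag, errno, name, description) per net_recv p4_error line."""
    results = []
    for line in log_text.splitlines():
        match = P4_ERROR.match(line.strip())
        if match:
            tag, code = match.group(1), int(match.group(2))
            name, desc = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
            results.append((tag, code, name, desc))
    return results
```

Both codes describe a TCP connection between ranks being lost (reset by the peer or timed out). That fits the observation that only multi-processor jobs fail, but it does not by itself identify the cause.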
>
> Here are my observations:
>
> 1. My jobs end unexpectedly with no error message from sander.
> 2. The problem is intermittent. If I run it one time, it might die after
> 2 ps and another time it will die after 50 ps.
> I have some systems in which it does not happen at all.
> 3. Output from sander stops several minutes before the job exits.
> 4. There are no system error messages indicating a hardware failure.
> 5. It only happens with jobs running on more than one processor.
> 6. I have tried it on two different Linux clusters (compiled with g77)
> running two different versions of mpich (1.2.1 and 1.2.4).
> 7. A job run with the same input parameters and restart file but using
> Amber6/sander will run with no problems.
>
> Any suggestions?
>
> Thanks,
> Lynn
>
>
>
> Dr. Lynn Amon
> Research Scientist
> Boehringer-Ingelheim (Canada) Ltd.
> 2100 Cunard Street
> Laval (Quebec) Canada H7S 2G5
> (450) 682-4640
> lamon_at_lav.boehringer-ingelheim.com
>
>
>
>
> From: "Mark Hsieh" <hsieh_at_abmaxis.com>
> Subject: mpirun/sander problem
> Date: Mon, 6 May 2002 13:00:32 -0700
> Message-ID: <MPEKLFFCDNOPEMCNBJCGMEPFCAAA.hsieh_at_abmaxis.com>
>
> Hi,
> For some reason, my mpirun/sander molecular dynamics simulations are
> stopping at random time points with the following errors:
> p0_4155: (11784.142298) net_recv failed for fd = 6
> p0_4155: p4_error: net_recv read, errno = : 110
> p3_2900: (11783.828019) net_recv failed for fd = 6
> /disk1/app/mpich-1.2.3/bin/mpirun: line 1: 4155 Broken pipe
> /disk1/app/amber7/exe/sander "-O" "-i" "md7.in" "-o" "md7.out" "-p" "prmtop"
> "-c" "min7.rst" "-r" "md7.rst" "-x" "md7.mdcrd" "-ref" "min7.rst" "-inf"
> "md7.mdinfo" -p4pg pgfile4 -p4wd /disk1/hsieh.tmp/amber.runs/
> P4 procgroup file is pgfile4.
> p3_2900: p4_error: net_recv read, errno = : 104
> pgfile4 calls up two dual PIII workstations:
> tiger 0 /disk1/app/amber7/exe/sander
> tiger 1 /disk1/app/amber7/exe/sander
> cow2 1 /disk1/app/amber7/exe/sander
> cow2 1 /disk1/app/amber7/exe/sander
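As a quick sanity check on the procgroup file above, the following sketch counts how many processes it requests per host. It assumes the usual ch_p4 convention: each line is "host count program", and the count on the first line excludes the master process that mpirun has already started there. That convention is not stated in the post, so treat it as an assumption. `PGFILE4` and `count_p4_processes` are hypothetical names.

```python
# Contents of pgfile4 as quoted in the post, without the "> " markers.
PGFILE4 = """\
tiger 0 /disk1/app/amber7/exe/sander
tiger 1 /disk1/app/amber7/exe/sander
cow2 1 /disk1/app/amber7/exe/sander
cow2 1 /disk1/app/amber7/exe/sander
"""


def count_p4_processes(pgfile_text):
    """Return {host: process_count} for a p4 procgroup file.

    Assumption: the first entry's count excludes the master process,
    so one extra process is added for that line.
    """
    hosts = {}
    first_entry = True
    for raw in pgfile_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        host, count = line.split()[:2]
        nprocs = int(count)
        if first_entry:
            nprocs += 1  # account for the already-running master process
            first_entry = False
        hosts[host] = hosts.get(host, 0) + nprocs
    return hosts
```

Under that assumption the file requests two processes on each of the two dual-CPU machines, four in total, which matches the p0 to p3 tags in the error output.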
> Three identical runs produced mdinfo files that indicated 4.1, 19.1 and 7.1
> ps as their last update.
> Thank you,
> Mark
>
>