AMBER Archive (2002)

Subject: Re: Error running AMBER6 on Beowulf cluster
From: Carlos Simmerling (carlos_at_ilion.bio.sunysb.edu)
Date: Fri Nov 22 2002 - 17:33:49 CST
 
 
 
 
We've successfully used the PGI compiler with mpich
but have not seen this type of behavior. We have not used
Myrinet, but we have used Giganet cLAN and it also worked.
Did your Myrinet setup come with any kind of test
programs? I recommend getting those to work before
running Amber. If they aren't handy, even a bare MPI rank
check will tell you whether the basic mpich setup is sane
(see the sketch below).
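A minimal sketch of such a rank check (assumes your mpich build
includes the Fortran bindings):

      program ranktest
c     each MPI task reports its rank; on a healthy setup every
c     task prints a distinct number
      implicit none
      include 'mpif.h'
      integer ierr, myrank, nprocs
      call mpi_init(ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, myrank, ierr)
      call mpi_comm_size(MPI_COMM_WORLD, nprocs, ierr)
      write(6,*) 'task ', myrank, ' of ', nprocs
      call mpi_finalize(ierr)
      end

Compile with mpif77 and launch with your mpirun.ch_gm script on two
nodes; if more than one task claims rank 0, you have the same problem
at the MPI level, independent of Amber.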
 Carlos
 
 ===================================================================
Carlos L. Simmerling, Ph.D.
 Assistant Professor           Phone: (631) 632-1336
 Center for Structural Biology Fax:   (631) 632-1555
 Stony Brook University        Web:   http://comp.chem.sunysb.edu/carlos
 Stony Brook, NY 11794-5115    E-mail: carlos.simmerling_at_stonybrook.edu
 ===================================================================
 
 ----- Original Message -----
From: <arubin_at_unmc.edu>
 To: <amber_at_heimdal.compchem.ucsf.edu>
 Sent: Friday, November 22, 2002 4:38 PM
 Subject: Re: Error running AMBER6 on Beowulf cluster
 
 >
> Dear Dr. Case,
 >     Thanks a lot for your recommendations. This is really our first
>   parallel run on this Beowulf cluster (RedHat, Myrinet, PG compiler). A
>   calculation using two processors stops abnormally with a similar error
>   message (see below). The calculation using one processor runs
 >   successfully.
 >
 >   ********************************************************************
 >   # message ? sander.07732
 >   Warning: no access to tty (Bad file descriptor).
 >   Thus no job control in this shell.
 >   Name "main::arch" used only once: possible typo at
 >   /home/usr/mpich.pgi/bin/mpirun.ch_gm.pl line 26.
 >   |  Atom division among processors:
 >   |         0    5610   11220
 >   |  Atom division among processors for gb:
 >   |         0    5610   11220
 >   |  Running AMBER/MPI version on    2 nodes
 >
 >
 >        Sum of charges from parm topology file =   0.00000000
 >        Forcing neutrality...
 >    ---------------------------------------------------
 >    APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
 >    using   5000.0 points per unit in tabled values
 >    TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
 >   | CHECK switch(x): max rel err =   0.3242E-14   at   2.436720
 >   | CHECK d/dx switch(x): max rel err =   0.8064E-11   at   2.761360
 >    ---------------------------------------------------
 >        Total number of mask terms =      12578
 >        Total number of mask terms =      25156
 >   |    Total Ewald setup time =   0.14000000
 >
>   ------------------------------------------------------------------------------
 >
 >   Unit    7 Error on OPEN:
 >      Unit    7 Error on OPEN:
 >   [1] MPI Abort by user Aborting program !
 >   [1] Aborting program!
 >   done
 >   ***************************************************************
 >
 >   Thanks a lot.
 >   Sincerely yours,
 >
 >
 > Alexander Rubinshtein, Ph.D.
 > UNMC Eppley Cancer Center
 > Molecular Modeling Core Facility
 > _________________________________
 > University of Nebraska Medical Center
 > 986805 Nebraska Medical Center
 > Omaha, Nebraska 68198-6805
 > USA
 > Office:  (402) 559-5319
 > Fax:      (402) 559-4651
 > E-mail: arubin_at_unmc.edu
 > WWW: http://www.unmc.edu/Eppley
 >
 >
 >
> From: "David A. Case" <case_at_scripps.edu>
> To: arubin_at_unmc.edu
> cc: amber_at_heimdal.compchem.ucsf.edu
> Subject: Re: Error running AMBER6 on Beowulf cluster
> Date: 11/21/2002 07:31 PM
> Please respond to: amber
>
 > On Thu, Nov 21, 2002, arubin_at_unmc.edu wrote:
 > >
> >     We ran into a problem with an MD simulation using AMBER6 on the
> >   Beowulf cluster (RedHat, Myrinet, PG compiler). To run an MPI job on 8
> >   processors we used the "mpirun.ch_gm" script. The calculation stops
> >   abnormally. Could you help us find out what is going on? Does anyone
> >   have an idea? I am attaching the output file and error message (see
> >   below).
 > >   ********************************************************************
 > >    ---------------------------------------------------
 > >    APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
 > >    using   5000.0 points per unit in tabled values
 > >    TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
 > >    ---------------------------------------------------
 > >    APPROXIMATING switch and d/dx switch using CUBIC SPLINE INTERPOLATION
 > >    using   5000.0 points per unit in tabled values
 > >    TESTING RELATIVE ERROR over r ranging from 0.0 to cutoff
 > >    ---------------------------------------------------
 >
 > etc.
 >
 > I don't understand why all of the processors are printing out messages
 > like this...it should only happen for processor number 0.  Somehow, all
> of your jobs think they are processor 0, but I don't understand
> why.
 >
 > >   Unit    7 Error on OPEN:
 >
> This is the real error.  Unit 7 is used for the "mdinfo" file.  I'm
> guessing that several processors are all trying to write to this file
> at once, but again, I don't know why...
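>
> (In principle only the master task should ever touch unit 7; a sketch
> of the kind of guard that would prevent the race, assuming sander's
> "master" flag, not the actual sander source:)
>
> c     only node 0 (the master) should open the mdinfo file; if every
> c     task believes it is the master, they all race to open unit 7
>       if (master) then
>          open(unit=7, file='mdinfo', status='unknown')
>       end if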
 >
 >
 > Obvious questions:
 >
 > 1. Can you run a short job on one processor?  Can you run a short job
 > on two processors?
 >
> 2. Is this your first parallel run on this hardware/OS/software
> configuration?  That is, do other jobs work and this particular one
> fails, or do all parallel Amber jobs fail, etc.?
 >
> 3. You will probably have to go into sander.f and, someplace after
> mpi_init(), print out "mytaskid" for each processor; also the value of the
> "master" variable (which should be true on node 0 and false everywhere
> else).  Then maybe later on, say inside runmd(), print the same info.
> Maybe something in memory is being clobbered.
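>
> Something along these lines right after the MPI setup (a rough,
> untested sketch; the actual declarations and placement in sander.f
> may differ):
>
>       call mpi_init(ierr)
>       call mpi_comm_rank(MPI_COMM_WORLD, mytaskid, ierr)
>       master = (mytaskid .eq. 0)
> c     --- debug: each task reports who it thinks it is ---
>       write(6,*) 'task ', mytaskid, ' master = ', master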
 >
 > ..good luck...dac
 >
 > --
 >
 > ==================================================================
 > David A. Case                     |  e-mail:      case_at_scripps.edu
 > Dept. of Molecular Biology, TPC15 |  fax:          +1-858-784-8896
 > The Scripps Research Institute    |  phone:        +1-858-784-9768
 > 10550 N. Torrey Pines Rd.         |  home page:
 > La Jolla CA 92037  USA            |    http://www.scripps.edu/case
 > ==================================================================
 >