AMBER Archive (2009)

Subject: RE: [AMBER] Script for parallel runs 2

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Mon May 04 2009 - 10:39:27 CDT


Hi Catherine,

This will be a difficult one to debug since such problems are often a
function of the way the cluster is set up and not necessarily of sander. It
is likely getting stuck at some MPI barrier... Are you sure your MPI
environment works properly? Have you run the test cases in parallel on
2 CPUs?

Is it an inter-node problem only? I.e., if you run 2 processes on the same
node, does it work?
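For example (just a sketch; the exact launcher flags differ between MPI
implementations, and node01/node02 below are placeholder host names), you can
check that plain MPI start-up works before involving sander at all:

# 2 processes on a single node
mpirun -np 2 hostname

# 2 processes spread over two nodes listed in a host file
echo node01 > hosts.txt
echo node02 >> hosts.txt
mpirun -np 2 -machinefile hosts.txt hostname

If the single-node case works but the two-node case hangs, the problem is
most likely in the MPI / interconnect configuration rather than in sander.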

I would start with the test cases and make sure these all pass.
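Something like this runs the parallel tests on 2 CPUs (a sketch only;
DO_PARALLEL is the environment variable the AMBER test Makefiles use to
launch parallel jobs, but the exact make target depends on your AMBER
version):

cd $AMBERHOME/test
# tell the AMBER test Makefiles how to launch each parallel test
export DO_PARALLEL="mpirun -np 2"
# target name varies by AMBER version; check the Makefile in $AMBERHOME/test
make test.parallel

If those pass on 2 CPUs, try 4 and 8 and see where it first hangs.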

All the best
Ross

> -----Original Message-----
> From: amber-bounces_at_ambermd.org [mailto:amber-bounces_at_ambermd.org] On
> Behalf Of Catein Catherine
> Sent: Monday, May 04, 2009 2:42 AM
> To: amber_at_ambermd.org
> Subject: [AMBER] Script for parallel runs 2
>
>
> Dear Sir/Madam,
>
>
> I tried to specify the number of processors used (in powers of 2,
> i.e. 2, 4, 8) with the script (pasted below under ****).
>
>
>
> When I used only one core, "#PBS -l nodes=1:ppn=1", in the following
> script, the jobs finished without error. However, when I requested 2, 4,
> or 8 cores, i.e. "#PBS -l nodes=1:ppn=2" (or ppn=4 or ppn=8), the jobs
> stopped at the following section:
>
>
>
> =============================================================================================
> >
> > Ewald parameters:
> > verbose = 0, ew_type = 0, nbflag = 1, use_pme = 1
> > vdwmeth = 1, eedmeth = 1, netfrc = 1
> > Box X = 53.071 Box Y = 53.071 Box Z = 53.071
> > Alpha = 109.471 Beta = 109.471 Gamma = 109.471
> > NFFT1 = 54 NFFT2 = 54 NFFT3 = 54
> > Cutoff= 10.000 Tol =0.100E-04
> > Ewald Coefficient = 0.27511
> > Interpolation order = 4
> >
> > LOADING THE CONSTRAINED ATOMS AS GROUPS
> >
> >
> > 5. REFERENCE ATOM COORDINATES
> >
> >
> > ----- READING GROUP 1; TITLE:
> > Keep DNA fixed with weak restraints
> >
> > GROUP 1 HAS HARMONIC CONSTRAINTS 10.00000
> > GRP 1 RES 1 TO 20
> > Number of atoms in this group = 638
> > ----- END OF GROUP READ -----
> >
> > --------------------------------------------------------------------------------
> > 3. ATOMIC COORDINATES AND VELOCITIES
> > --------------------------------------------------------------------------------
> >
> >
> > begin time read from input coords = 0.000 ps
> >
> > Number of triangulated 3-point waters found: 2968
> >
> =============================================================================================
>
>
> This is the script that I used. Please kindly advise how we should
> modify it so that the calculation runs correctly across the parallel
> nodes of the supercomputer system.
>
>
>
> ********************************************************************************************
> >
> > #!/bin/sh
> > ### Job name
> > #PBS -N test-amber
> > ### Declare job non-rerunable
> > #PBS -r n
> >
> > ### Queue name (qprod or qdev)
> >
> > ### qprod is the queue for running production jobs.
> > ### 22 nodes can run jobs in this queue.
> > ### Each job in this queue can use 1-8 nodes.
> > ### Parallel jobs will be favoured by the system.
> >
> > ### qdev is the queue for program testing.
> > ### 2 nodes can run jobs in this queue.
> > ### Each job in this queue can use 1 node.
> >
> > #####PBS -q qprod
> > #PBS -q parallel
> >
> > ### Wall time required. This example requests 4 hours 10 minutes.
> > #PBS -l walltime=04:10:00
> >
> > ### Number of nodes
> >
> > ### The following means 1 node and 1 core.
> > ### Clearly, this is for a serial job.
> > ###PBS -l nodes=1:ppn=1
> >
> > ### The following means 1 node is required with 2 processors per node,
> > ### i.e. a total of 2 CPUs will be allocated.
> > ### ppn (processors per node) can be 1, 2, 4, or 8.
> > #PBS -l nodes=1:ppn=2
> >
> > ### Another example
> > ### 2 nodes required. Processor per node=8, total 16 cores
> > ### need to be allocated.
> > ###PBS -l nodes=2:ppn=8
> >
> >
> > # The following stuff will be executed on the first allocated node.
> > # Please don't modify it.
> >
> > echo $PBS_JOBID : `wc -l < $PBS_NODEFILE` CPUs allocated: `cat $PBS_NODEFILE`
> > cd $PBS_O_WORKDIR
> > # Define number of processors
> > NPROCS=`wc -l < $PBS_NODEFILE`
> >
> > echo ===========================================================
> > echo "Job Start Time is `date "+%Y/%m/%d -- %H:%M:%S"`"
> >
> >
> > # Run the parallel MPI executable "a.out"
> > #time mpirun -np $NPROCS ./a.out -machinefile $PBS_NODEFILE > ${PBS_JOBNAME}.`echo ${PBS_JOBID} | sed "s/.supercomputer.u//" `
> >
> > PGMSTR="/share1/amber8/exe/sander -O -i polyAT_wat_md1.in -o polyAT_wat_md1.out -p polyAT_wat.prmtop -c polyAT_wat_min2.rst -r polyAT_wat_md1.rst -x polyAT_wat_md1.mdcrd -ref polyAT_wat_min2.rst"
> >
> > echo ${PGMSTR}
> > mpirun -np $NPROCS -machinefile $PBS_NODEFILE \
> > ${PGMSTR} > ${PBS_JOBNAME}.`echo ${PBS_JOBID} | sed "s/.supercomputer.u//" `
> >
> >
> > echo "Job Finish Time is `date "+%Y/%m/%d -- %H:%M:%S"`"
> >
> >
> *********************************************************************************************
> >
> >
> >
> > Best regards and many thanks,
> >

_______________________________________________
AMBER mailing list
AMBER_at_ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber