AMBER Archive (2008)

Subject: RE: AMBER: amber 9 on Intel Harpertown

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Wed Aug 13 2008 - 09:24:47 CDT


Hi Geoff,

 

I have not encountered such problems before. I assume this is PMEMD you are
testing. This looks like a problem with Open MPI and/or your InfiniBand
fabric, assuming of course that the problem you are running is large enough
to run on 256 CPUs.

 

Note, however, that Open MPI's performance is pretty awful, especially on >64
CPUs. Having bought a nice interconnect, you should probably use a decent MPI
implementation such as MVAPICH or, even better, Intel MPI, on which I have
recently seen extremely good performance (against OFED 1.3, with some tweaks
to the environment variables). I am hoping to get some benchmarks up (for
PMEMD 10) on the AMBER website shortly.

 

All the best

Ross

 

From: owner-amber_at_scripps.edu [mailto:owner-amber_at_scripps.edu] On Behalf Of
Geoff Wood
Sent: Wednesday, August 13, 2008 2:46 AM
To: amber_at_scripps.edu
Subject: AMBER: amber 9 on Intel Harpertown

 

Dear Reflector,

 

We are currently testing amber 9 on a new machine. We are having problems
with the MPI communications, and I was wondering if there are any known
compatibility issues with the machine and the way amber is compiled before
we start looking at hardware and driver issues. Any comments or help would
be much appreciated.

 

The basic specs of the machine are as follows:

 

128 compute nodes, each with two quad-core Intel Harpertown 3.0 GHz
processors, for a total of 1024 cores;

Voltaire 20 Gbit/s InfiniBand fabric used both to share files through GPFS
and to run MPI jobs.

 

11:07:15 cal2 root - /root > rpmg kernel
kernel-smp-2.6.16.46-0.12
kernel-ib-devel-1.3-2.6.16.46_0.12_smp.volt2986
kernel-smp-2.6.16.54-0.2.5
kernel-ib-1.3-2.6.16.46_0.12_smp.volt2986
kernel-source-2.6.16.46-0.12
kernel-source-2.6.16.54-0.2.5

 

We have successfully compiled amber 9 using openmpi/1.2.6_gcc-4.1.2 and the
Intel Fortran and C++ compilers. We ran the tests without problems; however,
when scaling jobs to use 128-256 CPUs we encounter MPI problems. The error
is the following:

 

The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:
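
The parameter list is cut off in the message as posted. As a rough
illustration only, the sketch below assumes the two parameters are the
openib BTL settings btl_openib_ib_retry_count and btl_openib_ib_timeout
(names not given in the text above), and that the per-attempt local ACK
timeout follows the usual InfiniBand rule of 4.096 microseconds times 2 to
the power of the timeout value. The helper names and the example values are
hypothetical; it shows how long each retry attempt waits and how the
settings could be passed to mpirun.

```python
from typing import List

# Sketch under stated assumptions: the MCA parameter names below are assumed,
# not quoted from the truncated error message, and the values are examples.


def ib_ack_timeout_seconds(timeout_exponent: int) -> float:
    """Per-attempt local ACK timeout: 4.096 microseconds * 2**exponent."""
    # The field is 5 bits wide; 0 means "wait forever", so it is excluded here.
    if not 1 <= timeout_exponent <= 31:
        raise ValueError("timeout exponent must be in 1..31")
    return 4.096e-6 * (2 ** timeout_exponent)


def mpirun_mca_args(retry_count: int, timeout_exponent: int) -> List[str]:
    """Build --mca arguments for mpirun (assumed openib BTL parameter names)."""
    # The InfiniBand retry count is a 3-bit value, so 7 is the maximum.
    if not 0 <= retry_count <= 7:
        raise ValueError("retry count must be in 0..7")
    ib_ack_timeout_seconds(timeout_exponent)  # reuse the range check
    return [
        "--mca", "btl_openib_ib_retry_count", str(retry_count),
        "--mca", "btl_openib_ib_timeout", str(timeout_exponent),
    ]


if __name__ == "__main__":
    for exponent in (10, 14, 20):
        wait_ms = ib_ack_timeout_seconds(exponent) * 1e3
        print(f"timeout={exponent}: about {wait_ms:.3f} ms per attempt")
    print("mpirun " + " ".join(mpirun_mca_args(7, 14)) + " ...")
```

Raising the timeout only gives a slow or congested link more time to
acknowledge; it would not repair a faulty cable, port, or host, which the
error text itself points to as the more typical cause.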

 

 

 

Thanks in advance.

 

----------------------------------------------------------------------------
Dr Geoffrey Wood
Ecole Polytechnique Fédérale de Lausanne
SB - ISIC - LCBC
BCH 4108
CH - 1015 Lausanne
tel: +41 21 693 03 23
e-mail: geoffrey.wood_at_epfl.ch
http://lcbcpc21.epfl.ch/Group_members/geoff/
----------------------------------------------------------------------------

 

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" (in the *body* of the email)
      to majordomo_at_scripps.edu