AMBER Archive (2005)

Subject: Re: AMBER: parallel AMBER/pmemd installation problem on Opteron

From: Robert Duke (rduke_at_email.unc.edu)
Date: Thu Feb 24 2005 - 22:06:54 CST


Lars -
Sounds like a nice chunk of hardware, and I will be interested in hearing
how things come along with both reliability and performance. Yes, there are
some problems with mpich not cleaning up properly; I do not know whether
this has improved in 1.2.6 or mpich2 (I have not had a chance to test them
yet, but it is on my rather long list). I have seen papers suggesting that
InfiniBand is similar to Myrinet or Quadrics in performance, or perhaps a
bit better; I don't know a lot about it, but I presume it comes in several
speeds and is getting faster. I have no idea what the communications
reliability issues are. I am going to try to find out as much as I can
about the PGI situation, but I have long held the opinion that there are
better compilers out there; I am going to test the PathScale compiler on
Opterons for pmemd and then move back to looking at the PGI issues.
The current release (8) of pmemd will crank out something like 4.3-4.7
nsec/day for the 91K-atom constant-pressure factor ix benchmark on the IBM
p690+ Regatta with 1.7 GHz Power4+ processors, using between 128 and 160
processors, with numbers in that range for the latest SGI Altix machines
(64 procs) and for a machine that I can't yet quote numbers on (you can
guess - around 128 procs). I would expect lower performance from
interconnects with Myrinet-class speeds, but perhaps your interconnect is
better. The extent to which a machine is a true SMP also makes a big
difference. The Altix is so bloody fast because of the speed of the Itanium
(a VLIW CPU) as well as the high degree of SMP connectedness; I don't know
whether they have done a better job on cache coherency or what. I would
like to see that machine pushed out to 128 procs but have not arranged it
yet.

I also have what I am currently calling a pmemd 8.1 performance release
that I hope to get out as an executable to selected supercomputer centers
in the next month or so. This is a performance update, done in preparation
for some more expensive functionality we plan to add in 9 (the "we" being
Tom Darden, Lee Pedersen, and myself, as part of the Amber team). With this
release I have gotten very close to 7 nsec/day on factor ix on the p690+
Regatta, with comparable numbers on one other platform (not yet tested on
the Altix). If you are interested, it sounds like your machine may be large
enough to take advantage of the 8.1 performance enhancements, and I could
work with you to make it available in a bit.

In the meantime, while we are all working through the PGI issues, it may
also make sense to look at PathScale if it will run on your platform
(perhaps try a 30-day evaluation release). By all accounts it is
significantly faster; I intend to get some numbers over the next week.
Regards - Bob
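
To put throughput numbers like the 4.3-4.7 nsec/day quoted above in
concrete terms, the small C sketch below converts an ns/day figure into
steps per day and wall-clock time per MD step. The 1.5 fs timestep is
purely an assumed value for illustration, not something taken from the
factor ix benchmark input.

/* Rough conversion between MD throughput (ns/day) and per-step wall time.
 * The timestep value is an assumption for illustration only. */
#include <stdio.h>

int main(void)
{
    double ns_per_day  = 4.5;   /* e.g. pmemd 8, factor ix, ~128-160 procs */
    double timestep_fs = 1.5;   /* assumed MD timestep in femtoseconds */

    double steps_per_day = ns_per_day * 1.0e6 / timestep_fs;  /* 1 ns = 1e6 fs */
    double steps_per_sec = steps_per_day / 86400.0;
    double ms_per_step   = 1000.0 / steps_per_sec;

    printf("%.1f ns/day at %.1f fs/step -> %.2e steps/day, %.1f ms per step\n",
           ns_per_day, timestep_fs, steps_per_day, ms_per_step);
    return 0;
}

At 4.5 nsec/day with a 1.5 fs step, that works out to roughly 3 million
steps per day, or a little under 30 ms of wall-clock time per step across
the whole machine.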

----- Original Message -----
From: "Lars Packschies" <packschies_at_rrz.uni-koeln.de>
To: <amber_at_scripps.edu>
Sent: Thursday, February 24, 2005 6:58 PM
Subject: Re: AMBER: parallel AMBER/pmemd installation problem on Opteron

>
>
> --On Wednesday, February 23, 2005 12:27:13 -0500 Robert Duke
> <rduke_at_email.unc.edu> wrote:
>
>> Lars-
>> If anything, I would say pmemd is more stable, or at least as stable, on
>> platforms it has been tested on. However, the pgi compiler is a bit of
>> an unknown quantity; apparently the pgi c compiler is known to be
>> problematic. When you are seeing hangs, that is most likely associated
>> with the network layer (mpi) somehow, and the only reason that pmemd may
>> give more trouble in this area than sander is that it uses nonblocking
>> primitives a lot more, and drives the net connection harder. I would be
>> interested to hear exactly what you are observing, and on what exact
>> setup (hw + sw), and it is possible that my work on the cray machines
>> will shed some light on the problems, as it is a similar setup (Opterons,
>> mpich, PGI compilers).
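
A side note on the nonblocking primitives mentioned above: the fragment
below is a generic C/MPI sketch of the kind of MPI_Isend/MPI_Irecv exchange
that a code relying heavily on nonblocking communication performs. It is an
illustration only, not actual pmemd or sander source.

/* Generic nonblocking exchange with left/right neighbors (illustration only).
 * Posting Isend/Irecv and overlapping local work before Waitall keeps more
 * traffic in flight than a blocking Sendrecv, which stresses the interconnect
 * (and any flaky MPI layer) harder. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv)
{
    int rank, nprocs, i;
    double sendbuf[N], recvbuf[N];
    MPI_Request reqs[2];
    MPI_Status stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    for (i = 0; i < N; i++) sendbuf[i] = (double) rank;

    /* Post the receive and the send without blocking. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... local computation could overlap with communication here ... */

    MPI_Waitall(2, reqs, stats);

    if (rank == 0) printf("nonblocking exchange done on %d ranks\n", nprocs);
    MPI_Finalize();
    return 0;
}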
>
> Dear Bob,
>
> thanks for your reply. Please take this as preliminary information,
> especially regarding the compiler issues; I'm still trying some things.
>
> Let me start with the hardware and compilers. We use a Sun (v20z) system
> with 128 dual-CPU nodes: 2.2 GHz Opterons, 4 GB of memory per node. The
> IB is from Voltaire, and all latency and bandwidth measurements tested OK.
> There are virtually no stability issues - it just works fine, except for
> "cleaning up" problems when jobs crash or hang, which seems to be more or
> less normal with mpich...?
>
> There are other issues but these are not related to Amber/Pmemd.
>
> We use Rocks (www.rocksclusters.org) version 3.3.0 as the OS, with the IB
> driver package "ibhost-hpc-2.2.0_10-4rhas3.k2.4.21_20" and kernel
> 2.4.21-20.ELsmp. On the compiler side we have the PGI CDK 5.2-2. As far
> as I know Rocks is moving to the 2.6 kernel soon. Furthermore, Voltaire
> has just finished a new version of the ibhost package, which I am going
> to try in a few days (on a small partition of the cluster).
>
> I compiled Amber following the instructions I found here:
> <http://www.pgroup.com/resources/amber/amber8_pgi52.htm>
>
> I was able to compile Pmemd with your hotfix (ierr to ierr_nmr, in two places).
>
> If you wish, I can provide you with Amber 8 parallel benchmark results (hb,
> factor ix, and jac) on up to 128 processors. It would be interesting to see
> whether you and others observe comparable scaling behavior.
>
> I have to test Pmemd some more and try to isolate more specific sources of
> errors. Up to now it looks too diffuse.
>
>> One other thought. Early on last year, I
>> attempted to run on an opteron workstation and had serious problems with
>> heating; this would cause hangs of the entire system (i.e., the machine
>> locks up), and the problem was worse with pmemd because it would drive
>> the Opteron FP unit about 50% harder. Any chance your Opterons have
>> cooling problems? (On a dual P4 with thermoregulated fans, I can hear
>> the fans rev up as a pmemd run starts - it sounds like a jet taxiing
>> out.)
>
> Up to now we have not had any problems with overheating. For example, two
> weeks ago the cluster ran for 100 hours at full throttle without getting
> too hot.
>
> Sincerely,
>
> Lars
>
> --
> Dr. Lars Packschies, Computing Center, University of Cologne

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu