AMBER Archive (2005)
Subject: AMBER: Amber Performance in Parallel on Itanium
From: Robert J. Woods (rwoods_at_ccrc.uga.edu)
Hi Folks,
The Itanium cluster is running RHEL3 Update 4 with Scali for management. The MPI traffic goes out over Myrinet, and we use a 10/100 Mb LAN for management and NFS. We are using the Intel compilers to build Amber, but we are not using the Intel math libraries, or any others for that matter.
The shared Amber8 directory is NFS-mounted, as is the user's working directory. We are seeing relatively poor scaling (3.2-fold speed-up on 8 CPUs). For comparison, on an essentially equivalent setup on our Xeon cluster we see reasonable scaling (6.0-fold on 8 CPUs).
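To put those two numbers side by side: if you assume Amdahl's law applies (a simplification, since our problem looks like I/O contention rather than serial code), you can back out the effective serial/overhead fraction each speedup implies. A minimal Python sketch, using only the speedups and CPU count quoted above:

    def parallel_fraction(speedup, n_cpus):
        # Invert Amdahl's law, S = 1 / ((1 - p) + p / n), for p:
        #   p = n * (S - 1) / (S * (n - 1))
        return n_cpus * (speedup - 1.0) / (speedup * (n_cpus - 1.0))

    for label, speedup in (("Itanium", 3.2), ("Xeon", 6.0)):
        p = parallel_fraction(speedup, 8)
        print("%s: %.1fx on 8 CPUs -> ~%.0f%% parallel, ~%.0f%% serial/overhead"
              % (label, speedup, 100.0 * p, 100.0 * (1.0 - p)))

This gives roughly 21% serial/overhead on the Itanium cluster versus roughly 5% on the Xeons, which is why the scaling difference looks so large at 8 CPUs.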
On the Itanium cluster, what we do see is that when we start an n-way parallel job, n-1 of the processors are pegged at ~100% utilization, while the remaining processor starts very high, then falls to about 50% and stays there. We have run Ethereal on the head node to watch packets. As the code starts up we see, as expected, lots of NFS queries from all of the nodes. Then, as that one processor falls to ~50% use, we see heavy NFS traffic between the head node and the node hosting the low-performing processor. Once that CPU drops to 50%, you can look at the 100 Mb switch and see enormous amounts of traffic between the head node and that node.
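For anyone who wants to watch for the same pattern, here is a minimal per-CPU polling sketch in Python. It is illustrative only: it assumes the third-party psutil package and an arbitrary 30-point lag threshold, neither of which is the tooling we actually used (we simply watched the utilization numbers and ran Ethereal on the head node).

    import psutil

    SAMPLES = 30        # poll for roughly 30 seconds
    LAG_THRESHOLD = 30  # flag CPUs this many points below the busiest one

    for _ in range(SAMPLES):
        # Per-CPU utilization over a one-second window.
        loads = psutil.cpu_percent(interval=1.0, percpu=True)
        peak = max(loads)
        laggards = [i for i, load in enumerate(loads)
                    if peak - load > LAG_THRESHOLD]
        print(" ".join("cpu%d:%5.1f%%" % (i, load)
                       for i, load in enumerate(loads)))
        if laggards:
            print("  lagging CPUs:", laggards)

On our cluster this would show n-1 CPUs near 100% and one CPU flagged as lagging around 50%, matching what the switch traffic suggests.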
This behavior is not present on the Xeon system, where all CPUs appear to run at about 100%.
Could this problem simply be due to our use of NFS as a way to share the required files?
Rob Woods
--