AMBER Archive (2009)
Subject: Re: [AMBER] Error in PMEMD run
From: Marek Malý (maly_at_sci.ujep.cz)
Date: Fri May 08 2009 - 14:10:41 CDT
My test system is a generation-4 PPI dendrimer in explicit water,
about 60,000 atoms in total.
Here are the input files for testing:
I know it is not a big system, but I think it is OK for a benchmark on
16-32 CPUs, or am I wrong?
For testing I used just 1000 steps from the equilibration phase (NPT
simulation, see equil_DEN_PPIp_D.in).
Regarding the disk question:
Each node has its own local hard drive (SATA, 250 GB), so I run my jobs from
a node listed in the relevant .mpd.hosts file.
For example, if I want to run my job on 2 nodes (say 11 and 12),
I go to the local disk of node 11 and run the job from there.
These local disks are not shared yet.
Regarding MPI, we are using Intel MPI (currently version 3.2.0.011).
Here are my configure commands for building parallel Amber/PMEMD:
./configure_amber -intelmpi ifort (Parallel Amber)
./configure linux_em64t ifort intelmpi (PMEMD)
We have 14 nodes in total; each node = 2 x Intel Xeon quad-core 5365
(3.00 GHz) = 8 single CPUs.
The nodes are connected using "Cisco InfiniBand".
So that's all I can say about my test system and our cluster.
Thanks for your time!
On Fri, 08 May 2009 20:24:35 +0200 Robert Duke <rduke_at_email.unc.edu> wrote:
> Yes, Ross makes points I was planning on making next. We need to know
> your benchmark. You should be running something like JAC or, better
> yet, Factor IX from the benchmark suite. Then you should convert your
> times to nsec/day and compare them to some of the published values at
> www.ambermd.org to get a sense of just how well or badly you are
> doing. Once you have a reasonable benchmark (not too small,
> balanced i/o, not asking for extra features that are known not to scale,
> etc etc), then we can look for other problems. Given a GOOD infiniband
> setup (high bandwidth, configured correctly, balance between pci express
> and the infiniband hca's, well-scaled infiniband switch layout, no noise
> from loose cables, etc etc etc), then the next likely source of grief is
> the disk. Are you all perhaps using an nfs-mounted volume, and even
> worse, one volume, not a parallel file system, being written to by
> multiple running jobs? Bad idea. Parallel jobs will hang like crazy
> waiting for the master to do disk i/o. Is mpi really set up correctly?
> The only way you know is if the setup has passed other benchmarks (I
> typically tell by comparison of pmemd on the candidate system to other
> systems, but believe me, mpi can really be screwed up pretty easily).
> Which mpi? OpenMPI is known to be bad with infiniband (I don't know if
> it is actually "good" with anything). Intel mpi is supposed to be
> good, but I have never tried to jump through all the configuration
> hoops. MVAPICH is pretty standard; once again, though, because I don't
> admin a system of this type, I have no idea how hard it is to get
> everything right. I am really sorry you are having so much "fun" with
> all this; I know it must be frustrating, but there is a reason bigger
> clusters get run by staff. By the way, how big is the cluster?
> Best Regards - Bob
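The conversion to nsec/day that Bob asks for can be sketched in a few lines of Python. The function name and the example values (a 2 fs timestep and a 100 s wall-clock time) are assumptions for illustration, not numbers from this thread; only the 1000-step run length comes from Marek's test.

```python
def ns_per_day(n_steps: int, dt_fs: float, wall_seconds: float) -> float:
    """Simulated nanoseconds per day of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    simulated_ns = n_steps * dt_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns * 86400.0 / wall_seconds  # 86400 s per day


if __name__ == "__main__":
    # Hypothetical input: 1000 steps at an assumed 2 fs timestep,
    # finishing in an assumed 100 s of wall-clock time.
    print(f"{ns_per_day(1000, 2.0, 100.0):.3f} ns/day")
```

Substituting the real timestep from the input file and the wall-clock time reported at the end of the run gives a figure that can be compared directly with the published benchmark values.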
> ----- Original Message ----- From: "Ross Walker" <ross_at_rosswalker.co.uk>
> To: "'AMBER Mailing List'" <amber_at_ambermd.org>
> Sent: Friday, May 08, 2009 2:11 PM
> Subject: RE: [AMBER] Error in PMEMD run
> Hi Marek,
> I don't think I've seen anywhere what the actual simulation you are
> running is. This will have a huge effect on parallel scalability. With infiniband
> and a 'reasonable' system size you should easily be able to get beyond 2
> nodes. Here are some numbers for the JAC NVE benchmark from the suite
> provided on http://ambermd.org/amber10.bench1.html
> This is for NCSA Abe, which is dual quad-core Clovertown (E5345, 2.33 GHz,
> very similar to your setup) and uses SDR infiniband.
> Using all 8 processors per node (time for benchmark in seconds):
> 8 ppn 8 cpu 364.09
> 8 ppn 16 cpu 202.65
> 8 ppn 24 cpu 155.12
> 8 ppn 32 cpu 123.63
> 8 ppn 64 cpu 111.82
> 8 ppn 96 cpu 91.87
> Using 4 processors per node (2 per socket):
> 4 ppn 8 cpu 317.07
> 4 ppn 16 cpu 178.95
> 4 ppn 24 cpu 134.10
> 4 ppn 32 cpu 105.25
> 4 ppn 64 cpu 83.28
> 4 ppn 96 cpu 67.73
> As you can see it is still scaling to 96 cpus (24 nodes at 4 threads per
> node). So I think you must either be running a system that is unreasonably
> small to expect scaling in parallel, or there is something very wrong with
> the setup of your computer.
> All the best
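One way to read Ross's table is to turn the times into speedup and parallel efficiency relative to the 8-cpu run. The sketch below does that in Python; the times are copied from the table above, while the function name and the choice of the 8-cpu run as baseline are assumptions made for illustration.

```python
# Wall-clock times in seconds, copied from Ross's table (JAC NVE on NCSA Abe).
# Outer key: processes per node (ppn); inner key: total cpus.
TIMES = {
    8: {8: 364.09, 16: 202.65, 24: 155.12, 32: 123.63, 64: 111.82, 96: 91.87},
    4: {8: 317.07, 16: 178.95, 24: 134.10, 32: 105.25, 64: 83.28, 96: 67.73},
}


def scaling(times, base_cpus=8):
    """Return {cpus: (speedup, efficiency)} relative to the base_cpus run."""
    base_time = times[base_cpus]
    result = {}
    for cpus, seconds in sorted(times.items()):
        speedup = base_time / seconds
        # Ideal speedup is cpus / base_cpus; efficiency is the fraction achieved.
        efficiency = speedup / (cpus / base_cpus)
        result[cpus] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    for ppn, times in TIMES.items():
        print(f"{ppn} ppn")
        for cpus, (speedup, efficiency) in scaling(times).items():
            print(f"  {cpus:3d} cpu  speedup {speedup:5.2f}  efficiency {efficiency:6.1%}")
```

On these numbers the 96-cpu runs are roughly 4 to 4.7 times faster than the 8-cpu runs, so the benchmark keeps getting faster but well short of the ideal factor of 12.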
>> -----Original Message-----
>> From: amber-bounces_at_ambermd.org [mailto:amber-bounces_at_ambermd.org] On
>> Behalf Of Marek Malý
>> Sent: Friday, May 08, 2009 10:58 AM
>> To: AMBER Mailing List
>> Subject: Re: [AMBER] Error in PMEMD run
>> Hi Gustavo,
>> thanks for your suggestion, but we have only 14 nodes in our cluster (each
>> node = 2 x Xeon quad-core 5365 (3.00 GHz) = 8 single CPUs per node,
>> connected with "Cisco InfiniBand").
>> If I allocate 8 nodes and use just 2 CPUs per node for one of my jobs, it
>> means that 8 x 6 = 48 single CPUs will be wasted. In that
>> case I am sure that my colleagues will kill me :)) Moreover, I do not
>> expect that the 8-node/2-CPU combination will perform significantly better
>> than the 2-node/8-CPU one, at least in the case of PMEMD.
>> But anyway, thank you for your opinion/experience !
>> On Fri, 08 May 2009 19:28:35 +0200 Gustavo Seabra
>> <gustavo.seabra_at_gmail.com> wrote:
>> >> the best performance I have obtained in case of using combination of
>> >> nodes
>> >> and 4 CPUs (from 8) per node.
>> > I don't know exactly what you have in your system, but I gather you
>> > are using 8core-nodes, and from it you got the best performance by
>> > leaving 4 cores idle. Is that correct?
>> > In this case, I would suggest that you go a bit further, and also
>> > try using only 1 or 2 cores per node, i.e., leaving the remaining 6-7
>> > cores idle. So, for 16 MPI processes, try allocating 16 or 8 nodes.
>> > (I didn't see this case in your tests)
>> > AFAIK, the 8-core nodes are arranged as two 4-core sockets, and the
>> > communication between cores, which is already bad among the 4 cores of
>> > the same socket, gets even worse when you need to pass information
>> > between the two sockets. Depending on your system, if you send 2 processes
>> > to the same node, it may put both in the same socket or automatically
>> > split them, one per socket. You may also be able to tell it to make
>> > sure that they get split into 1 process per socket. (Look into the
>> > mpirun flags.) From the tests we've run on that kind of machine, we
>> > do get the best performance by leaving ALL BUT ONE core idle in each
>> > socket.
>> > Gustavo.
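The trade-off between Gustavo's suggestion and Marek's objection comes down to simple arithmetic, sketched here in Python. The node and core counts are taken from the thread; the function name and the set of processes-per-node values tried are assumptions for illustration.

```python
import math

CORES_PER_NODE = 8   # 2 x quad-core Xeon per node, as stated in the thread
CLUSTER_NODES = 14   # total nodes in the cluster, as stated in the thread


def layout(n_procs, ppn):
    """Return (nodes needed, idle cores, fits in cluster) for a given ppn."""
    nodes = math.ceil(n_procs / ppn)
    idle_cores = nodes * CORES_PER_NODE - n_procs
    return nodes, idle_cores, nodes <= CLUSTER_NODES


if __name__ == "__main__":
    # 16 MPI processes, placed 1, 2, 4 or 8 per node.
    for ppn in (1, 2, 4, 8):
        nodes, idle, fits = layout(16, ppn)
        note = "" if fits else "  (more nodes than the cluster has)"
        print(f"{ppn} ppn: {nodes:2d} nodes, {idle:3d} idle cores{note}")
```

For 16 processes at 2 per node this reproduces the 8 nodes and 48 idle cores mentioned in the reply, and it shows that 1 process per node would need 16 nodes, more than the 14 available.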
>> > _______________________________________________
>> > AMBER mailing list
>> > AMBER_at_ambermd.org
>> > http://lists.ambermd.org/mailman/listinfo/amber