AMBER Archive (2005)

Subject: Re: AMBER: PMEMD with big systems

From: Robert Duke (rduke_at_email.unc.edu)
Date: Fri Mar 18 2005 - 18:12:24 CST


Florian -
Okay, not to ask 'obvious' questions, but I presume you have 'limit
stacksize unlimited' in BOTH your .login and .cshrc?  For reasons that elude
me (think system 'features'), this is sometimes necessary.
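
One quick sanity check (just a throwaway diagnostic I am sketching here, not
anything that ships with amber) is to launch a tiny program exactly the way
you launch pmemd and have every process report the stack limit it actually
got:

/* stackcheck.c - print the stack limit a process actually sees.
   Build with any C compiler (cc stackcheck.c -o stackcheck) and start
   it the same way you start pmemd so it inherits the same environment
   on every node. */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("stack limit: unlimited\n");
    else
        printf("stack limit: %lu bytes\n", (unsigned long)rl.rlim_cur);
    return 0;
}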

Now, that may not be it at all; there are a couple of other possible issues.
There could be mixed shared libraries for ifort on some of the nodes; I doubt
this, but it is one installation issue, and I don't know what is done about
versioning these libraries.  Finally, a likely issue is how the myrinet s/w
is built.  I have seen instances where you can run on 2 nodes (shared memory)
on myrinet but segfault if you try more, which brings the interface into
action.

Now, I don't really know what the root of this evil is, but your observation
that you can run if you spread the job out enough is interesting.  I am
wondering if there are problems with the thread code in the myrinet software.
Here's the scenario: if the myrinet s/w was built with static linkage to the
threads libraries, then static threads code would be used, and for at least
some recent linux releases like redhat 3, those static libraries are known to
have problems with small thread stacks (think seg fault).  Why more problems
in pmemd than sander?  Well, pmemd uses asynch net i/o, which requires the
use of threads; sander doesn't.
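
To make the thread stack point concrete, here is the sort of thing I mean - a
generic pthreads fragment, not code out of pmemd or the gm library; the sizes
are just numbers I picked:

/* threadstack.c - report the default thread stack size, then create a
   worker with an explicitly enlarged stack.  A worker that puts a big
   buffer on a tiny default stack is exactly the kind of thing that
   segfaults.  Build with cc threadstack.c -o threadstack -lpthread. */
#include <stdio.h>
#include <pthread.h>

static void *worker(void *arg)
{
    char buf[512 * 1024];          /* big automatic buffer eats stack fast */
    buf[0] = 1;
    buf[sizeof(buf) - 1] = 1;
    return arg;
}

int main(void)
{
    pthread_attr_t attr;
    size_t stacksize = 0;
    pthread_t tid;

    pthread_attr_init(&attr);
    pthread_attr_getstacksize(&attr, &stacksize);
    printf("default thread stack: %lu bytes\n", (unsigned long)stacksize);

    /* ask for 8 MB outright instead of trusting the library default */
    pthread_attr_setstacksize(&attr, 8 * 1024 * 1024);
    pthread_create(&tid, &attr, worker, NULL);
    pthread_join(tid, NULL);
    pthread_attr_destroy(&attr);
    return 0;
}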

This is all guesswork, of course.  I would get whoever is in charge of
supporting your myrinet installation to test it with async i/o and a test
suite that sends around big chunks of data, and I would check into exactly
how it was built (use dynamic system libraries, not static - which implies
dynamic ifort libraries too).  Hope this helps, but I am just guessing.
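
By a test that "sends around big chunks of data" I mean something as simple
as the fragment below - plain non-blocking MPI, so it should build against
mpich-gm as-is; the message size and iteration count are just numbers I
picked:

/* bigsend.c - bounce big buffers between pairs of ranks with
   non-blocking sends/receives.  Build with mpicc and run it over gm on
   the same processor counts where pmemd dies. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    const int n = 1024 * 1024;          /* 1M doubles = 8 MB per message */
    int rank, size, iter, partner, i;
    double *sendbuf, *recvbuf;
    MPI_Request req[2];
    MPI_Status  stat[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    sendbuf = (double *) malloc(n * sizeof(double));
    recvbuf = (double *) malloc(n * sizeof(double));
    for (i = 0; i < n; i++)
        sendbuf[i] = (double) rank;

    partner = rank ^ 1;                 /* pair up even/odd ranks */
    for (iter = 0; iter < 100 && partner < size; iter++) {
        MPI_Irecv(recvbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(sendbuf, n, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, &req[1]);
        MPI_Waitall(2, req, stat);
    }

    if (rank == 0)
        printf("done: %d ranks, 100 exchanges of %lu bytes each\n",
               size, (unsigned long)(n * sizeof(double)));

    MPI_Finalize();
    free(sendbuf);
    free(recvbuf);
    return 0;
}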

I saw this stuff working on myrinet/opterons/pmemd 8 a couple of weeks ago:
factor_ix and jac, between 2 and 64 procs (2, 4, 8, 16, 24, 32, 40, 48, 56,
64 for factor_ix and jac), several thousand steps, not a sign of a problem
(but with the pathscale compiler).  It also used to work fine here at UNC
before our myrinet h/w gave out (that was with earlier versions of the
fortran compiler).

If I can be of further help, please let me know, and if you all figure it
out, please let everyone know also.
Regards - Bob Duke

----- Original Message -----
From: "Florian Barth" <bio_hazard_at_gmx.de>
To: <amber_at_scripps.edu>
Sent: Friday, March 18, 2005 6:25 PM
Subject: AMBER: PMEMD with big systems

> Hi,
>
> I have some trouble running PMEMD (amber8) with big systems like hb or
> factor_ix. For parallel runs a certain minimum number of cpus is needed
> to run those systems, otherwise PMEMD will segfault. For example, PMEMD
> with factor_ix needs a minimum of 14 cpus to run (about 6500 atoms/cpu).
> PMEMD is running on a linux cluster (gentoo linux kernel 2.4.28), with
> dual athlon-mp nodes and myrinet interconnect. I used the intel ifort
> 8.1.024 compiler with the new_configure utility for compilation. Myrinet
> software is gm version 1.6.5 (compiled with gcc) and mpich-gm 1.2.6..14a
> (gcc/ifort).
> Serial PMEMD and serial/parallel sander are running without problems with
> factor_ix. Stack size is unlimited and shmem is set to 1 GB on all nodes.
> I was able to run factor_ix on 2 cpus with PMEMD during the installation
> phase of one of the nodes. But after some reboots the above limitation
> came up; unfortunately I have no idea what could have changed.
>
> Any hint would be greatly appreciated.
>
> Florian Barth
>
> ____________________________________
>
> Florian Barth
> Institute of Technical Biochemistry
> University of Stuttgart
> Allmandring 31
> 70569 Stuttgart
> Germany
> tel.:+49-711-6853811
> fax.:+49-711-6853196
> email:bio_hazard_at_gmx.de
> http://www.itb.uni-stuttgart.de

-----------------------------------------------------------------------
The AMBER Mail Reflector
To post, send mail to amber_at_scripps.edu
To unsubscribe, send "unsubscribe amber" to majordomo_at_scripps.edu