AMBER Archive (2006)

Subject: RE: AMBER: MPI parallel problem in AMBER 8.0

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Tue May 02 2006 - 16:42:33 CDT


Hi MK

These sorts of tracebacks are often not of much use; in your case, however,
this one does shed some light on your problem.

> [0] MPI Abort by user Aborting program !
> forrtl: error (76): IOT trap signal
> 0: __FINI_00_remove_gp_range [0x3ff81a21488]
> 1: __FINI_00_remove_gp_range [0x3ff81a2a910]
> 2: __FINI_00_remove_gp_range [0x3ff800d9cc0]
> 3: __FINI_00_remove_gp_range [0x3ff800ed7d4]
> 4: __FINI_00_remove_gp_range [0x3ff802206a0]
> 5: __FINI_00_remove_gp_range [0x3ff80140554]
> 6: __FINI_00_remove_gp_range [0x3ff801d2748]
> 7: __FINI_00_remove_gp_range [0x3ffbffa08d0]
> 8: __FINI_00_remove_gp_range [0x3ffbff9c9c8]
> 9: __FINI_00_remove_gp_range [0x3ffbff9ca08]
> 10: mexit_ [_mexit.f: 355, 0x12011ad10]

Frame 10 here is the important one. It is a call to the routine mexit(),
which is sander's legitimate exit routine. That call in turn produced frames
0 to 9, which are some funky stuff from your MPI implementation. I would
just ignore them. The thing to note is that the abort was "by user", i.e.
not a segfault: the code itself actually quit. In this case sander quit by
calling mexit.

If we look further back along the trace:

> 11: mdread2_ [_mdread.f: 3690, 0x120095848]
> 12: sander_ [_sander.f: 2497, 0x1200504e4]
> 13: multisander_ [_sander.f: 885, 0x12004f240]
> 14: main [for_main.c: 203, 0x12013338c]
> 15: __start [0x12001b038]

We see that the routine that called mexit was mdread2. This routine is
responsible for reading in the namelist and other control variables. It
evidently found an error in them and aborted. The error message is probably
printed somewhere in your output, either in the mdout file or in the stderr
or stdout file generated by your batch scheduler. Alternatively, depending
on how your system is set up, it may simply have been lost. Losing the error
message like this is a common occurrence with codes running in parallel...
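
The same reading of the trace can be done mechanically. The following is
only a rough Python sketch, not part of AMBER: the helper names
(parse_frames, caller_of) are made up for illustration, and it assumes each
frame looks like the ones quoted above, i.e. "index: routine [file: line,
address]" or "index: routine [address]".

```python
import re

# One traceback frame, e.g. "10: mexit_ [_mexit.f: 355, 0x12011ad10]".
# The "file: line," part is optional, e.g. "15: __start [0x12001b038]".
# A leading "> " quote marker from the email is tolerated.
FRAME_RE = re.compile(
    r"^\s*>?\s*(\d+):\s+(\S+)\s+\[(?:(\S+):\s*(\d+),\s*)?(0x[0-9a-fA-F]+)\]"
)


def parse_frames(text):
    """Return a list of (index, routine, source_file, line_number) tuples."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.match(line)
        if m:
            idx, routine, src, lineno, _addr = m.groups()
            frames.append(
                (int(idx), routine, src, int(lineno) if lineno else None)
            )
    return frames


def caller_of(frames, routine):
    """Return the frame listed directly after `routine`, i.e. its caller."""
    for i, frame in enumerate(frames):
        if frame[1] == routine and i + 1 < len(frames):
            return frames[i + 1]
    return None


# A few frames copied from the traceback quoted in this message.
TRACE = """\
> [0] MPI Abort by user Aborting program !
> 9: __FINI_00_remove_gp_range [0x3ffbff9ca08]
> 10: mexit_ [_mexit.f: 355, 0x12011ad10]
> 11: mdread2_ [_mdread.f: 3690, 0x120095848]
> 12: sander_ [_sander.f: 2497, 0x1200504e4]
"""

frames = parse_frames(TRACE)
caller = caller_of(frames, "mexit_")
print(caller)  # the frame for mdread2_, the routine that called mexit_
```

This only tells you which routine decided to quit. The actual reason still
has to come from the message that routine printed.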

Either way, the problem is most likely an error in your input, especially
if the tests work but your run doesn't. Have you successfully tested this
system with a serial run? You should always do this first to check things
are okay before running in parallel. E.g. run it for 10 steps or so in
serial and see if it works okay. Error messages are generally much clearer
when running in serial.
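
One way to set up such a short serial test is to copy your input file and
cut the number of steps down to 10. The Python sketch below does this by
rewriting the nstlim setting. The function name shorten_run and the example
input are made up for illustration and are not your actual input; it
assumes nstlim is set explicitly in the file as "nstlim=<integer>".

```python
import re

# Matches "nstlim=500000" or "nstlim = 500000", case-insensitively.
NSTLIM_RE = re.compile(r"(\bnstlim\s*=\s*)\d+", re.IGNORECASE)


def shorten_run(mdin_text, steps=10):
    """Return a copy of the input text with nstlim set to `steps`."""
    new_text, count = NSTLIM_RE.subn(
        lambda m: f"{m.group(1)}{steps}", mdin_text
    )
    if count == 0:
        # Assumption: nstlim appears explicitly; otherwise add it by hand.
        raise ValueError("no nstlim setting found in input")
    return new_text


# Hypothetical input, for illustration only.
EXAMPLE_MDIN = """\
short serial test
 &cntrl
   imin=0, nstlim=500000, dt=0.002,
   ntb=1, cut=8.0,
 /
"""

short = shorten_run(EXAMPLE_MDIN)
print(short)
```

You would then write the result to a new file and run the serial sander on
it, for example "sander -O -i mdin.short -o mdout.short -p prmtop -c
inpcrd" (the file names here are placeholders), and look in mdout.short for
the error message.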

All the best
Ross

/\
\/
|\oss Walker

| HPC Consultant and Staff Scientist |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | PGP Key available on request |
