AMBER Archive (2009)

Subject: RE: [AMBER] amber10 pmemd fail

From: Ross Walker (ross_at_rosswalker.co.uk)
Date: Fri Dec 11 2009 - 21:43:53 CST


Hi Aragorn,

> I would like a bit of advice. I have compiled pmemd for a researcher
> here at Wayne State. When we run it it dies at varying points. We get
> various errors from mpirun such as:
>
> p8_15722: (1054.988281) net_recv failed for fd = 9
> p8_15722: p4_error: net_recv read, errno = : 110
> p14_21538: p4_error: Found a dead connection while looking for
> messages: 9

This looks like a hardware issue to me. Are you certain your interconnect is working properly? PMEMD can stress the interconnect far more than most codes, so flaky cables and similar faults may not show up in other runs, especially if those codes use blocking MPI instead of the nonblocking MPI that pmemd uses.

I would suggest getting hold of a stress test suite for your MPI implementation and running it on the machine to check for hardware problems. These often show up as low bandwidth for large messages in MPI bandwidth tests. I have seen this A LOT on InfiniBand, which can be VERY sensitive to flaky cable connections.
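
As a rough illustration of the kind of bandwidth test meant here, the sketch below exchanges messages of increasing size between ranks 0 and 1 using nonblocking sends and receives, then prints the achieved bandwidth per size. It assumes mpi4py and NumPy are installed, which the original post does not mention; the function names, message sizes, and repeat count are made up for illustration. It is not a substitute for a proper stress suite from your MPI vendor.

```python
def bandwidth_mb_s(nbytes, exchanges, elapsed_s):
    """Aggregate MB/s when each exchange moves nbytes in both directions."""
    return (2 * nbytes * exchanges) / elapsed_s / 1e6


def exchange_test(sizes=(1 << 10, 1 << 16, 1 << 20, 1 << 24), exchanges=20):
    # Imported here so bandwidth_mb_s can be used without MPI installed.
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    if comm.Get_size() < 2:
        print("need at least 2 ranks, ideally placed on different nodes")
        return

    for nbytes in sizes:
        send = np.zeros(nbytes, dtype=np.uint8)
        recv = np.empty_like(send)
        comm.Barrier()  # line up all ranks before timing
        start = MPI.Wtime()
        if rank in (0, 1):
            peer = 1 - rank
            for _ in range(exchanges):
                # Nonblocking send and receive posted together, then waited on.
                reqs = [comm.Isend(send, dest=peer, tag=0),
                        comm.Irecv(recv, source=peer, tag=0)]
                MPI.Request.Waitall(reqs)
        elapsed = MPI.Wtime() - start
        if rank == 0:
            rate = bandwidth_mb_s(nbytes, exchanges, elapsed)
            print(f"{nbytes:>10d} bytes  {rate:10.1f} MB/s")


if __name__ == "__main__":
    try:
        exchange_test()
    except ImportError:
        print("mpi4py or numpy not available; install them to run the test")
```

Launch it with two ranks on different nodes through your MPI launcher, and repeat for different node pairs. Bandwidth that drops sharply at the largest message sizes on one pair but not others would point at a specific cable or port.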

You should also check that you are linking against the correct MPI libraries, but I suspect hardware more than anything else.

Make sure you have applied all the latest bugfixes as well.

All the best
Ross

/\
\/
|\oss Walker

| Assistant Research Professor |
| San Diego Supercomputer Center |
| Tel: +1 858 822 0854 | EMail:- ross_at_rosswalker.co.uk |
| http://www.rosswalker.co.uk | http://www.wmd-lab.org/ |

Note: Electronic Mail is not secure, has no guarantee of delivery, may not be read every day, and should not be used for urgent or sensitive issues.

_______________________________________________
AMBER mailing list
AMBER_at_ambermd.org
http://lists.ambermd.org/mailman/listinfo/amber