Channel: Clusters and HPC Technology

Detecting a failed node and notifying the other nodes using PSM2


Hello everyone,

I am currently implementing a failure recovery system for a cluster with Intel Omni-Path that will handle computations in a physical experiment. I want to implement a mechanism that detects a failed node and notifies the rest of the nodes. I tried to detect the node failure by invoking psm2_poll. Unfortunately, according to the Intel® Performance Scaled Messaging 2 (PSM2) Programmer's Guide, this function does not return any values other than PSM2_OK or PSM2_OK_NO_PROGRESS. This matches what I have observed in my application: polling while a peer node is dead behaves as if the node had not failed or disconnected and had simply not sent any message.

So the question is: what methods are there for notifying the other nodes after a node failure? Is there a lightweight function that I can invoke along with psm2_poll to check whether the node from which I am trying to receive messages still exists?

In the worst case, I can implement this myself using a counter and a timeout, but if there is a mechanism supported by the API, I would prefer to use it.
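To make the fallback concrete, here is a minimal sketch of what I mean by a counter and a timeout. It is not part of PSM2 and calls no PSM2 functions: all names in it (liveness_tracker, liveness_init, liveness_mark_heard, liveness_check, MAX_PEERS) are hypothetical, and the timestamps are assumed to come from a monotonic clock in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PEERS 64 /* hypothetical upper bound on the number of peers */

/* Per-peer bookkeeping: when we last heard from each peer, and whether
 * we already reported it as suspected. Not a PSM2 structure. */
typedef struct {
    uint64_t last_heard_ms[MAX_PEERS];
    bool     suspected[MAX_PEERS];
    size_t   peer_count;
    uint64_t timeout_ms;
} liveness_tracker;

/* Start tracking: every peer counts as "heard" at now_ms. */
void liveness_init(liveness_tracker *t, size_t peer_count,
                   uint64_t timeout_ms, uint64_t now_ms)
{
    assert(peer_count <= MAX_PEERS);
    t->peer_count = peer_count;
    t->timeout_ms = timeout_ms;
    for (size_t i = 0; i < peer_count; i++) {
        t->last_heard_ms[i] = now_ms;
        t->suspected[i] = false;
    }
}

/* Call whenever any message (data or heartbeat) arrives from a peer. */
void liveness_mark_heard(liveness_tracker *t, size_t peer, uint64_t now_ms)
{
    if (peer < t->peer_count) {
        t->last_heard_ms[peer] = now_ms;
        t->suspected[peer] = false; /* the peer is alive again */
    }
}

/* Writes the indices of peers that have newly exceeded the timeout into
 * out (at most out_cap entries) and returns how many were written.
 * Each silent peer is reported once until it is heard from again. */
size_t liveness_check(liveness_tracker *t, uint64_t now_ms,
                      size_t *out, size_t out_cap)
{
    size_t n = 0;
    for (size_t i = 0; i < t->peer_count; i++) {
        if (n == out_cap)
            break; /* remaining peers are reported on the next call */
        if (!t->suspected[i] &&
            now_ms - t->last_heard_ms[i] > t->timeout_ms) {
            t->suspected[i] = true;
            out[n++] = i;
        }
    }
    return n;
}
```

The idea would be that every node sends a small heartbeat message periodically, the receive path calls liveness_mark_heard for the sender, and the loop that calls psm2_poll also calls liveness_check from time to time. For every newly suspected peer, the detecting node would then send a notification to the remaining nodes. A timeout alone cannot distinguish a dead node from a slow or congested one, which is why I would prefer an API-level mechanism if one exists.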

Best Regards

