Curious what specific fault modes this handles. Is it mainly for ECC errors, or something else like timeout recovery? Also wondering how this compares to just restarting the affected process, which has worked for us on workstation GPUs.

Anything, tbh. As long as you have a runbook, you can try to automate its actions through nvsx, which sits on top of NVSentinel.
Restarting mostly works for smaller jobs. Distributed training will pretty commonly need more fault-tolerant methods to keep going, rather than just restarting.