Show HN: nvsx – A hook layer for NVSentinel GPU fault remediation
1 point
2 hours ago
| 1 comment
| github.com
| HN
nigardev
1 hour ago
curious what specific fault modes this handles. is it mainly for ECC errors, or something else like timeout recovery? also wondering how this compares to just restarting the affected process, which has worked for us on workstation GPUs.
essekar
1 hour ago
anything tbh. as long as you have a runbook, you can try to automate its actions through nvsx; it sits on top of NVSentinel. restarting mostly works for smaller jobs. distributed training, which is pretty common, needs more fault-tolerant methods to continue rather than just restarting.