Show HN: "htop" for PyTorch training, see stalls, memory and step time live
3 points | 1 hour ago | 0 comments | HN
I built a small tool that shows, live and per training step:

step time (CUDA events, no forced GPU sync)

dataloader time

GPU memory (incl. reserved vs allocated)
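The "no forced GPU sync" point likely means CUDA-event timing: record start/end events around each step and read elapsed time only once the end event reports completion, rather than calling torch.cuda.synchronize(). A minimal CPU-only sketch of that deferred-read pattern (time.perf_counter stands in for CUDA events, since real events need a GPU; class and method names are hypothetical, not the traceml API):

```python
import collections
import time

class DeferredStepTimer:
    """Record step timings and read them out lazily, mimicking CUDA-event
    timing where results are polled later instead of forcing a sync."""

    def __init__(self):
        self._pending = collections.deque()  # (step, start, end) tuples

    def record_step(self, step, fn):
        # With CUDA events this would be: start_evt.record(); fn(); end_evt.record()
        start = time.perf_counter()
        fn()  # stand-in for forward/backward/optimizer work
        end = time.perf_counter()
        self._pending.append((step, start, end))

    def drain(self):
        """Return completed (step, millis) pairs without blocking.
        With CUDA events, only entries whose end event .query() is True
        would be popped; the rest stay queued for the next drain."""
        done = []
        while self._pending:
            step, start, end = self._pending.popleft()
            done.append((step, (end - start) * 1000.0))
        return done

timer = DeferredStepTimer()
for i in range(3):
    timer.record_step(i, lambda: time.sleep(0.001))
for step, ms in timer.drain():
    print(f"step {step}: {ms:.2f} ms")
```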

The goal is to make it easy to correlate stalls ↔ memory pressure ↔ compute time while the run is happening (instead of only post-hoc profiling).

Usage is intentionally minimal: 3 hooks (a context manager + a function + a decorator). Output can be viewed in a local dashboard, in the CLI, or in a notebook.

Repo: https://github.com/traceopt-ai/traceml/

Feedback I would love:

What’s the one signal you always wish you had when debugging slow training?

What should this integrate with (or avoid) to stay low-overhead?
