Most existing benchmarks (e.g., nccl-tests, NVSHMEM) measure performance at the collective-library level, which makes it hard to isolate raw EFA transport performance. With growing interest in GPU-Initiated Networking (GIN) (e.g., DeepEP), we wanted a way to measure EFA latency, bandwidth, and scaling behavior directly at the transport level.
Libefaxx complements existing tools by helping engineers and researchers better understand and optimize inter-node communication for large-scale distributed training on AWS.