I do think AWS needs to improve its software to capture more downmarket traction, but my understanding is that even Trainium2, with virtually no public support, was financially successful for Anthropic as well as for scaling AWS Bedrock workloads.
Ease of optimization at the architecture level is what matters at the bleeding edge; a pure-AI organization will have teams of optimization and compiler engineers mining for tricks to get the most out of the hardware.
Turns out multi-billion dollar software companies can deal with the enormous software investment.
Amazon has all the resources needed to write its own backends for several ML frameworks, or even drop-in API replacements.
Eventually economics win: where margins are high, competition appears; in time margins get thinner and competition starts disappearing again. It's a cycle.
> In fact, they are conducting a massive, multi-phase shift in software strategy. Phase 1 is releasing and open sourcing a new native PyTorch backend. They will also be open sourcing the compiler for their kernel language called “NKI” (Neuron Kernel Interface) and their kernel and communication libraries matmul and ML ops (analogous to NCCL, cuBLAS, cuDNN, Aten Ops). Phase 2 consists of open sourcing their XLA graph compiler and JAX software stack.
> By open sourcing most of their software stack, AWS will help broaden adoption and kick-start an open developer ecosystem. We believe the CUDA Moat isn’t constructed by the Nvidia engineers that built the castle, but by the millions of external developers that dig the moat around that castle by contributing to the CUDA ecosystem. AWS has internalized this and is pursuing the exact same strategy.
AWS can make it seamless, so you can run open source models on their hardware.
See their ARM-based instances: you rarely notice you are running on ARM when using Lambda, k8s, Fargate, and others.
With Alchip, Amazon is working on "more economical design, foundry and backend support" for its upcoming chip programs, according to Acree.
https://www.morningstar.com/news/marketwatch/20251208112/mar...
If AWS really delivers on open-sourcing more of the toolchain, that could be a much bigger signal for adoption than raw specs alone.
It doesn't have a lot of ports and certainly not enough NTB to be useful as a switch, but man, it's wild to me that an AMD Epyc CPU has 128 lanes of PCIe and that switch chips are struggling to match even a basic server's worth of net bandwidth.
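For a sense of scale, here is a back-of-envelope sketch of what 128 lanes adds up to. The comment doesn't say which PCIe generation, so this assumes PCIe 5.0 (32 GT/s per lane, 128b/130b encoding); the function name and constants are just for illustration.

```python
# Assumption: PCIe 5.0 signalling. The generation is not stated in the comment above.
GT_PER_S_PER_LANE = 32           # PCIe 5.0 raw rate, gigatransfers/s per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding
LANES = 128                      # lane count quoted above


def pcie_bandwidth_gb_s(lanes, gt_per_s=GT_PER_S_PER_LANE,
                        efficiency=ENCODING_EFFICIENCY):
    """Usable bandwidth in GB/s, one direction, ignoring protocol overhead."""
    # One transfer carries one bit per lane; divide by 8 to get bytes.
    return lanes * gt_per_s * efficiency / 8


per_lane = pcie_bandwidth_gb_s(1)
total = pcie_bandwidth_gb_s(LANES)
print(f"{per_lane:.2f} GB/s per lane, "
      f"{total:.0f} GB/s across {LANES} lanes (per direction)")
```

Under those assumptions that is a little under 4 GB/s per lane and roughly 504 GB/s per direction for the whole socket, before packet and protocol overhead.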