I know they claim they work, but that's only on their happy path, with their very specific AMIs and the nightmare that is the Neuron SDK. Try to do any real work with them using your own dependencies and things tend to fall apart immediately.
It was only in the past couple of years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of the huge investment on Google's part in software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.
PMs may like to imagine the service teams are customer-led, and perhaps this is so for capability/feature roadmaps, but the crushing anvil of personal dynamics with other PMs and sheer internal scale create an unstoppable forcing function for polish, resilience, and handling of edge cases.
But yes, the less a specific service is a core building block (or widely used internally at Amazon), the more likely you are to run into significant issues.
I wonder if the difference is stuff they dogfood versus stuff they don't?
Enlighten us...
Seems AWS is using this heavily internally, which makes sense, but I'm not seeing it get traction outside that. Glad to see Amazon investing there though.
We're not quite seeing that on the trn1 instances yet, so someone is using them.
If Anthropic walked out on stage today and said how amazing it was and how they’re using it the announcement would have a lot more weight. Instead… crickets from Anthropic in the keynote
This is the AWS press release from last month saying Anthropic is using 500k Trainium chips and will use 500k more: https://finance.yahoo.com/news/amazon-says-anthropic-will-us...
And this is the Anthropic press release from last month saying they will use more Google TPUs but also are continuing to use Trainium (see the last 2 paragraphs specifically): https://www.anthropic.com/news/expanding-our-use-of-google-c...
You can’t really read into that. They are unlikely to let their competitors know if they have a slight performance/$ edge by going with AWS tech.
Anthropic is not going to interrupt their competitors if their competitors don't want to use Trainium. Neither would you, I, nor anyone else. There's only downside for them in doing so, and no upside at all.
From Anthropic's perspective, if the rest of us can't figure out how to make Trainium work? Good.
Amazon will fix the difficulty problem with time, but that's time Anthropic can use to press their advantages and entrench themselves in the market.
I used to work for an AI startup. This is where Nvidia's moat is: the tens of thousands of man-hours that have gone into making the entire AI ecosystem work well with Nvidia hardware and not much else.
It's not that they haven't thought of this, it's just that they don't want to hire another 1k engineers to do it.
Building an efficient compiler from high-level ML code down to a TPU is actually quite a difficult software engineering feat, and it's not clear that Amazon has the kind of engineering talent needed to build something like that. Not like Google, which has developed multiple compilers and language runtimes.
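To make concrete one small piece of what such a compiler does, here's a toy sketch of operator fusion: combining a chain of elementwise ops into a single pass so intermediate tensors are never materialized. This is a deliberately simplified illustration in plain Python, not how XLA or Neuron are actually implemented; the `fuse` helper and op names are made up for the example.

```python
# Toy sketch of operator fusion, one of the jobs an ML compiler
# (XLA for TPUs, Neuron for Trainium) performs over a tensor IR.
# Everything here is illustrative; the names are invented.

def fuse(*ops):
    """Compose elementwise ops into a single pass over the data,
    instead of one full pass (and one intermediate buffer) per op."""
    def fused(xs):
        out = []
        for x in xs:          # one loop over the data...
            for op in ops:    # ...applying every op in sequence
                x = op(x)
            out.append(x)
        return out
    return fused

relu = lambda v: max(v, 0.0)
scale2 = lambda v: v * 2.0
add1 = lambda v: v + 1.0

# scale2 -> add1 -> relu fused into one kernel
kernel = fuse(scale2, add1, relu)
print(kernel([-3.0, 0.5, 2.0]))  # [0.0, 2.0, 5.0]
```

The hard part in a real compiler is doing this over multi-dimensional tensors while also handling layout, tiling, and scheduling for the accelerator's memory hierarchy, which is where the engineering effort goes.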
https://newsletter.semianalysis.com/p/amazons-ai-self-suffic...
[0] https://www.godaddy.com/domainsearch/find?domainToCheck=trai...
The sole reason Amazon is throwing any money at this is that they think they can do to AI what they did to logistics and shipping, in an effort to slash costs heading into a recession (we can't fire anyone else). The hubris is monumental, to say the least.
but their overall confidence is very low... so the "Nvidia friendly" stance is face-saving, to make sure none of the bridges they currently rely on for AWS profit get burned.
AMD felt like they were so close to nabbing the accelerator future back in the HyperTransport days. But its recent successor, Infinity Fabric, is all internal.
There's Ultra Accelerator Link (UALink) getting some steam. Hypothetically CXL should be good for uses like this: it rides the PCIe PHY but with a lower-latency, lighter-weight protocol; close to RAM latency, not bad! But it's still limited to PCIe speeds, not nearly enough, with PCIe 6.0 just barely emerging now. Ideally, IMO, we'd also see more chips ship with integrated networking: it was so amazing when Intel Xeons had 100Gb Omni-Path for barely any price bump. UltraEthernet feels like it should be on core, gratis.
UltraEthernet feels like it should be on core, gratis.
I've been saying for a while that AMD should put a SolarFlare NIC in their I/O die. They already have switchable PCIe/SATA ports, why not switchable PCIe/Ethernet? UEC might be too niche though.
Pretty accurate in my experience, especially re: the Neuron SDK. Do not use.