A couple of examples:
Kimi K2 Thinking (1 trillion parameters): https://x.com/awnihannun/status/1986601104130646266
DeepSeek R1 (671B): https://x.com/awnihannun/status/1881915166922863045 - that one came with setup instructions in a Gist: https://gist.github.com/awni/ec071fd27940698edd14a4191855bba...
The release in Tahoe 26.2 will enable us to do fast tensor parallelism in MLX. Each layer of the model is sharded across all machines. With this type of parallelism you can get close to N-times faster for N machines. The main challenge is latency since you have to do much more frequent communication.
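For a rough picture of what that per-layer sharding looks like, here's a minimal sketch using MLX's distributed API (mx.distributed.init and all_sum are real MLX calls; the layer sizes, random weights, and single-MLP example are invented for illustration):

    # Megatron-style sharded MLP: local matmuls, one all-reduce per layer.
    import mlx.core as mx

    group = mx.distributed.init()      # one process ("rank") per machine
    size = group.size()

    d_model, d_ff = 4096, 16384
    shard = d_ff // size               # each rank owns 1/size of the hidden dim

    w_up = mx.random.normal((d_model, shard))    # column shard of W_up
    w_down = mx.random.normal((shard, d_model))  # row shard of W_down

    def sharded_mlp(x):
        h = mx.maximum(x @ w_up, 0)    # local compute, no communication
        partial = h @ w_down           # each rank produces a partial output
        # The frequent communication mentioned above: one all-reduce per layer.
        return mx.distributed.all_sum(partial, group=group)

    y = sharded_mlp(mx.random.normal((1, d_model)))
    mx.eval(y)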
Exo-Labs: https://github.com/exo-explore/exo
Earlier this year I experimented with building a cluster to do tensor parallelism across large-cache CPUs (the AMD EPYC 7773X has 768 MB of L3). My thought was to keep an entire model in SRAM and take advantage of the crazy memory bandwidth between CPU cores and their cache, and use Infiniband between nodes for the scatter/gather operations.
Turns out the sum of intra-core latency and PCIe latency absolutely dominates. The Infiniband fabric is damn fast once you get data to it, but getting it there quickly is a struggle. CXL would help but I didn't have the budget for newer hardware. Perhaps modern Apple hardware is better for this than x86 stuff.
The way it typically works in an attention block is: smaller portions of the Q, K and V linear layers are assigned to each node and are processed independently. Attention, RoPE, norm, etc. are run on the node-specific output of that. Then, when the output linear layer is applied, an "all reduce" is computed which combines the outputs of all the nodes.
EDIT: just realized it wasn't clear -- this means that each node ends up holding the portion of the KV cache specific to its KV tensor shards. This can change based on the specific style of attention (e.g., in GQA, where there are fewer KV heads than ranks, you end up having to do some replication, etc.)
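A hedged sketch of that head-sharded pattern in MLX-style Python (the shapes, head split, and weight names are illustrative assumptions, not mlx-lm's actual code, and the causal mask and cache reuse are omitted for brevity):

    import mlx.core as mx

    group = mx.distributed.init()
    size = group.size()

    d_model, n_heads, head_dim = 4096, 32, 128
    local_heads = n_heads // size                 # each rank owns a slice of the heads

    wq = mx.random.normal((d_model, local_heads * head_dim))
    wk = mx.random.normal((d_model, local_heads * head_dim))
    wv = mx.random.normal((d_model, local_heads * head_dim))
    wo = mx.random.normal((local_heads * head_dim, d_model))   # row shard of W_O

    def attend(x, kv_cache):
        # Local Q/K/V projections: no communication, and the K/V this rank
        # produces only ever lives on this rank (the point made in the EDIT).
        q = (x @ wq).reshape(-1, local_heads, head_dim)
        k = (x @ wk).reshape(-1, local_heads, head_dim)
        v = (x @ wv).reshape(-1, local_heads, head_dim)
        kv_cache.append((k, v))                   # per-rank slice of the KV cache

        scores = mx.softmax((q.transpose(1, 0, 2) @ k.transpose(1, 2, 0))
                            / head_dim ** 0.5, axis=-1)
        out = (scores @ v.transpose(1, 0, 2)).transpose(1, 0, 2)
        out = out.reshape(-1, local_heads * head_dim)

        # The only cross-node communication: one all-reduce after the output projection.
        return mx.distributed.all_sum(out @ wo, group=group)

    cache = []
    y = attend(mx.random.normal((8, d_model)), cache)
    mx.eval(y)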
What I am asking, however, is whether that will speed up decoding as close to linearly as it does prefilling.
In our benchmarks with MLX / mlx-lm it's as much as 3.5x for token generation (decoding) at batch size 1 over 4 machines. In that case you are memory bandwidth bound so sharding the model and KV cache 4-ways means each machine only needs to access 1/4th as much memory.
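A back-of-envelope model of why decode lands around 3.5x rather than a full 4x; all the numbers below are assumptions for illustration, not figures from the mlx-lm benchmarks:

    # Decode is memory-bandwidth bound: per-token time ~ weight bytes read / bandwidth,
    # plus a latency-dominated all-reduce per layer when you shard across machines.
    weights_gb = 400           # assumed weight footprint touched per token
    bandwidth_gb = 800         # assumed per-machine memory bandwidth, GB/s
    layers = 60
    per_layer_comm_us = 300    # assumed per-layer all-reduce cost over the interconnect

    def tok_time_ms(n_machines):
        read_ms = (weights_gb / n_machines) / bandwidth_gb * 1000
        comm_ms = layers * per_layer_comm_us / 1000 if n_machines > 1 else 0
        return read_ms + comm_ms

    print(f"speedup on 4 machines: {tok_time_ms(1) / tok_time_ms(4):.2f}x")  # ~3.5x, not 4x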
Note fast sync workaround
Hopefully this makes it really nice for people who want to experiment with LLMs and have a local model, but means well-funded companies won't have any reason to grab them all instead of GPUs.
That said, the need for them also faded. The new chips have performance every bit as good as the eGPU-enhanced Intel chips.
What this does offer is a good alternative to GPUs for smaller scale use and research. At small scale it’s probably competitive.
Apple wants to dominate the pro and serious amateur niches. Feels like they're realizing that local LLMs and AI research are part of that, the kind of thing end users would want big machines for.
I think Apple is done with expansion slots, etc.
You'll likely see M5 Mac Studios fairly soon.
Also, I’m curious and in case anyone that knows reads this comment:
Apple say they can't get the performance they want out of discrete GPUs.
Fair enough. And yet nVidia has become the most valuable company in the world selling GPUs.
So…
Now I get that Apple's use case is essentially sealed consumer devices built with power consumption and performance tradeoffs in mind.
But could Apple use its Apple Silicon tech to build a Mac Pro with its own expandable GPU options?
Or even other brand GPUs knowing they would be used for AI research etc…. If Apple ever make friends with nVidia again of course :-/
What we know of Tim Cook's Apple is that it doesn't like to leave money on the table, and clearly they are right now!
Theoretically they could farm out the GPU to another company but it seems like they’re set on owning all of the hardware designs.
SJ loved to quote Alan Kay:
"People who are really serious about software should make their own hardware."
Qualcomm are the latest on the chopping block, history repeating itself.
If I were a betting man I'd say Apple's never going back.
I guess there are other kinds of scientific simulation, very large dev work, etc., but those things are quite a bit more niche.
Having used both professionally, once you understand how to drive Apple's MDM, Mac OS is as easy to sysadmin as Linux. I'll grant you it's a steep learning curve, but so is Linux/BSD if you're coming at it fresh.
In certain ways it's easier - if you buy a device through Apple Business you can have it so that you (or someone working in a remote location) can take it out of the shrink wrap, connect it to the internet, and get a configured and managed device automatically. No PXE boot, no disk imaging, no having it shipped to you to configure and ship out again. If you've done it properly the user can't interrupt/corrupt the process.
The only thing they're really missing is an iLO; I can imagine how AWS solved that, but I'd love to know.
This is an issue for some industry-standard software like CUDA, which does provide BSD drivers with ARM support that just never get adopted by Apple: https://www.nvidia.com/en-us/drivers/unix/
(Edit: interesting, thanks. So the underlying OS APIs that supply the power-consumption figures reported by asitop are just outright broken. The discrepancy is far too large to chalk up to static power losses or die-specific calibration factors that the video talks about.)
Nowadays I fire off async jobs that involve 1000s of requests and billions of tokens, yet it costs basically the same as if I didn't.
Maybe it takes a different type of person than I am, but all these "pay-as-you-go"/tokens/credits platforms make me nervous, and I end up either not using them or spending time trying to "optimize". Investing in hardware and infrastructure I can run at home and use, on the other hand, is something my head has no problem just rolling with.
And just because you're mostly using local models doesn't mean you can't use API hosted models in specific contexts. Of course, then the same dread sets in, but if you can do 90% of the tokens with local models and 10% with pay-per-usage API hosted models, you get the best of both worlds.
When the 2019 Mac Pro came out, it was "amazing" how many still photography YouTubers all got launch day deliveries of the same BTO Mac Pro, with exactly the same spec:
18 core CPU, 384GB memory, Vega II Duo GPU and an 8TB SSD.
Or, more likely, Apple worked with them and made sure each of them had this Mac on launch day, while they waited for the model they actually ordered. Because they sure as hell didn't need an $18,000 computer for Lightroom.
The personal computing situation is great right now. RAM is temporarily more expensive, but it's definitely not ending any eras.
We're going back to the "consumer PCs have 8GB of RAM" era thanks to the AI bubble.
Home PCs are as cheap as they’ve ever been. Adjusted for inflation the same can be said about “home use” Macs. The list price of an entry-level MacBook Air has been pretty much the same for more than a decade. Adjust for inflation, and you get a MacBook Air that is massively better in every way for less than half the real cost of the launch model.
A blip in high end RAM prices has no bearing on affordable home computing. Look at the last year or two and the proliferation of cheap, moderately to highly specced mini desktops.
I can get a Ryzen 7 system with 32gb of ddr5, and a 1tb drive delivered to my house before dinner tomorrow for $500 + tax.
That’s not depressing, that’s amazing!
That's an amazing price, but I'd like to see where you're getting it. 32GB of RAM alone costs €450 here (€250 if you're willing to trust Amazon's February 2026 delivery dates).
Getting a PC isn't that expensive, but after the blockchain hype and then the AI hype, prices have yet to come down. All estimations I've seen will have RAM prices increase further until the summer of next year, and the first dents in pricing coming the year after at the very earliest.
Shipping might screw you but here are in-stock 32GB kits of name-brand RAM from a well-known retailer in the US for $280 [1].
Edit: the same Crucial RAM kit is £220 in stock at Amazon [2]
(0)https://www.amazon.com/BOSGAME-P3-Gigabit-Ethernet-Computer/...
(1)https://www.bhphotovideo.com/c/product/1809983-REG/crucial_c...
(2) https://www.amazon.co.uk/dp/B0CTHXMYL8?tag=pcp0f-21&linkCode...
> A blip in high end RAM prices

It's not a blip and it's not limited to high end machines and configurations. Altman gobbled up the lion's share of wafer production. Look at that Raspberry Pi article that made it to the front page, that's pretty far from a high end Mac and according to the article's author likely to be exported from China due to the RAM supply crisis.

> I can get a Ryzen 7 system with 32gb of ddr5, and a 1tb drive delivered to my house before dinner tomorrow for $500 + tax.
B&H is showing a 7700X at $250 with their cheapest 32GB DDR5 5200 sticks at $384. So you've already gone over budget for just the memory and CPU. No motherboard, no SSD. Amazon is showing some no-name stuff at $298 as their cheapest memory and a Ryzen 7700X at $246.
Add another $100 for an NVMe drive and another $70–100 for the cheapest AM5 motherboards I could find on either of those sites.
If everything remains the same, RAM pricing will also. I have never once found a period in known history where everything stays the same, and I would be willing to bet 5 figures that at some point in the future I will be able to buy DDR5 or better ram for cheaper than today. I can point out that in the long run, prices for computing equipment have always fallen. I would trust that trend a lot more than a shortage a few months old changing the very nature of commodity markets. Mind you, I’m not the richest man on earth either, so my pattern matched opinion should be judged the same.
> B&H is showing a 7700X at $250 with their cheapest 32GB DDR5 5200 sticks at $384. So you've already gone over budget for just the memory and CPU. No motherboard, no SSD.
I didn't say I could build one from parts. Instead I said buy a mini pc, and then went and looked up the specs and price point to be sure.
The PC that I was talking about is here[https://a.co/d/6c8Udbp]. I live in Canada so translated the prices to USD. Remember that US stores are sometimes forced to hide a massive import tax in those parts prices. The rest of the world isn’t subject to that and pays less.
Edit: here’s an equivalent speced pc available in the US for $439 with a prime membership. So even with the cost of prime membership you can get a Ryzen 7 32gb 1tb for $455. https://www.amazon.com/BOSGAME-P3-Gigabit-Ethernet-Computer/...
If the current RAM supply crisis continues, it is very likely that these kinds of offers will disappear and that systems like this will become more expensive as well, not to mention all the other products that rely on DRAM components.
I also don’t believe RAM prices will drop again anytime soon, especially now that manufacturers have seen how high prices can go while demand still holds. Unlike something like graphics cards, RAM is not optional, it is a fundamental requirement for building any computer (or any device that contains one). People don’t buy it because they want to, but because they have to.
In the end, I suspect that some form of market-regulating mechanism may be required, potentially through government intervention. Otherwise, it’s hard for me to see what would bring prices down again, unless Chinese manufacturers manage to produce DRAM at scale, at significantly lower cost, and effectively flood the market.
> People that can reliably predict the future

You don't need to be a genius or a billionaire to realize that when most of the global supply of a product becomes unavailable the remaining supply gets more expensive.

> here’s an equivalent speced pc available in the US for $439 with a prime membership.

So with prime that's $439+139 for $578 which is only slightly higher than the cost without prime of $549.99.

Yes. Absolutely correct if you are talking about the short term. I was talking about the long term, and said that. If you are so certain would you take this bet: any odds, any amount that within 1 month I can buy 32gb of new retail DDR5 in the US for at least 10% less than the $384 you cited. (Think very hard on why I might offer you infinite upside so confidently. It's not because I know where the price of RAM is going in the short term.)
> So with prime that's $439+139 for $578 which is only slightly higher than the cost without prime of $549.99.
At this point I can't tell if you are arguing in bad faith, or just unfamiliar with how prime works. Just in case: You have cited the cost of prime for a full year. You can buy just a month of prime for a maximum price of $14.99 (that's how I got $455) if you have already used your free trial, and don't qualify for any discounts. Prime also allows cancellation within 14 days of signing up for a paid option, which is more than enough time to order a computer, and have it delivered, and cancel for a full refund.
So really, if you use a trial or ask for a refund for your prime fees the price is $439. So we have actually gotten the price a full 10% lower than I originally cited.
Edit: to eliminate any arguments about Prime in the price of the PC, here's an identically specced mini PC for the same price from Newegg https://www.newegg.com/p/2SW-00BM-00002
I agree that we've seen similar fluctuations in the past and the price of compute trends down in the long-term. This could be a bubble, which it likely is, in which case prices should return to baseline eventually. The political climate is extremely challenging at this time though so things could take longer to stabilize. Do you think we're in this ride for months or years?
Maybe the AI money train stops after Christmas. The entire economy is fucked, but RAM is cheap.
Maybe we unlock AGI and the price sky rockets further before factories can get built.
There are just too many variables.
The real test is if someone had seen this coming, they would have made massive absurd investment returns just by buying up stock and storing it for a few months. Anyone who didn’t take advantage of that opportunity has proved that they had no real confidence in their ability to predict the future price of RAM. RAM inventory might have been one of the highest return investments possible this year. Where are all the RAM whales in Lambos who saw this coming?
As a corollary: we can say that unless you have some skin in the game and have invested a significant amount of your wealth in RAM chips, then you don’t know which way the price is going or when.
Extending that even further: people complaining about RAM prices being so high, and moaning that they bought less RAM because of it are actually signaling through action that they think that prices will go down or have leveled off. Anyone who believes that sticks of DDR5 RAM will continue the trend should be cleaning out Amazon, Best Buy and Newegg since the price will never be lower than today.
The distinct lack of serious people saying “I told ya so” with receipts, combined with the lack of people hoarding RAM to sell later is a good indirect signal that no one knows what is happening in the near term.
just the 5090 GPU costs +$3k, what are you even talking about
I’m talking about the hundreds of affordable models that are perfectly suitable for everything up to and including AAA gaming.
The existence of expensive, and very much optional, high end computer parts does not mean that affordable computers are not more incredible than ever.
Just because cutting edge high end parts are out of reach to you, does not mean that perfectly usable computers are too, as I demonstrated with actual specs and prices in my post.
That’s what I’m talking about.
How much has a base model MacBook Air changed in price over the last 15 years? With inflation, it's gotten cheaper.
The original base MacBook Air sold for $1799 in 2008. The inflation adjusted price is $2715.
The current base model is $999, and literally better in every way except thickness on one edge.
If we constrain ourselves to just the last 15 years: the $999 MBA was released 15 years ago ($1488 in real dollars). The list price has remained the same for the base model, with the exception of when they sold the discontinued 11” MBAs for $899.
It’s actually kind of wild how much better and cheaper computers have gotten.
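A quick check of those figures with rough CPI multipliers (the 1.51x and 1.49x factors are approximations, not official numbers):

    # Inflation-adjusting the launch prices quoted above (approximate multipliers).
    prices = {
        "2008 MacBook Air at $1799": 1799 * 1.51,   # roughly the $2715 figure
        "2010 MacBook Air at $999": 999 * 1.49,     # roughly the $1488 figure
    }
    for label, adjusted in prices.items():
        print(f"{label} is about ${adjusted:,.0f} in today's dollars, vs. $999 for today's base model")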
The analogous PC for this era requires a large amount of high speed memory and specialized inference hardware.
You can call a computer a calculator, but that doesn’t make it a calculator.
Can they run SOTA LLMs? No. Can they run smaller, yet still capable LLMs? Yes.
However, I don’t think that the ability to run SOTA LLMs is a reasonable expectation for “a computer in every home” just a few years into that software category even existing.
I feel like you’re moving the goalposts to make your point that it has to be local compute to have access to AI. Why does it need to be local?
Update: I take it back. You can get access to AI for free.
Here’s a text edition: For $50k the inference hardware market forces a trade-off between capacity and throughput:
* Apple M3 Ultra Cluster ($50k): Maximizes capacity (3TB). It is the only option in this price class capable of running 3T+ parameter models (e.g., Kimi k2), albeit at low speeds (~15 t/s).
* NVIDIA RTX 6000 Workstation ($50k): Maximizes throughput (>80 t/s). It is superior for training and inference but is hard-capped at 384GB VRAM, restricting model size to <400B parameters.
To achieve both high capacity (3TB) and high throughput (>100 t/s) requires a ~$270,000 NVIDIA GH200 cluster and data center infrastructure. The Apple cluster provides 87% of that capacity for 18% of the cost.
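Sanity-checking those ratios (using only the prices and capacities quoted above, not independently sourced numbers):

    apple_cost, apple_tb = 50_000, 3.0
    gh200_cost = 270_000
    gh200_tb = apple_tb / 0.87                 # implied ~3.4 TB from the "87%" figure

    print(f"capacity ratio: {apple_tb / gh200_tb:.1%}")      # 87.0%
    print(f"cost ratio:     {apple_cost / gh200_cost:.1%}")  # 18.5%, i.e. "18% of the cost"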
I've limited power consumption to what I consider the optimum, each card will draw ~275 Watts (you can very nicely configure this on a per-card basis). The server itself also uses some for the motherboard, the whole rig is powered from 4 1600W supplies, the gpus are divided 5/5/4 and the mother board is connected to its own supply. It's a bit close to the edge for the supplies that have five 3090's on them but so far it held up quite well, even with higher ambient temps.
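A rough headroom check for that 5/5/4 split (275 W is the configured cap; real 3090s can spike well past it transiently, which is the "close to the edge" part):

    psu_watts = 1600
    per_card = 275
    for cards_on_psu in (5, 5, 4):
        load = cards_on_psu * per_card
        print(f"{cards_on_psu} cards: {load} W on a {psu_watts} W supply "
              f"({load / psu_watts:.0%} loaded)")
    # 5 cards -> 1375 W (86%), 4 cards -> 1100 W (69%), before transient spikes.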
Interesting tidbit: at 4 lanes/card throughput is barely impacted, 1 or 2 is definitely too low. 8 would be great but the CPUs don't have that many lanes.
I also have a threadripper which should be able to handle that much RAM but at current RAM prices that's not interesting (that server I could populate with RAM that I still had that fit that board, and some more I bought from a refurbisher).
If you can afford the 16 (pcie 3) lanes, you could get a PLX ("PCIe Gen3 PLX Packet switch X16 - x8x8x8x8" on ebay for like $300) and get 4 of your cards up to x8.
So that switch would probably work but I wonder how big the benefit would be: you will probably see effectively an x4 -> (x4 / x8) -> (x8 / x8) -> (x8 / x8) -> (x8 / x4) -> x4 pipeline, and then on to the next set of four boards.
It might run faster on account of the three passes that are double the speed they are right now, as long as the CPU does not need to talk to those cards and all transfers are between layers on adjacent cards (very likely), and with even more luck (due to timing and lack of overlap) it might run the two x4 passes at approaching x8 speeds as well. And then of course you need to do this a couple of times because four cards isn't enough, so you'd need four of those switches.
I have not tried having a single card with fewer lanes in the pipeline but that should be an easy test to see what the effect on throughput of such a constriction would be.
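For a sense of scale, here's the rough math on what each lane count buys per hand-off between cards (PCIe 3.0 is roughly 0.985 GB/s per lane after encoding overhead; the fp16 activation size is an assumption):

    lane_gbps = 0.985                      # ~GB/s per PCIe 3.0 lane
    hidden, batch, bytes_per = 8192, 1, 2  # assumed fp16 hidden state handed to the next card

    payload_bytes = hidden * batch * bytes_per
    for lanes in (1, 2, 4, 8):
        gbs = lanes * lane_gbps
        us = payload_bytes / (gbs * 1e9) * 1e6
        print(f"x{lanes}: {gbs:4.1f} GB/s, ~{us:4.1f} us per hand-off")
    # The per-token payload is tiny, so fixed per-transfer latency (and any CPU
    # round trip) is what hurts at x1/x2 -- consistent with x4 being barely impacted.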
But now you have me wondering to what extent I could bundle 2 x8 into an x16 slot and then to use four of these cards inserted into a fifth! That would be an absolutely unholy assembly but it has the advantage that you will need far fewer risers, just one x16 to x8/x8 run in reverse (which I have no idea if that's even possible but I see no reason right away why it would not work unless there are more driver chips in between the slots and the CPUs, which may be the case for some of the farthest slots).
PCIe is quite amazing in terms of the topology tricks that you can pull off with it, and c-payne's stuff is extremely high quality.
And I'm way above 3 kW, more likely 5000 to 5500 with the GPUs running as high as I'll let them, or thereabouts, but I only have one power meter and it maxes out at 2500 watts or so. This is using two Xeons in a very high end but slightly older motherboard. When it runs the space that it is in becomes hot enough that even in the winter I have to use forced air from outside otherwise it will die.
As for electricity costs, I have 50 solar panels and on a good day they more than offset the electricity use; at 2 pm (solar noon here) I'd still be pushing 8 kW extra back into the grid. This obviously does not work out so favorably in the winter.
Building a system like this isn't very hard, it is just a lot of money for a private individual but I can afford it, I think this build is a bit under $10K, so a fraction of what you'd pay for a commercial solution but obviously far less polished and still less performant. But it is a lot of bang for the buck and I'd much rather have this rig at $10K than the first commercial solution available at a multiple of this.
I wrote a bit about power efficiency in the run-up to this build when I only had two GPUs to play with:
https://jacquesmattheij.com/llama-energy-efficiency/
My main issue with the system is that it is physically fragile, I can't transport it at all, you basically have to take it apart and then move the parts and re-assemble it on the other side. It's just too heavy and the power distribution is messy so you end up with a lot of loose wires and power supplies. I could make a complete enclosure for everything but this machine is not running permanently and when I need the space for other things I just take it apart, store the GPUs in their original boxes until the next home-run AI project. Putting it all together is about 2 hours of work. We call it Frankie, on account of how it looks.
edit: one more note, the noise it makes is absolutely incredible and I would not recommend running something like this in your house unless you are (1) crazy or (2) have a separate garage where you can install it.
You might be able to use USB4 but unsure how the latency is for that.
Strix Halo IO topology: https://www.techpowerup.com/cpu-specs/ryzen-ai-max-395.c3994
Framework's mainboard implements 2 of those PCIe4x4 GPP interfaces as M.2 PHYs, which you can use a passive adapter to connect a standard PCIe AIC (like a NIC or DPU) to, and also interestingly exposes that 3rd x4 GPP as a standard x4-length PCIe CEM slot, though the system/case isn't compatible with actually installing a standard PCIe add-in card in there without getting hacky with it, especially as it's not an open-ended slot.
You absolutely could slap 1x SSD in there for local storage, and then attach up to 4x RDMA-supporting NICs to a RoCE-enabled switch (or Infiniband if you're feeling special) to build out a Strix Halo cluster (and you could do similar with Mac Studios, to be fair). You could get really extra by using a DPU/SmartNIC that allows you to boot from an NVMeoF SAN to leverage all 5 sets of PCIe4x4 for connectivity without any local storage, but we're hitting a complexity/cost threshold there that I doubt most people want to cross. Or if they are willing to cross that threshold, they'd also be looking at other solutions better suited to that which don't require as many workarounds.
Apple's solution is better for a small cluster, both in pure connectivity terms and also with respect to its memory advantages, but Strix Halo is doable. However, in both cases, scaling up beyond 3 or especially 4 nodes you rapidly enter complexity and cost territory that is better served by nodes that are less restrictive, unless you have some very niche reason to use either Macs (especially non-Pro) or Strix Halo specifically.
That being said, for inference Macs still remain the best, and the M5 Ultra will be an even better value with its better PP.
• CPU: AMD Ryzen Threadripper PRO 7995WX (96-Core) • Cost: $10,000
• Motherboard: WRX90 Chipset (supports 7x PCIe Gen5 slots) • Cost: $1,200
• RAM: 512GB DDR5 ECC Registered • Cost: $2,000
• Chassis & Power: Supermicro or specialized Workstation case + 2x 1600W PSUs. • Cost: $1,500
• Total Cost: ~$50,700
It’s a bit maximalist, but if you had to spend $50k it’s going to be about as fast as you can make it.
The main OS needs to run somewhere. At least for now.
with MGX and CX8 we see PCIe root moving to the NIC, which is very exciting.
Now you need to add 8 $5K monitors to get something similarly ludicrous.
Wake me up when the situation improves
1. The power button is in an awkward location, meaning rackmounting them (either 10" or 19" rack) is a bit cumbersome (at best)
2. Thunderbolt is great for peripherals, but as a semi-permanent interconnect, I have worries over the port's physical stability... wish they made a Mac with QSFP :)
3. Cabling will be important, as I've had tons of issues with TB4 and TB5 devices with anything but the most expensive Cable Matters and Apple cables I've tested (and even then...)
4. macOS remote management is not nearly as efficient as Linux, at least if you're using open source / built-in tooling
To that last point, I've been trying to figure out a way to, for example, upgrade to macOS 26.2 from 26.1 remotely, without a GUI, but it looks like you _have_ to use something like Screen Sharing or an IP KVM to log into the UI, to click the right buttons to initiate the upgrade.
Trying "sudo softwareupdate -i -a" will install minor updates, but not full OS upgrades, at least AFAICT.
https://www.owc.com/solutions/thunderbolt-dock
It's a poor imitation of old ports that had screws on the cables, but should help reduce inadvertent port stress.
The screw only works with limited devices (ie not the Mac Studio end of the cord) but it can also be adhesive mounted.
See for example:
Apparently since 2016 https://www.usb.org/sites/default/files/documents/usb_type-c...
So for any permanent Thunderbolt GPU setups, they should really be using this type of cable
erase-install can be run non-interactively when the correct arguments are used. I've only ever used it with an MDM in play so YMMV:
Thunderbolt as a server interconnect displeases me aesthetically but my conclusion is the opposite of yours:
If the systems are locked into place as servers in a rack the movements and stresses on the cable are much lower than when it is used as a peripheral interconnect for a desktop or laptop, yes ?
Apple’s chassis do not support it. But conceptually that’s not a Thunderbolt problem, it’s an Apple problem. You could probably drill into the Mac Studio chassis to create mount points.
I think you can do this if you install a MDM profile on the Macs and use some kind of management software like Jamf.
Legally, you probably need a Mac. Or rent access to one, that's probably cheaper.
I'd have some other uses for RDMA between Macs.
https://github.com/Anemll/mlx-rdma/commit/a901dbd3f9eeefc628...
Is this part of Apple’s plan of building out server side AI support using their own hardware?
If so they would need more physical data centres.
I’m guessing they too would be constrained by RAM.
See: https://ml-explore.github.io/mlx/build/html/usage/distribute...
I’m not sure if it would be of much utility because this would presumably be for tensor parallel workloads. In that case you want the ranks in your cluster to be uniform or else everything will be forced to run at the speed of the slowest rank.
You could run pipeline parallel but not sure it’d be that much better than what we already have.
rdma_ctl enable
Don’t get me wrong... It’s super cool, but I fail to understand why money is being spent on this.
(The cord is $50 because it contains two active chips BTW.)
The ability to also deliver 240W (IIRC?) over the same cable is also a bit different here, it's more like FireWire than a standard networking cable.
Thunderbolt 5's stated "80Gbps" bandwidth comes with some caveats. That's the figure for either DisplayPort bandwidth itself or, in practice, more often realized by combining the data channel (PCIe4x4 ~=64Gbps) with the display channels (=<80Gbps if used in concert with data channels), and potentially it can also do unidirectional 120Gbps of data for some display output scenarios.
If Apple's silicon follows spec, then that means you're most likely limited to PCIe4x4 ~=64Gbps bandwidth per TB port, with a slight latency hit due to the controller. That latency hit is ItDepends(TM), but if not using any other IO on that controller/cable (such as DisplayPort), it's likely to be less than 15% overhead vs native on average, though depending on drivers, firmware, configuration, use case, cable length, how Apple implemented TB5, etc., exact figures vary. And just like how 60FPS average doesn't mean every frame is exactly 1/60th of a second long, it's entirely possible that individual packets or niche scenarios could see significantly more latency/overhead.
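To put that ~64Gbps data channel in perspective for tensor-parallel decoding, here's a ballpark with assumed (not measured) hidden size, layer count, and per-message overhead:

    link_gbps = 64                        # PCIe4x4 tunneled over TB5, best case
    hidden, bytes_per, layers = 8192, 2, 60
    per_msg_bytes = hidden * bytes_per    # one fp16 activation vector per all-reduce step

    wire_us = per_msg_bytes * 8 / (link_gbps * 1e9) * 1e6
    fixed_us = 20                         # assumed per-message latency/overhead
    total_ms = layers * 2 * (wire_us + fixed_us) / 1000   # ~2 all-reduces per layer

    print(f"wire time per message: ~{wire_us:.1f} us; per-token comm budget: ~{total_ms:.1f} ms")
    # The payloads are small, so the fixed latency term dominates -- which is why the
    # comments above stress latency more than raw bandwidth.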
As a point of reference, Nvidia RTX Pro (formerly known as Quadro) workstation cards of the Ada generation and older, along with most modern consumer graphics cards, are PCIe4 (or less, depending on how old we're talking), and the new RTX Pro Blackwell cards are PCIe5. Though comparing a Mac Studio M4 Max, for example, to an Nvidia GPU is akin to comparing Apples to Green Oranges.
However, I mention the GPUs not just to recognize the 800lb AI compute gorilla in the room, but also because while it's possible to pool a pair of 24GB VRAM GPUs to achieve a 48GB VRAM pool between them (be it through a shared PCIe bus or over NVLink), the performance does not scale linearly due to PCIe/NVLink's limitations, to say nothing of the software, configuration, and optimization side of things also being a challenge to realizing max throughput in practice.
This is just as true for a pair of TB5-equipped Macs with 128GB of memory each: using TB5 to achieve a 256GB pool will take a substantial performance hit compared to an otherwise equivalent Mac with 256GB. (Capacities chosen are arbitrary to illustrate the point.) The exact penalty really depends on the use case and how sensitive it is to the latency overhead of using TB5 as well as the bandwidth limitation.
It's also worth noting that with RDMA solutions (no matter the specifics) it's entirely possible to see worse performance than using a single machine if you haven't properly optimized and configured things. This is not hating on the technology, but a warning from experience for people who may have never dabbled: don't expect things to just "2x", or even just better than 1x, performance by simply stringing a cable between two devices.
All that said, glad to see this from Apple. Long overdue in my opinion, as I doubt we'll see them implement an optical network port with anywhere near that bandwidth or RoCEv2 support, much less expose a native (not via TB) PCIe port on anything that's a non-Pro model.
EDIT: Note, many Mac SKUs have multiple TB5 ports, but it's unclear to me what the underlying architecture/topology is there, and thus I can't speculate on what kind of overhead or total capacity any given device supports by attempting to use multiple TB links for more bandwidth/parallelism. If anyone's got an SoC diagram or similar reference data that actually tells us how the TB controller(s) are uplinked to the rest of the SoC, I could go into more depth there. I'm not an Apple silicon/macOS expert. I do however have lots of experience with RDMA/RoCE/IB clusters, NVMeoF deployments, SXM/NVLink'd devices and generally engineering low latency/high performance network fabrics for distributed compute and storage (primarily on the infrastructure/hardware/ops side rather than the software side), so this is my general wheelhouse, but Apple has been a relative blind spot for me due to their ecosystem generally lacking features/support for things like this.
Using more smaller nodes means your cross-node IO is going to explode. You might save money on your compute hardware, but I wouldn't be surprised if you'd end up with an even greater cost increase on the network hardware side.
I don't think I can recommend the Mac Studio for AI inference until the M5 comes out. And even then, it remains to be seen how fast those GPUs are or if we even get an Ultra chip at all.
It might be cost effective, but the supplier is still saying "you get no support, and in fact we might even put roadblocks in your way because you aren't the target customer".
I'm sure Apple could make a killing on the server side, unfortunately their income from their other products is so big that even if that's a 10B/year opportunity they'll be like "yawn, yeah, whatever".
The way this capability is exposed in the OS is that the computers negotiate an Ethernet bridge on top of the TB link. I suspect they're actually exposing PCIe Ethernet NICs to each other, but I'm not sure. But either way, a "Thunderbolt router" would just be a computer with a shitton of USB-C ports (in the same way that an "Ethernet router" is just a computer with a shitton of Ethernet ports). I suspect the biggest hurdle would actually just be sourcing an SoC with a lot of switching fabric but not a lot of compute. Like, you'd need Threadripper levels of connectivity but with like, one or two actual CPU cores.
[0] Like, last time I had to swap work laptops, I just plugged a TB cable between them and did an `rsync`.
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
The "R" in RDMA means there are multiple DMA controllers who can "transparently" share address spaces. You can certainly share address spaces across nodes with RoCE or Infiniband, but thats a layer on top
Edit: to be clear, macOS itself (Cocoa elements) is all SDR content and thus washed out.
The white and black levels of the UX are supposed to stay in SDR. That's a feature not a bug.
If you mean the interface isn't bright enough, that's intended behavior.
If the black point is somehow raised, then that's bizarre and definitely unintended behavior. And I honestly can't even imagine what could be causing that to happen. It does seem like that it would have to be a serious macOS bug.
You should post a photo of your monitor, comparing a black #000 image in Preview with a pitch-black frame from a video. People edit HDR video on Macs, and I've never heard of this happening before.
Which, personally, I find to be extremely ugly and gross and I do not understand why they thought this was a good idea.
Liquid (gl)ass still sucks.
* They already cleared the first hurdle to adoption by shoving inference accelerators into their chip designs by default. It’s why Apple is so far ahead of their peers in local device AI compute, and will be for some time.
* I suspect this introduction isn’t just for large clusters, but also a testing ground of sorts to see where the bottlenecks lie for distributed inference in practice.
* Depending on the telemetry they get back from OSes using this feature, my suspicion is they’ll deploy some form of distributed local AI inference system that leverages their devices tied to a given iCloud account or on the LAN to perform inference against larger models, but without bogging down any individual device (or at least the primary device in use)
For the endgame, I’m picturing a dynamically sharded model across local devices that shifts how much of the model is loaded on any given device depending on utilization, essentially creating local-only inferencing for privacy and security of their end users. Throw the same engines into, say, HomePods or AppleTVs, or even a local AI box, and voila, you’re golden.
EDIT: If you're thinking, "but big models need the higher bandwidth and lower latency of Thunderbolt" or "you can't do that over Wi-Fi for such huge models", you're thinking too narrowly. Think about the devices Apple consumers own, their interconnectedness, and the underutilized but standardized hardware within them with predictable OSes. Suddenly you're not jamming existing models onto substandard hardware or networks, but rethinking how to run models effectively over consumer distributed compute. Different set of problems.
Not really. llama.cpp was just using the GPU when it took off. Apple's advantage is more VRAM capacity.
> this introduction isn’t just for large clusters
It doesn't work for large clusters at all; it's limited to 6-7 Macs and most people will probably use just 2 Macs.
There's definitely something there, but Apple's really the only player setup to capitalize on it via their halo effect with devices and operating systems. Everyone else is too fragmented to make it happen.
I was thinking, "How could we package or run these kinds of large models or workloads across a consumer's distributed compute?" The Engineer in me got as far as "Enumerate devices on network via mDNS or Bonjour, compare keys against iCloud device keys or otherwise perform authentication, share utilization telemetry and permit workload scheduling/balance" before I realized that's probably what they're testing here to a degree, even if they're using RDMA.