Inside the M4 Apple Neural Engine, Part 1: Reverse Engineering
376 points
by zdw
1 month ago
| 25 comments
| maderix.substack.com
LatencyKills
1 month ago
[-]
I worked on the Xcode team for years and know the lengths Apple goes to make this stuff difficult to figure out.

I just wanted to say that you’ve done an excellent job and am looking forward to the 3rd installment.

reply
tiffanyh
1 month ago
[-]
Would you mind explaining more?

“Difficult” because of lack of documentation? Or difficult because of purposefully obfuscating things?

reply
LatencyKills
1 month ago
[-]
There's a lot you can do at build time to make reverse engineering harder than just stripping symbol information.
reply
bri3d
1 month ago
[-]
This seems odd to me. I have never seen obfuscation techniques in first party Apple software - certainly not in Espresso or ANECompiler and overall nowhere at all except in media DRM components (FairPlay).

Apple are really the major OS company _without_ widespread use of a first party obfuscator; Microsoft have WarBird and Google have PairIP.

reply
LatencyKills
1 month ago
[-]
> Apple are really the major OS company _without_ widespread use of a first party obfuscator

You might want to look into techniques like control-flow flattening, mixed boolean–arithmetic transformations, opaque predicates, and dead code injection — Apple uses all of these. The absence of a publicly named obfuscator doesn’t mean Apple doesn’t apply these methods (at least during my time there).
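
As a rough sketch of what two of those techniques do (toy Python of my own, not anything Apple ships): a mixed boolean-arithmetic transform replaces a plain operation with an equivalent but opaque mix of bitwise and arithmetic ops, and an opaque predicate is a branch condition whose value is fixed but not obvious to a decompiler.

```python
# Toy illustrations only; real obfuscators apply these at the compiler IR level.

def add_plain(x: int, y: int) -> int:
    return x + y

def add_mba(x: int, y: int) -> int:
    # Mixed boolean-arithmetic identity: x + y == (x ^ y) + 2*(x & y)
    # (xor is the carry-less sum; (x & y) << 1 re-adds the carries)
    return (x ^ y) + ((x & y) << 1)

def opaque_true(x: int) -> bool:
    # Opaque predicate: x*(x+1) is a product of consecutive integers,
    # hence always even, so this condition is True for every integer x
    return (x * (x + 1)) % 2 == 0
```

Control-flow flattening works on a larger scale: it rewrites structured branches into a state-machine dispatch loop, which is why decompiled output degenerates into one giant switch.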

Ever wonder why Apple stopped shipping system frameworks as individual .dylib files? Here’s a hint: early extraction tools couldn’t preserve selector information when pulling libraries from the shared cache, which made the resulting decompiled pseudocode unreadable.

reply
bri3d
1 month ago
[-]
I'm very familiar with CFG flattening and other obfuscation techniques, thanks.

That's interesting; I suppose I must not have touched the parts of the platform that use them, and I've touched a fair amount of the platform.

Again, I _have_ seen plenty of obfuscation techniques in DRM/FairPlay, but otherwise I have not, and again, I am entirely sure the ANE toolchain from CoreML down through Espresso and into AppleNeuralEngine.framework definitely does not employ anything I would call an obfuscation technique.

> Ever wonder why Apple stopped shipping system frameworks as individual .dylib files?

If the dyld cache was supposed to be an obfuscation tool, shipping the tools for it as open source was certainly... a choice. Also, the reason early tools couldn't preserve selector information was selector uniqueing, which was an obvious and dramatic performance improvement and explained fairly openly, for example - http://www.sealiesoftware.com/blog/archive/2009/09/01/objc_e... . If it was intended to be an obfuscation tool, again it was sort of a baffling one, and I just don't think this is true - everything about the dyld cache looks like a performance optimization and nothing about it looks like an obfuscator.

reply
LatencyKills
1 month ago
[-]
I’m still relatively new to HN, but I continue to find it fascinating when people share their perspectives on how things work internally. Before joining Apple, I was a senior engineer on the Visual Studio team at Microsoft, and it's amazing how often I bump into people who hold very strong yet incorrect assumptions about how systems are built and maintained.

> I suppose I must not have touched the parts of the platform that use them

It’s understandable not to have direct exposure to every component, given that a complete macOS build and its associated applications encompass tens of millions of lines of code. /s

That said, there’s an important distinction between making systems challenging for casual hackers to analyze and the much harder (if not impossible) goal of preventing skilled researchers from discovering how something works.

> Also, the reason early tools couldn't preserve selector information was selector uniqueing

That isn't even remotely how we were making things difficult back then.

I led the SGX team at Intel for a while, working on in-memory, homomorphic encryption. In that case, the encryption couldn’t be broken through software because the keys were physically fused into the CPU. Yet, a company in China ultimately managed to extract the keys by using lasers to remove layers of the CPU die until they could read the fuses directly.

I’ll wrap up by noting that Apple invests extraordinary effort into making the critical components exceptionally difficult to reverse-engineer. As with good obfuscation—much like good design or craftsmanship—the best work often goes unnoticed precisely because it’s done so well.

I'm done here - you go on believing whatever it is you believe...

reply
ghshephard
1 month ago
[-]
I'm thoroughly enjoying this thread by the way, between someone who is clearly informed and educated in platform research, and pretty enthusiastic and interested in the field, and yourself - a deeply experienced engineer with truly novel contributions to the conversation that we don't often see.

Looking very forward to more of your insight/comments. Hopefully your NDA has expired on some topic that you can share in detail!

reply
LatencyKills
1 month ago
[-]
Thank you for your comment. I started this thread just as a simple "job well done" to the authors. I didn't expect to be told that my work doesn't exist. ;-)

No one ever notices plastic surgery when it is done well. The same can be true for obfuscation. But, as I indicated, no amount of obfuscation is foolproof when dealing with experienced, well-funded attackers. The best you can do is make their task annoying.

reply
saagarjha
1 month ago
[-]
The codenames are cute but don’t really do much
reply
asimovDev
1 month ago
[-]
What kind of skillset would one need to work there? I really want to get hired there and add stuff to the vim emulation in Xcode
reply
dostick
1 month ago
[-]
Apply on their website, they've been looking. I got an interview just being an iOS/macOS developer, no tools development experience.
reply
asimovDev
1 month ago
[-]
https://jobs.apple.com/en-us/details/200586465-0836/xcode-in...

I was mostly joking. I'm not from the US and not skilled enough to be worth the bother of creating a visa for me when there are thousands of developers in the USA who are a much better fit. But it is neat to see that the requirements are not as intense as I would've expected

reply
RetpolineDrama
1 month ago
[-]
>I worked on the Xcode team for years

Why did you guys remove the ability to detach the console and move it to another window?

reply
vdivyanshu
1 month ago
[-]
I went digging down the rabbit hole over the last 6 hours on what compute around training can be extracted from M4/M5 Neural Engine chips:

- was able to offload @karpathy's NanoGpt training run (partially) onto the Apple Neural Engine
- moved the Classifier & Softmax layers directly onto the ANE: Classifier is 10x faster, and Softmax is 34x faster
- fixed memory exhaustion: original repo had an ARC memory leak that capped training at ~119 compile loads per process
- patched the C-bridge, allowing continuous, stable training

Repo - https://github.com/vipuldivyanshu92/ANEgpt

reply
3abiton
1 month ago
[-]
That's the best kind of "benders"
reply
bytesandbits
1 month ago
[-]
incredible work
reply
eleventyseven
1 month ago
[-]
> Throughout this series, “we” refers to maderix (human) and Claude Opus 4.6 (by Anthropic) working as a pair. The reverse engineering, benchmarking, and training code were developed collaboratively

Sure, "collaboratively." Why would I ever trust a vibe coded analysis? How do I, a non expert in this niche, know that Opus isn't pulling a fast one on both of us? LLMs write convincing bullshit that even fools experts. Have you manually verified each fact in this piece? I doubt it. Thanks for the disclaimer, it saved me from having to read it.

reply
brookst
1 month ago
[-]
You’d feel better if it was two people you don’t know? Because obviously any random person is 100% accurate, never mistaken, never making shit up?

I don’t understand the mindset, I really don’t. Why are humans held to such a lower standard?

reply
ezst
1 month ago
[-]
Despite all the anthropomorphizing of LLMs, you must have come across already how each has VERY DISTINCT failure modes?
reply
brookst
1 month ago
[-]
Actually… no. Now that you mention it, and thanks for the interesting thought, the failure modes seem pretty similar to me.

Shoddy research / hallucination, tendency to lose the thread, lack of historical / background context… the failure modes are at least qualitatively similar.

Show me an LLM failure and I’ll show you a high profile journalist busted for the same thing. And those are humans who focus on these things!

reply
michaelmrose
1 month ago
[-]
Humans as a class are error prone, but some humans are very, very good in their respective fields. It's often not terribly hard to figure out who these folks are based on resume and credentials, and as a shortcut we can look for markers like terminology, specifics, and confidence when the stakes are low (deciding what to read, as opposed to cancer care for your mom).

AI can trip all the right searches to fool these shortcuts whilst sometimes being entirely full of shit and they have no resume nor credentials to verify should we desire to check.

If you have such credentials and vouch for it, I can consider your trustworthiness rather than its. If you admit you yourself are reliant on it, then this no longer holds.

reply
Anonbrit
1 month ago
[-]
Humans also write endless amounts of convincing bullshit, and have done since time immemorial. False papers and faked results have been a growing scourge in academia before LLMs were a thing, and that's just counting the intentional fraud - the reproducibility crisis in science, especially medical and psychological science, affects even the best designed and well intentioned of studies.

Humans also make mistakes and assumptions while reverse engineering, so it will always need more engineers to go through the results and test things.

reply
withinboredom
1 month ago
[-]
Claude likes to hide bad benchmarks from you, so it will show you where you are clearly winning. You even see some weird benchmarks in the article.
reply
maderix
1 month ago
[-]
Benchmarks are all in part 2. Training progress is in part 3 (upcoming). Also, I think AI-human collaboration is important for goal management. Sure, LLMs bullshit all the time, but that's the role of the human: to create good goals and gating criteria for what constitutes good.
reply
this-is-why
1 month ago
[-]
Agreed. Now is our chance to start pushing back on this. Don’t patronize this. Just glad author admitted it. Next time they won’t tho.
reply
Octoth0rpe
1 month ago
[-]
Part 2 has benchmarks: https://maderix.substack.com/p/inside-the-m4-apple-neural-en...

6.6 FLOPS/W, plus the ability to completely turn off when not in use, so 0W at idle.

reply
AceJohnny2
1 month ago
[-]
But not 38 TOPS that Apple claims, with the weak explanation of

> Apple’s “38 TOPS INT8” is computed as 19 TFLOPS FP16 × 2, following the industry convention of counting INT8 operations as 2× the FP16 rate. But the hardware doesn’t actually execute INT8 operations twice as fast.
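
Spelled out, the quoted marketing arithmetic is just (numbers from the quote above):

```python
fp16_tflops = 19.0                    # FP16 rate cited in the quote
int8_tops_marketed = fp16_tflops * 2  # convention: count INT8 at 2x the FP16 rate
print(int8_tops_marketed)             # 38.0, i.e. the "38 TOPS" headline number
```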

Why Apple would follow that convention when the hardware explicitly doesn't support it seems like a more straight-faced lie than I expect from Apple

reply
Shebanator
1 month ago
[-]
You assume the marketing folks actually talk with the hardware folks. More likely it's a big game of telephone....
reply
AceJohnny2
1 month ago
[-]
there's an apocryphal story that when one of Apple's chips was nearing 10B transistors, marketing asked the chip folks if they could round it up to 10B for their copy. The chip folks were confounded, and said no they didn't have any uncounted transistors to round it up, and they didn't approve of claiming 10B transistors when it wasn't.

(This was a while ago. I see the M4 is at 28 B)

Which is why I'm all the more surprised that Apple would claim 2x more ANE TOPS than it really delivers.

reply
Sephr
1 month ago
[-]
You're off by a factor of a trillion. It's 6.6 TFLOPS/W.
reply
Octoth0rpe
1 month ago
[-]
Well, better to be off by that much here than on my next jira ticket size estimate.

thanks

reply
zozbot234
1 month ago
[-]
We already knew the very basics of much of this from documentation of the M1/M2 ANE as accessed bare-metal from Asahi Linux, but it's nice to see confirmation and further in-depth exploration. Note that according to OP's Parts 1/2, for very large matmuls CoreML adds little to no overhead compared to the lower-level interface, so there seems to be plenty of scope for supporting the ANE for prefill in local AI frameworks. Decode is generally memory-bandwidth limited unless context is very large, and the ANE requires special handling (converting from matmul to 1x1 convolution as described here is wasteful of memory bandwidth, as is potentially dequantizing to INT8/FP16 in memory), so it's less of a clear win.
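
To make the matmul-to-1x1-convolution point concrete, here's a minimal numpy sketch (my own illustration, not code from the articles): a 1x1 convolution in channel-first layout computes exactly a matrix multiply at every spatial position.

```python
import numpy as np

# A matmul X @ W (X: [N, Cin], W: [Cin, Cout]) expressed as a 1x1
# convolution over a [1, Cin, 1, N] tensor, the channel-first layout
# the ANE prefers.
rng = np.random.default_rng(0)
N, cin, cout = 8, 16, 4
X = rng.standard_normal((N, cin)).astype(np.float32)
W = rng.standard_normal((cin, cout)).astype(np.float32)

ref = X @ W                                   # [N, Cout]

x_nchw = X.T.reshape(1, cin, 1, N)            # input as [1, Cin, 1, N]
w_conv = W.T.reshape(cout, cin, 1, 1)         # filters as [Cout, Cin, 1, 1]

# 1x1 "conv": at each of the N spatial positions, dot the Cin input
# channels with each of the Cout filters (sum over the channel axis).
out = np.einsum('bchw,ocij->bohw', x_nchw, w_conv)  # [1, Cout, 1, N]

assert np.allclose(out[0, :, 0, :].T, ref, atol=1e-5)
```

The arithmetic is identical; the cost is the layout shuffling around it, which is the memory-bandwidth waste referred to above.
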
reply
GeekyBear
1 month ago
[-]
The recent news is that Apple is supposedly replacing the Core ML framework with an updated version that will make it easier to integrate third party LLMs into your apps.

> the company is also planning a few other software-based AI upgrades, including a new framework called Core AI. The idea is to replace the long-existing Core ML with something a bit more modern.

https://www.bloomberg.com/news/newsletters/2026-03-01/apple-...

reply
reverius42
1 month ago
[-]
I scoffed, thinking "more modern? It's pretty recent, right?" and then I realized it's coming up on 10 years old and in AI years that's like 70 years, isn't it.

I wonder to what extent this is a branding exercise; the framework that will replace Core ML could have just as easily been called "Core ML", except the current hotness is "AI" and not "ML".

reply
GeekyBear
1 month ago
[-]
If I had to guess, I'd say that Core AI is going to be an ease of use wrapper around MLX.
reply
behnamoh
1 month ago
[-]
It's insane that the source code of ANE is not available even to the MLX team, possibly one of the reasons Awni (MLX project head) left Apple.
reply
bri3d
1 month ago
[-]
This doesn’t seem that weird to me?

* They haven’t said the source isn’t available to them, just that the closed nature of the ANE means they can’t use it in OSS.

* They’ve repeated constantly that it can’t do backprop and isn’t useful for most MLX use cases.

And really, ANE isn't even that interesting for MLX; it's a limited-resource, power-efficient inference engine for smallish edge models. If you want to use it you can use the Apple APIs, which while limited are generally "shaped" like what you'd want to do anyway. Almost every "biggish" CPU has one of these now and Apple don't want to give away the specifics of theirs (even though it's been pretty thoroughly RE'd by real REs and re-summarized by Claude, like this article).

reply
blobbers
1 month ago
[-]
Can someone help me understand when these neural engines kick in in open source software?

I typically use python ML libraries like lightgbm, sklearn, xgboost etc.

I also use numpy for large correlation matrices, covariance etc.

Are these operations accelerated? Is there a simple way to benchmark?

I see a lot of benchmarks on what look like C functions, but today in my job I rely on higher-level libraries. I don't know if they perform any better on Apple HW, and unless they have a flag like use_ane I'm inclined to think they don't.

Of course chatgpt suggested I benchmark an Intel Mac vs. newer apple silicon. Thanks chatgpt, there's a reason people still hate AI.
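
For the benchmarking question, a rough sketch: numpy's matmul goes through whatever BLAS the wheel was built against (Accelerate on recent macOS arm64 wheels, which runs on the CPU/AMX, not the ANE), so you can check the backend and time a big multiply.

```python
import time
import numpy as np

np.show_config()  # prints which BLAS backend this numpy build uses

n = 2048
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

# an n x n matmul does ~2*n^3 floating-point operations
gflops = 2 * n**3 / elapsed / 1e9
print(f"{n}x{n} fp32 matmul: {elapsed*1e3:.1f} ms, ~{gflops:.0f} GFLOP/s")
```

Comparing that figure across machines answers the "is it accelerated" question more directly than hunting for a flag; as far as I know, none of lightgbm, sklearn, or xgboost target the ANE.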

reply
zozbot234
1 month ago
[-]
> when these neural engines kick in in open source software?

It mostly doesn't because NPUs are bespoke and vendor-specific (which incents neglect by software devs working on open source numerics and ML/AI infrastructure), and the Apple ANE is no exception. Part of this effort is most likely about fixing that for the specific case of the Apple ANE.

reply
blobbers
1 month ago
[-]
Part of which effort? The reverse engineering in the blog article, so it can be used?

I just think: great, it seems like I'm paying for a hardware accelerator that makes Siri go faster. And I've used Siri on my laptop exactly 0 times in the last infinite years.

reply
bri3d
1 month ago
[-]
It also makes a lot of really useful features work: on-device OCR, captions, voice isolation, temporal antialiasing in MetalFX, an enormous host of things in the Apple pro apps, etc.
reply
blobbers
1 month ago
[-]
Yeah, I don't use any of those features. So it sounds like it's for folks who are creatives running Lightroom or some Apple movie or sound program?

I'm a dev, not a creative, unfortunately. I don't use other people's software, I generally write my own (or used to before Claude took over my world).

reply
saagarjha
1 month ago
[-]
You bought a truck. Obviously there will be some part of it you don’t use.
reply
blobbers
1 month ago
[-]
I think it's more like it advertises it can climb hills in crawl mode, but it turns out it's only specific hills.

So fundamentally, it still comes down to CPUs + RAM.

reply
this-is-why
1 month ago
[-]
Yes. Numpy will accelerate if it detects hardware that it supports.
reply
blobbers
1 month ago
[-]
I can't find any docs that numpy will do this.

https://opensource.apple.com/projects/mlx/ is needed to do this?

reply
this-is-why
1 month ago
[-]
Normally I’d help a bro out but I started and googled and got hundreds of results and realized why does everyone need to be spoon fed. Please go do some work yourself mkay?
reply
notepad0x90
1 month ago
[-]
I've been guilty of this myself, but every other comment here is like "What about <insert something unrelated to the topic but related to apple>".
reply
saagarjha
1 month ago
[-]
I think they’re fairly reasonable now at least
reply
msie
1 month ago
[-]
I remember the good old days when Apple was desperate for developers and produced great documentation and there were a lot of great 3rd-party books too. You can't just give out awards in hopes that someone will make that great app.
reply
pstuart
1 month ago
[-]
Yeah, the Inside Macintosh guides were epic.
reply
msie
1 month ago
[-]
And it's not like Apple can't spend several million to bring back that first class documentation. People actually spent money on those books too.
reply
love2read
1 month ago
[-]
This article was clearly written by a human (and AI) but still has a few "LLMisms" such as:

- The key insight - [CoreML] doesn't XXX. It YYY.

With that being said, this is a highly informative article that I enjoyed thoroughly! :)

The article links to their own Github repo: https://github.com/maderix/ANE

reply
walthamstow
1 month ago
[-]
We've got about a year before so many people are interacting with LLMs on a daily basis that its style starts to reverse infect human speech and writing
reply
baxtr
1 month ago
[-]
Great insight – Would you like to try and identify some specific "AI-isms" that you've noticed creeping into your own writing or your colleagues' emails lately?
reply
srini_reddy
1 month ago
[-]
People are okay with using "delve" now.
reply
pixl97
1 month ago
[-]
That said, there were people that talked like this before LLMs; it didn't develop out of whole cloth.
reply
DrScientist
1 month ago
[-]
Exactly. LLMs are mimics.

People seem to go around pointing out that people talk like parrots, when in reality it's the parrots that talk like people.

reply
pixl97
1 month ago
[-]
I mean, it's both.

Did you develop your own whole language at any point to describe the entire world? No, you, me, and society mimic what is around us.

Humans have the advantage, at least at this point, of being a continuous learning device so we adapt and change with the language use around us.

reply
pcrh
1 month ago
[-]
The article above doesn't read well, at all.

It's not my subject, but it reads as a list of things. There's little exposition.

reply
dylan604
1 month ago
[-]
Gawd Damn LISTICLES!!!! And all of those articles that list in bullet points at the top of the article the summary of the article. And all of those people saying they don't want to read exposition, just give me the bullet points.
reply
Angostura
1 month ago
[-]
My honest take? You're probably right
reply
sholladay
1 month ago
[-]
You are absolutely right.

Here is why you are correct:

- I see what you did there.

- You are always right.

reply
gogopromptless
1 month ago
[-]
It's already happened to me. I've started to have dreams where, instead of some sort of interpersonal struggle, the entire dream is just a chatbot UI viewport and I'm arguing with an LLM streaming the responses in. Which is super trippy when I become aware it's a dream. In the old days I'd dream about playing chess against myself and losing, which was quite a bizarre feeling because my brain was running both players. But that's totally normal compared to having my brain pretend to be an LLM inside a dream.
reply
dumpsterdiver
1 month ago
[-]
What’s the intent of pointing out the presumed provenance in writing, now that LLMs are ubiquitous?

Is it like one of those “Morning” nods, where two people cross paths and acknowledge that it is in fact morning? Or is there an unstated preference being communicated?

Is there any real concern behind LLMs writing a piece, or is the concern that the human didn’t actually guide it? In other words, is the spirit of such comments really about LLM writing, or is it about human diligence?

That raises another question: does LLM writing expose anything about the diligence of the human, outside of when it's plainly incorrect? If an LLM generates a boringly correct report - what does that tell us about the human behind that LLM?

reply
rafram
1 month ago
[-]
Also the Prior Art section, which has telltale repetition of useless verbs like "documenting," "providing insight into," and "confirming" on each line. This was definitely AI-written, at least in part.
reply
tzs
1 month ago
[-]
Below are the items from that section. How should they be written to not look like an AI?

> hollance/neural-engine — Matthijs Hollemans’ comprehensive community documentation of ANE behavior, performance characteristics, and supported operations. The single best existing resource on ANE.

> mdaiter/ane — Early reverse engineering with working Python and Objective-C samples, documenting the ANECompiler framework and IOKit dispatch.

> eiln/ane — A reverse-engineered Linux driver for ANE (Asahi Linux project), providing insight into the kernel-level interface.

> apple/ml-ane-transformers — Apple’s own reference implementation of transformers optimized for ANE, confirming design patterns like channel-first layout and 1×1 conv preference.

reply
leoedin
1 month ago
[-]
The AI-ism that annoys me the most is the unnecessary hubris. Just sampling a small portion of the linked article:

"Here’s the fascinating part:", "And one delightful discovery: "

Personally I find the AI-isms take away from the voice of the author. What does the author find interesting? What was their motivation? It's all lost in a sea of hubris and platitudes.

There's almost certainly a positive side - technical people who aren't so good at communication can now write punchy deep-tech blogs. But what's lost is the unique human voice that is normally in every piece of writing. It's like every blog is rewritten by a committee of copywriters before it's published. Bleurgh.

reply
qaadika
1 month ago
[-]
The grammatical structure in the middle two is identical, and they're all similar in that way.

- "- Name - {Noun with modifiers} {comma} {verb-ing with modifiers}."

- "- Name - {Noun with modifiers} {comma} {verb-ing with modifiers}."

The phrasing is the same, which I notice sometimes happens in my own notes, but it's most noticeable when an LLM is asked to summarize items. An LLM written job description (without major prompting) for a resume comes out the same way, in my experience. It's the simplest full-sentence grammar for describing what something is, and then what something does.

If we used the developer's descriptions (from the github repo) to populate the info, it would look like this:

- hollance/neural-engine - Everything we actually know about the Apple Neural Engine (ANE)

- mdaiter/ane - Reverse engineered the Apple Neural Engine, with working Python and Objective C samples

- eiln/ane - Reverse engineered Linux driver for the Apple Neural Engine (ANE).

- apple/ml-ane-transformers - Reference implementation of the Transformer architecture optimized for Apple Neural Engine (ANE)

IMO it may not be as information-packed as the LLM list, but it is more interesting to read. I can tell, or at least think I can tell, that different individuals wrote each description, and it's what they wanted me to know most about their project.

If I were making a list of software during research (that would eventually turn into a report), the particular details I write down in the moment would be different, depending on the solution I'm looking for or features it has or doesn't have, will add or won't add. I don't try to summarize "the Whole Project" in one clean bullet point; I (or my readers) can re-read the repo for that, or glean it from surrounding context (presuming enough surrounding context was written). But unless I made an effort later to normalize the list, the grammar, length, and subpoints would vary from the form-identifiable "LLM Concise Summary." It's more work for me to write to a standard, and even more work to consciously pick one.

EDIT: Upon re-reading the article, I noticed the "Prior Art" section is written in past tense, as I would expect. But the list is in present tense. I feel like it jumps from "narrative" to "technical details list" back to "narrative". And the list is 70% of the section! I wouldn't mind reading a whole paragraph describing each project, what worked, what didn't, what they could use and what they couldn't, in the past tense, if it were interestingly written. Something that tells me the author dove into the previous projects, experimented with them, or interacted with the developers. Or something interesting the author noticed while surveying the "prior art". But "interestingly written" isn't really the LLM's goal, nor its ability. Its goal is maximal information transfer with minimal word count. So the result is a list that smells like the author merely read the repo readme and wrote a summary for the masses in a technical report.

tl;dr The list is just "a list", and that makes it not interesting to read. If it was not interesting to read it was probably not interesting to write, which I take as an LLM writing it.

reply
nbardy
1 month ago
[-]
Why does apple want to make this hardware hard to access?

What actual benefits do they get?

I guess they can have their own models run faster than the competition on their hardware? But they don't even really have anything that consumers use on the ANE as far as I can tell and local LLMs are taking off on macs and could really benefit from this

reply
owlbite
1 month ago
[-]
I suspect the main benefits are that they have no need to maintain the hardware or software for any longer than makes sense for their own needs, and they don't have to handhold users through a constantly evolving minefield of performance and technical capabilities.
reply
instahotstar
1 month ago
[-]
Really impressive reverse engineering work. I’m curious how much of the Neural Engine’s instruction set is undocumented versus inferred experimentally. Also wondering how Apple balances power efficiency vs peak throughput in the M4 compared to previous generations.
reply
cedws
1 month ago
[-]
I’m surprised that Claude assisted with this reverse engineering work. I used Codex recently for a similar purpose and got an account warning. Initially refused to do it, and then I was able to trick it. Seems I might have to make the jump back.
reply
asimovDev
1 month ago
[-]
wow really? I used Copilot for reverse engineering an application couple months back and it was eager to help, happily consuming wireshark logs and whatnot
reply
cedws
1 month ago
[-]
Yes, apparently OpenAI is super paranoid about 5.3 Codex being abused so if you trip their safety mechanism you have to verify your ID.
reply
mattlangston
1 month ago
[-]
The future is bright for software engineers.

The big takeaway isn't reverse engineering the ANE per se, but what Manjeet could do with his software engineering skills when accelerated by AI.

This is a good example of the present state of software engineering. Not future state - present state.

reply
giancarlostoro
1 month ago
[-]
Reverse engineering with AI is only going to get better. I have seen some crazy things friends of mine have done with Claude alone. Let's just say SaaS isn't the only industry that could one day suffer.
reply
Geee
1 month ago
[-]
Is it really worth having separate GPU and NE? Seems redundant and weird compared to what Nvidia is doing, i.e. "GPUs are good NEs", or is that not really true?
reply
jasonwatkinspdx
1 month ago
[-]
No, GPUs are not what you'd design for neural networks from first principles. They were adopted for that because they offered far more parallelism than general-purpose CPUs, not because they're ideal. That's why Google et al. designed TPUs that have a very different internal structure.

Most TPU designs have been based around systolic arrays, which give a quadratic speedup for matrix ops. A typical design is a 128x128 array of MAC units. You shift weights along one dimension, parameters along the other. It takes 128 cycles to shift a full matrix input in, then 128 cycles to shift the answer back out, but during those 256 cycles the array completes a full 128x128 matrix multiply (128^3, about 2 million MACs), averaging 8,192 MAC operations per cycle: a factor of 64 speedup over a 128-wide vector unit.
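
A toy cycle-by-cycle model of that dataflow (my own sketch, using the output-stationary variant rather than the weight-stationary one described, but the pipelining arithmetic is the same): operand a[i,k] meets b[k,j] at cell (i,j) on cycle i+j+k, so an n x n multiply drains in 3n-2 cycles while all cells accumulate in parallel.

```python
import numpy as np

def systolic_matmul(A, B):
    """Model of an output-stationary systolic array: cell (i, j)
    accumulates C[i, j]; A streams in row-wise and B column-wise,
    skewed so that a[i, k] and b[k, j] meet on cycle i + j + k."""
    n = A.shape[0]
    C = np.zeros_like(A)
    cycles = 3 * n - 2                 # last operands meet at cycle 3(n-1)
    for t in range(cycles):
        for i in range(n):
            for j in range(n):
                k = t - i - j          # which operand pair arrives at (i, j) now
                if 0 <= k < n:
                    C[i, j] += A[i, k] * B[k, j]
    return C, cycles

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
B = rng.standard_normal((8, 8))
C, cycles = systolic_matmul(A, B)
assert np.allclose(C, A @ B)
print(cycles)  # 22 cycles for n=8, during which all 8**3 = 512 MACs complete
```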

The other big appeal of this design is it's way simpler than GPUs. The memory access patterns are predictable, there's no threads or thread divergence, etc. So it can be way more efficient in silicon, not just in area but especially in power efficiency.

There's other ideas for architectures besides this basic systolic array idea. If you want to learn about them, a good place would be the HotChips presentations of the last few years: https://hc2025.hotchips.org and similar domain names for prior years.

reply
0-_-0
1 month ago
[-]
When you already have a GPU in a system, adding tensor cores to it is much more efficient than adding a separate NPU which needs to replicate all the data transfer pipelines and storage buffers that the GPU already has. Besides, Nvidia's tensor cores are systolic.
reply
jasonwatkinspdx
1 month ago
[-]
No, if that were the case, then Google would have made GPUs + NN cores vs TPUs.

There's far more microarchitectural complexity in GPUs that actually isn't efficient for NN structures.

"Systolic array" actually means something more specific than "repeated structures on a die."

Again, I'd suggest referencing the various HotChips presentations. It's a really interesting topic area. Or the original TPU v1 paper for the basics.

reply
0-_-0
28 days ago
[-]
Why would Google need graphics functionality to train neural networks?
reply
re-thc
1 month ago
[-]
That’s not what Nvidia is doing.

AMD originally went all-in on what you call a GPU. It was great for gaming, not as much for inference.

Nvidia, whilst still making a GPU, tuned the architecture for AI workloads. Gaming hasn’t improved as much lately.

reply
kamranjon
1 month ago
[-]
I have always wondered if the neural engine could be used for training - pretty excited for part 3 of this to see if the juice is actually worth the squeeze
reply
juancn
1 month ago
[-]
In principle most if not all inference hardware should be usable for training.

Efficiency is the question.

reply
daoistmonk
1 month ago
[-]
Tangential: Is anyone doing something similar to accelerate the support matrix of Linux on anything higher than M2?
reply
grey-area
1 month ago
[-]
If only they could fix the iOS autocomplete, which is getting worse with every iteration.
reply
ericol
1 month ago
[-]
> human intuition driving the exploration

This, a thousand times this.

For me, what AI brings is augmented humans. Just as we don't calculate on paper anymore, what is the reason for doing things by hand when a machine is X times better?

Want to code by hand, as artisans of old? Suit yourself.

I, for one, love the smell of burning chrome.

reply
pklausler
1 month ago
[-]
If "AI" were doing anything more than repeating content from the web without attribution, I might agree with you.
reply
ericol
24 days ago
[-]
It's not exactly that...
reply
rayiner
1 month ago
[-]
Holy crap, 32MB of SRAM on the chip for AI.
reply
FL33TW00D
1 month ago
[-]
Unreadable Claude slop
reply
mayhemducks
1 month ago
[-]
I never realized just how much hardware engineering Apple dedicated to enabling people to type faster with their thumbs!
reply
poszlem
1 month ago
[-]
Genuine question, not trying to throw shade or anything, but are those cores actually useful with the state of Apple Intelligence being what it is?
reply
rahkiin
1 month ago
[-]
They are also used by ML models that are deeply integrated into macOS and iOS without you knowing, like object and text detection in images.
reply
geerlingguy
1 month ago
[-]
And help in Photos, Final Cut Pro, and other apps.
reply
willis936
1 month ago
[-]
I wish they would (or wouldn't if they are) hook it up to the ios keyboard.
reply
dagmx
1 month ago
[-]
If you strip away the branding, Apple has and continues to ship a ton of algorithms that likely use the ANE and end users can use CoreML to do the same.

Just some things people will likely take for granted that, IIRC, Apple has said use the ANE or would at least likely benefit from it: object recognition, subject extraction from images and video, content analysis, ARKit, spam detection, audio transcription.

reply
sroussey
1 month ago
[-]
Don’t forget FaceID and much of the image manipulation.

And while everyone else went to more powerful giant LLMs, Apple moved most of Siri from the cloud to your device. Though they do use both (which you can see when Siri corrects itself during transcription—you get the local Siri version corrected later by the cloud version).

reply
mschuster91
1 month ago
[-]
IIRC, FaceID has been a thing before ML entered the picture.
reply
dagmx
1 month ago
[-]
FaceID was introduced alongside the ANE. It was its raison d’être when introduced.
reply
stetrain
1 month ago
[-]
Apple's OSes run a lot of local ML models for many tasks that aren't branded as Apple Intelligence, and they have done so for many years now.
reply
llm_nerd
1 month ago
[-]
reply
malshe
1 month ago
[-]
This is a nice article. Thanks for sharing.
reply
esafak
1 month ago
[-]
You can convert your own ML models to MLX to use them; Apple Intelligence is not the only application.
reply
nullstyle
1 month ago
[-]
MLX does not run on NPUs AFAIK; just gpu and cpu. You have to use CoreML to officially run code on the neural engine.
reply
mirsadm
1 month ago
[-]
Even then there is no transparency on how it decides what runs on the ANE/GPU etc
reply
sroussey
1 month ago
[-]
Correct. OS level stuff get first priority, so you can’t count on using it.
reply
znagengast
1 month ago
[-]
Turns out third party actually gets priority for ANE
reply