Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

282 points

by ag2718

3 days ago

| 15 comments

| aarushgupta.io

https://web.archive.org/web/20260609200156/https://aarushgup...

▲

Lerc

3 days ago

[-]

Has there been much exploration of how much benefit comes from precision in activation functions in KANs? There's a little niggle in the back of my head that maybe 90% of the benefit of KANs can be gained from quite a small variety of function shapes. Combined with input weighting, I almost feel you could have a representation that scales from a standard ReLU perceptron through KANs to something with weighted inputs and fancy weighted activation functions.

If you mark that out in 2D, with axes of input weight precision and activation weight precision, you could perhaps do sweeps to find the best accuracy per parameter bit, or accuracy/speed, or some sweet spot that has a nice balance of operating speed, accuracy, and model size.
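A toy version of such a two-axis sweep might look like the sketch below. It is not taken from the paper: it stands in for the two precision axes with the input and output bitwidths of a single tabulated activation (`tanh` here), and the ranges and the helper names `quantize`, `lut_error` and `sweep` are all illustrative assumptions.

```python
import math

def quantize(x, bits, lo, hi):
    """Round x to the nearest of 2**bits evenly spaced levels on [lo, hi]."""
    steps = (1 << bits) - 1
    x = min(max(x, lo), hi)
    step = (hi - lo) / steps
    return lo + round((x - lo) / step) * step

def lut_error(fn, in_bits, out_bits, lo=-4.0, hi=4.0,
              out_lo=-1.0, out_hi=1.0, samples=2001):
    """Worst-case error when both the input and the output of fn are quantized."""
    worst = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        approx = quantize(fn(quantize(x, in_bits, lo, hi)), out_bits, out_lo, out_hi)
        worst = max(worst, abs(approx - fn(x)))
    return worst

def sweep(fn, bit_range=range(2, 9)):
    """Error for every (input bits, output bits) pair: the 2D grid to search."""
    return {(i, o): lut_error(fn, i, o) for i in bit_range for o in bit_range}

grid = sweep(math.tanh)
for (in_bits, out_bits), err in sorted(grid.items()):
    if in_bits == out_bits:
        print(in_bits, out_bits, round(err, 4))
```

A real sweep would replace the worst-case error with task accuracy and add cost axes (table size, latency), but the shape of the search is the same.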

▲

ag2718

3 days ago

[-]

There is definitely a precision-performance tradeoff to consider. We explored this through ablation studies on bitwidth precision / resource usage in our work (Figure 6a in https://arxiv.org/pdf/2512.12850, Figure 4 in https://arxiv.org/pdf/2602.02056). Further exploration into the mechanics here would definitely be useful.

Regarding your point that "90% of the benefit of KANs can be gained from a small variety of function shapes": even within the B-spline basis, the shapes are quite uniform. Much of the actual benefit of scaling up the basis size comes from learning more complex, piecewise-polynomial activation functions. Scaling up the number of basis functions (i.e. more granular intervals) also increases locality and allows the activation function's value across different parts of the domain to be learned semi-independently. (There obviously is a tradeoff here with overfitting.)

The number of basis functions (G+S) is largely what determines how expressive the activation is, as it relates to your point: "you could have a representation that scales from a standard relu perceptron through KANs to something with weighted inputs and fancy weighted activation functions."
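To make the basis count and the locality concrete, here is a minimal pure-Python sketch using the Cox-de Boor recursion. It reads G as the number of grid intervals and S as the spline degree, which is an assumption about the notation; the helper names and the uniform knot layout are likewise illustrative rather than taken from the paper.

```python
def bspline_basis(x, knots, degree):
    """Evaluate all B-spline basis functions of the given degree at x (Cox-de Boor)."""
    # Degree-0 functions are indicators of the half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = (x - knots[i]) / left_den * basis[i] if left_den else 0.0
            right = (knots[i + d + 1] - x) / right_den * basis[i + 1] if right_den else 0.0
            nxt.append(left + right)
        basis = nxt
    return basis

def uniform_knots(grid_intervals, degree, lo=-1.0, hi=1.0):
    """Uniform knot vector extended by `degree` knots beyond each end of [lo, hi]."""
    h = (hi - lo) / grid_intervals
    return [lo + (i - degree) * h for i in range(grid_intervals + 2 * degree + 1)]

G, S = 5, 3  # 5 grid intervals, cubic splines (assumed meaning of G and S)
knots = uniform_knots(G, S)
values = bspline_basis(0.1, knots, S)
active = [i for i, v in enumerate(values) if v > 0.0]
print(len(values), active)
```

With these settings there are G+S basis functions in total, but only S+1 of them are nonzero at any single input, so each learned coefficient shapes the activation only over a small part of the domain.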

▲

zipy124

3 days ago

[-]

Can I just say that this is extremely impressive work for a master's level thesis. Incredible work and I hope you manage to continue fulfilling your fantastic potential in your career!

▲

ag2718

2 days ago

[-]

Thank you :)

▲

hodgehog11

3 days ago

[-]

The benefit in KANs is interpretability, not expressivity. It's a structure that lends itself well to performing symbolic regression or other interpretable downstream tasks. This can make it better suited for scientific tasks, for example. You can easily replicate the practical performance of any KAN with an MLP, and it will train and run faster on modern architectures. This work proposes a method by which a KAN might be faster, but it still looks like early days to me.

Precision in the activation function is targeting a part of neural networks that you don't want. There are many other methods that work with high precision. You use neural networks because of their implicit bias toward regular solutions. That means there is a sweet spot at low precision that you're targeting.

▲

ag2718

3 days ago

[-]

A key benefit of KANs is expressivity, as each layer is significantly more expressive than an MLP layer. This can be seen in our benchmarks: KANs need fewer layers than MLPs to match or beat their performance, even in software.

However, on GPUs, KAN implementations are far less efficient than MLPs, since B-spline locality is hard to exploit and lookup operations aren't as efficient. This is your original point about MLPs training and running faster on modern architectures: each KAN layer is more expressive, but its poor hardware efficiency makes it a net negative (at least for current approaches).

On FPGAs, LUT lookups are cheap, so KANs' expressive layers map to very hardware-efficient implementations, and the resulting networks are thus much more compact and efficient than equivalent MLPs.
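As a rough software model of that idea (a sketch, not the KANELE implementation), each edge activation can be tabulated once over a quantized input range, after which a KAN layer reduces to table lookups and additions. The functions in `edge_fns` are hypothetical stand-ins for learned activations, and the bitwidth and input range are arbitrary choices.

```python
import math

def build_lut(fn, bits, lo, hi):
    """Tabulate fn at the 2**bits quantized input levels on [lo, hi]."""
    n = 1 << bits
    return [fn(lo + (hi - lo) * code / (n - 1)) for code in range(n)]

def to_code(x, bits, lo, hi):
    """Map a real input to its nearest integer code (the table address)."""
    n = 1 << bits
    x = min(max(x, lo), hi)
    return round((x - lo) / (hi - lo) * (n - 1))

def kan_layer(xs, luts, bits, lo, hi):
    """out[j] = sum_i phi_ij(x_i), where every phi_ij is a table lookup."""
    codes = [to_code(x, bits, lo, hi) for x in xs]
    return [sum(luts[j][i][c] for i, c in enumerate(codes))
            for j in range(len(luts))]

# Hypothetical stand-ins for learned activations: 2 inputs, 2 outputs.
edge_fns = [[math.sin, math.tanh], [abs, lambda x: x * x]]
BITS, LO, HI = 6, -2.0, 2.0
luts = [[build_lut(f, BITS, LO, HI) for f in row] for row in edge_fns]

out = kan_layer([0.5, -1.0], luts, BITS, LO, HI)
print(out)
```

There are no multiplications in the forward pass, only addressing and adding, which is the property that makes this structure map naturally onto lookup-table fabric.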

On your second point: low precision is certainly viable for both inference and learning (as shown in our work), and quantization can even have a mild regularizing effect. However, task performance generally worsens with lower precision (here and across the literature): the use of low precision is fundamentally a result of the efficiency-performance tradeoff.

▲

hodgehog11

3 days ago

[-]

I generally agree with this rebuttal. Each KAN layer is more expressive on a per-layer basis, although there is a mapping to an MLP with more layers. With the current hardware implementations, yes, MLPs have an advantage overall. I can certainly respect the intention to make KANs faster, since it is a serious issue for more widespread adoption, and KANs certainly have their value.

I'm still very skeptical of arguing for KANs as an eventual replacement, like I've seen some papers on the subject argue. The reduced depth may not be an advantage. For example, higher depth for standard neural networks doesn't just add to expressivity, it actually induces spectral sparsity bias. KANs have a bias of their own, but it is different, and is sometimes better, sometimes worse, depending on the task. If increasing depth turns out to be important, KANs might remain less efficient overall.

▲

ag2718

2 days ago

[-]

Ah I see, that's an interesting point about higher depth potentially having other benefits. For our work on smaller models (e.g. generally <5 layers), this might not have been as relevant but I would definitely be interested to see implications for much deeper networks. As to your point about KANs performing better or worse depending on the specific task, we definitely did notice this to some extent (symbolic tasks were the best, non-symbolic tasks such as image recognition were the worst).

▲

Lerc

2 days ago

[-]

>symbolic tasks were the best, non-symbolic tasks such as image recognition were the worst

I wonder how much of that is not so much the overall task but the need to build up to a complex state where KANs can excel. If you consider the classic neural net edge detector example, it's hard to imagine a KAN doing the task more efficiently. It seems like a necessary task as part of the overall process, but delegating a more capable system to a menial task is probably wasting resources.

One layer of conv2d might be enough to turn pixels into something that KANs manage better.

▲

ag2718

2 days ago

[-]

This is definitely true: one could imagine a model with a mix of the two layers or a simple linear / MLP-like kernel doing "preprocessing" before KAN layers. Other work that explores task performances for KANs and MLPs generally finds KANs are worse at non-symbolic tasks, but it would be interesting to see if hybrid architectures could improve on this failure mode.

▲

mikeayles

3 days ago

[-]

So for people wondering if it can be used to accelerate LLM inference, sadly not.

I've been trying to hit 100,000 tokens/s with a 3.28m dumb model, and even this is an order of magnitude too large to benefit.

It appears to be focussed more on latency than throughput. Happy to be corrected.

▲

ssivark

3 days ago

[-]

When aiming for 100k tok/s, you would still have CUDA overheads (on the order of microseconds) -- which might become the bottleneck, even if you do everything else right with the inference architecture. How are you planning to overcome that?

EDIT: Oh, on second read, do you mean you're running the model on an FPGA?

▲

taneq

3 days ago

[-]

You might be conflating throughput with latency. 100k tok/s is very different to 1 tok/10us.
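The distinction can be put in numbers with a small sketch. The 10 ms latency and the 1000 sequences in flight are made-up figures chosen only to hit the same rate.

```python
def throughput(latency_s, in_flight):
    """Steady-state results per second when `in_flight` items overlap
    (batched or pipelined), each taking `latency_s` end to end."""
    return in_flight / latency_s

serial = throughput(10e-6, 1)       # one token at a time, 10 us each
batched = throughput(10e-3, 1000)   # 10 ms per token, 1000 sequences in flight

print(round(serial), round(batched))
```

Both configurations deliver about 100k tok/s, but only the first one needs each individual token to complete in 10 us.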

▲

ssivark

3 days ago

[-]

When doing autoregressive inference, how often do you do a CUDA kernel call? What is the main bottleneck at the throughputs you're operating at?

▲

ag2718

3 days ago

[-]

You're correct that this work is not very applicable for LLMs and that the focus here is primarily on latency.

▲

ai_fry_ur_brain

3 days ago

[-]

Was anyone thinking this?

▲

scivizlabvienna

2 days ago

[-]

I am using an almost identical architecture, a combination of LUT-NN and BitNet, on an upcoming fungal network interface, which is basically just a metal pole rammed into the forest floor with electrodes at the bottom, an FPGA LUT-NN in between, and a LoRa transceiver at the top. Thank you for this paper; it will make pitching the concept a lot easier using this as a reference :*

▲

ag2718

2 days ago

[-]

That is a really cool application of FPGA-based machine learning that I would not have thought of :)

▲

andai

2 days ago

[-]

Explain like I'm mycelially challenged?

▲

scivizlabvienna

49 minutes ago

[-]

https://host-html.com/p/funginet

▲

scivizlabvienna

14 hours ago

[-]

Avatar movie blue monkeys jack into forest matrix.

But the blue monkeys are metal rods with radios, and the forest matrix is made of forest-wide fungal colonies.

▲

RantyDave

3 days ago

[-]

Right. But ... this would limit you to either extremely small models or extremely large FPGAs, yes? If there's a simple machine learning task that requires sub-microsecond latency I can see the point, but otherwise??

▲

ag2718

3 days ago

[-]

Yes, this work is focused on accelerating very small models, typically for real-time systems that require extremely low power or low latency.

One primary application of this work is in high-energy physics (https://home.cern/smarter-decisions-at-the-speed-of-collisio...). Ultrafast and real-time learning is also very applicable for problems in quantum computing, plasma control, etc. (https://arxiv.org/pdf/2602.02005).

▲

laughing_man

3 days ago

[-]

Drone target recognition?

▲

poly2it

3 days ago

[-]

I'm not in HFT, but I assume this is also an interesting applicable domain?

▲

UltraSane

3 days ago

[-]

The author actually works at Jane Street.

▲

ag2718

3 days ago

[-]

Yes, definitely: this type of work is applicable in domains where software run on general-purpose processors cannot meet latency or power requirements.

▲

hansvm

2 days ago

[-]

Yes, but simple models are far more expressive than people give them credit for.

As one example, I've shoved <100 parameter networks into driver code before and hand-tuned them to run in 10-20 nanoseconds. E.g., touchpad hardware tends to suck, especially as it ages, sometimes generating thousands of phantom events per second and causing drift and other such issues. Typically that's solved via careful tuning of hysteresis and other parameters, but the problem is actually very amenable to neural nets. It's easy to collect good-enough data en masse, and you can tune precision vs recall to bias heavily toward dropping more events without any issues (doing so has the effect of slightly slowing down the mouse pointer, which you can compensate for at the OS level where you adjust pointer speed) to achieve 100% reduction of the phantom events.
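The precision-versus-recall bias described here can be sketched as choosing a cutoff on the model's output score. Everything below is illustrative: the scores and labels are made up, and the helper names are hypothetical rather than anything from actual driver code.

```python
def strict_threshold(scores, is_phantom):
    """Lowest cutoff that rejects every labeled phantom event
    (an event is kept only if its score is strictly above the cutoff)."""
    return max(s for s, p in zip(scores, is_phantom) if p)

def kept_fraction(scores, is_phantom, cutoff):
    """Share of real events that survive the cutoff."""
    real = [s for s, p in zip(scores, is_phantom) if not p]
    return sum(s > cutoff for s in real) / len(real)

# Made-up model scores: higher means "more likely a real touch event".
scores = [0.05, 0.20, 0.45, 0.40, 0.80, 0.90, 0.95]
is_phantom = [True, True, True, False, False, False, False]

cutoff = strict_threshold(scores, is_phantom)
kept = kept_fraction(scores, is_phantom, cutoff)
print(cutoff, kept)
```

In this toy data the strict cutoff drops every phantom event at the cost of a quarter of the real ones, which is the kind of loss that the pointer-speed compensation would absorb.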

Lots of image recognition tasks (like spotting undesirable products in industrial settings), image modification tasks (I have some models locally to process hand-drawn images and unwarp them, remove notebook paper lines, etc), audio modification tasks (part of my editing pipeline includes hand-editing audio to achieve some effect, doing that a few times, and training models to copy that edit), and all sorts of other things are similarly doable in much smaller models than you might think -- not as small as that driver code, but still small enough to fit in hobbyist FPGAs.

Not all of those require low latency or high throughput, but audio processing is expensive, so high throughput is nice; industrial applications often operate on fast streams of many products, so both throughput and latency are important; and more generally when you have fast models available (or any fast code really) you'll tend toward different thought patterns and creative ideas which you wouldn't have even considered otherwise and which wouldn't be possible without those faster solutions.

Now that I think about it, we average 1.5M inferences per second at $WORK, expected to scale up 10-30x this year, and we have a moderately tight latency budget. This solution wouldn't fit without a larger, more expensive FPGA, at least not unless KANs are comparatively that much more expressive than our current solution (based on past experimentation, my hunch is that they're not, but you never know), but it's borderline useful.

▲

ag2718

2 days ago

[-]

Some very cool applications of small models! It seems that this scale of models tends to be sufficient when doing simpler classification, anomaly detection, signal processing, etc. as compared to generative modeling (where larger models are usually necessary).

▲

hansvm

2 days ago

[-]

Yep, as a rule of thumb generative models need to be much larger. As a small caveat, that's because of what we're doing with those models; generation itself can also be tiny and fast, but only when the output space is sufficiently constrained. Next-word prediction (in keyboards), speech codecs (TTS, especially for blind people), and a number of other scenarios both admit small models and fall into the domain of what most experts would call "generative."

▲

Cadwhisker

3 days ago

[-]

If you want to experiment with KANs yourself in a non-FPGA environment, there's a GitHub repo here: https://github.com/KindXiaoming/pykan

HN comments page on that is here: https://news.ycombinator.com/item?id=40219205

▲

jeffreysmith

2 days ago

[-]

Super cool work. I love seeing this direction taken all the way to hardware.

I'm a big fan of KANs. They really seem like the start of something big and new. We've got a couple of papers out and in the works on KANs. The most relevant to OP's is this one: https://arxiv.org/abs/2512.15742v2

And we just put up a general primer on KANs on YT: https://youtu.be/wgcSsJ69x1c?si=fiUl1YGTgaTt_bn9 Fun stuff if you want to get into the weeds.

And if you are really interested in KANs, you should really check out Ziming (KAN creator)'s blog: https://kindxiaoming.github.io/blog/

▲

UncleOxidant

2 days ago

[-]

Searching around GitHub, I found that someone has put up a repo with a Julia implementation of the article, including an FPGA implementation of a KAN MNIST classifier in Verilog: https://github.com/philtomson/KAN_LUT

▲

ag2718

2 days ago

[-]

Our end-to-end implementation can be found here! https://github.com/Duchstf/KANELE

▲

bjourne

2 days ago

[-]

Sorry, I haven't had time to read your papers in full yet. Have you considered that LUTs on many FPGAs aren't 2:1 but instead, say, 6:3 and also may contain flip-flops and muxes? FPGA synthesis may not be as easy as "just" translating the activation functions to LUTs.
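A back-of-the-envelope sketch of why the mapping is not one table per physical LUT: a table with more address bits than a physical LUT has inputs must be split across several LUTs. The count below assumes a plain Shannon decomposition into k-input leaf LUTs and ignores the multiplexers that recombine them, flip-flops, and anything a synthesis tool might optimize away; the function name is hypothetical.

```python
def leaf_lut_count(in_bits, out_bits, lut_inputs=6):
    """Leaf k-input LUTs needed for a table with 2**in_bits entries,
    each entry out_bits wide, under a naive Shannon decomposition."""
    if in_bits <= lut_inputs:
        per_output_bit = 1
    else:
        per_output_bit = 1 << (in_bits - lut_inputs)
    return per_output_bit * out_bits

for in_bits in (4, 6, 8, 10):
    print(in_bits, leaf_lut_count(in_bits, out_bits=8))
```

The cost doubles with every address bit beyond the physical LUT width, so input bitwidth matters more than output bitwidth in this simple model.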

▲

tomrod

3 days ago

[-]

Happy to hear that KANs continue to find solid footing.

▲

potato-peeler

3 days ago

[-]

A bit off topic, but I have always wondered how it is decided whose name comes first in a paper. You mentioned that you and Duc Hoang contributed equally, so how did you both decide this? Was it that person's idea first, or were you his roommate and owed him a beer? Coin toss? I never had a traditional college life. Always wondered about all this.

▲

Epa095

3 days ago

[-]

This differs between fields, sometimes even down to the niche subject. In my subfield (of computer science) it was strictly alphabetical by surname, and the idea was that either you contributed or not, and there is no gradient. In other fields it's 'main author' and everyone else, with the expectation that the main author does more. Some have the group leader as the first name, or the 'big shot' is always first. My impression is that in medicine it is often a kind of ranking from 'most to least' main author.

▲

potato-peeler

3 days ago

[-]

> the idea was that either you contributed or not, and there is no gradient

Is the work predefined, i.e., how much each person will do?

> Some have the group leader as the first name, or the 'big shot' is always first. My impression is that in medicine it is often a kind of ranking from 'most to least' main author

Does this affect your standing in the workplace or industry? When we read about papers in scientific journals, the news, or even arXiv, they mostly refer to the work by the first listed name, like “potato-peeler et al”. Sure, it’s for brevity, but when you look at the authors, there may be 10 people listed. I have always wondered how they get recognised, since, like you mentioned, they have to contribute. If they contribute but their names get swallowed up within “et al”, how does someone know how much their contribution was?

▲

Animats

3 days ago

[-]

This guy will be hired by a high-frequency trading firm, and the next time we hear about him, he will have a net worth in 9 figures.

▲

throwaw12

3 days ago

[-]

He is already at Jane Street.

▲

Animats