The path to ubiquitous AI (17k tokens/sec)
238 points
2 hours ago
| 57 comments
| taalas.com
dust42
1 hour ago
[-]
This is not a general purpose chip but specialized for high speed, low latency inference with small context. But it is potentially a lot cheaper than Nvidia for those purposes.

Tech summary:

  - 15k tok/sec on 8B dense 3bit quant (llama 3.1) 
  - limited KV cache
  - 880mm^2 die, TSMC 6nm, 53B transistors
  - presumably 200W per chip
  - 20x cheaper to produce
  - 10x less energy per token for inference
  - max context size: flexible
  - mid-sized thinking model upcoming this spring on same hardware
  - next hardware supposed to be FP4 
  - a frontier LLM planned within twelve months
This is all from their website, I am not affiliated. The founders have 25-year careers across AMD, Nvidia and others, with $200M in VC funding so far.

Certainly interesting for very low latency applications which need < 10k tokens context. If they deliver in spring, they will likely be flooded with VC money.

Not exactly a competitor for Nvidia but probably for 5-10% of the market.

Back of napkin, the cost for 1mm^2 of 6nm wafer is ~$0.20. So 1B parameters need about $20 of die. The larger the die size, the lower the yield. Supposedly the inference speed remains almost the same with larger models.
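
A minimal version of that napkin math (the $/mm² figure and die size are the rough estimates above, not vendor numbers):

  # back-of-napkin die cost; all inputs are the rough figures quoted above
  cost_per_mm2 = 0.20      # ~$0.20 per mm^2 of N6 wafer area (estimate)
  die_area_mm2 = 880       # reported HC1 die size
  params_billions = 8      # Llama 3.1 8B
  die_cost = cost_per_mm2 * die_area_mm2
  print(f"die cost ~${die_cost:.0f}, ~${die_cost / params_billions:.0f} per 1B params")  # ~$176, ~$22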

Interview with the founders: https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

reply
vessenes
1 hour ago
[-]
This math is useful. Lots of folks scoffing in the comments below. I have a couple reactions, after chatting with it:

1) 16k tokens / second is really stunningly fast. There’s an old saying about any factor of 10 being a new science / new product category, etc. This is a new product category in my mind, or it could be. It would be incredibly useful for voice agent applications, realtime loops, realtime video generation, .. etc.

2) https://nvidia.github.io/TensorRT-LLM/blogs/H200launch.html has the H200 doing 12k tokens/second on Llama 2 13B FP8. Knowing these architectures, that's likely a 100+-ish batched run, meaning time to first token is almost certainly slower than Taalas. Probably much slower, since Taalas is like milliseconds.

3) Jensen has these pareto curve graphs — for a certain amount of energy and a certain chip architecture, choose your point on the curve to trade off throughput vs latency. My quick math is that these probably do not shift the curve. The 6nm process vs 4nm process is likely 30-40% bigger, draws that much more power, etc; if we look at the numbers they give and extrapolate to an fp8 model (slower), smaller geometry (30% faster and lower power) and compare 16k tokens/second for taalas to 12k tokens/s for an h200, these chips are in the same ballpark curve.

However, I don’t think the H200 can reach into this part of the curve, and that does make these somewhat interesting. In fact even if you had a full datacenter of H200s already running your model, you’d probably buy a bunch of these to do speculative decoding - it’s an amazing use case for them; speculative decoding relies on smaller distillations or quants to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model.

Upshot - I think these will sell, even on 6nm process, and the first thing I’d sell them to do is speculative decoding for bread and butter frontier models. The thing that I’m really very skeptical of is the 2 month turnaround. To get leading edge geometry turned around on arbitrary 2 month schedules is .. ambitious. Hopeful. We could use other words as well.

I hope these guys make it! I bet the v3 of these chips will be serving some bread and butter API requests, which will be awesome.

reply
Gareth321
16 minutes ago
[-]
I think the next major innovation is going to be intelligent model routing. I've been exploring OpenClaw and OpenRouter, and there is a real lack of options to select the best model for the job and execute. The providers are trying to do that with their own models, but none of them offer everything to everyone at all times. I see a future with increasingly niche models being offered for all kinds of novel use cases. We need a way to fluidly apply the right model for the job.
reply
btown
31 minutes ago
[-]
For speculative decoding, wouldn’t this be of limited use for frontier models that don’t have the same tokenizer as Llama 3.1? Or would it be so good that retokenization/bridging would be worth it?
reply
joha4270
51 minutes ago
[-]
The guts of an LLM aren't something I'm well versed in, but

> to get the first N tokens sorted, only when the big model and small model diverge do you infer on the big model

suggests there is something I'm unaware of. If you compare the small and big model, don't you have to wait for the big model anyway and then what's the point? I assume I'm missing some detail here, but what?

reply
connorbrinton
22 minutes ago
[-]
Speculative decoding takes advantage of the fact that it's faster to validate that a big model would have produced a particular sequence of tokens than to generate that sequence of tokens from scratch, because validation can take more advantage of parallel processing. So the process is generate with small model -> validate with big model -> then generate with big model only if validation fails

More info:

* https://research.google/blog/looking-back-at-speculative-dec...

* https://pytorch.org/blog/hitchhikers-guide-speculative-decod...
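
A minimal sketch of that accept/verify loop (draft() and verify() are hypothetical stand-ins, not any particular library's API):

  # Speculative decoding, schematically: the small model proposes, the big model
  # checks the whole proposal in one batched pass and corrects at the first divergence.
  def speculative_decode(prompt_tokens, draft, verify, k=8, max_tokens=256):
      out = list(prompt_tokens)
      while len(out) < max_tokens:
          guess = draft(out, n=k)               # k cheap draft tokens
          accepted, fixup = verify(out, guess)  # big model: agreed prefix + its own next token
          out.extend(accepted)
          if len(accepted) < len(guess):        # divergence: take the big model's correction
              out.append(fixup)
      return out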

reply
sails
3 minutes ago
[-]
See also speculative cascades which is a nice read and furthered my understanding of how it all works

https://research.google/blog/speculative-cascades-a-hybrid-a...

reply
speedping
24 minutes ago
[-]
Verification is faster than generation, one forward pass for verification of multiple tokens vs a pass for every new token in generation
reply
vanviegen
27 minutes ago
[-]
I don't understand how it would work either, but it may be something similar to this: https://developers.openai.com/api/docs/guides/predicted-outp...
reply
ml_basics
14 minutes ago
[-]
They are referring to a thing called "speculative decoding" I think.
reply
cma
28 minutes ago
[-]
When you predict with the small model, the big model can verify as more of a batch and be more similar in speed to processing input tokens, if the predictions are good and it doesn't have to be redone.
reply
elternal_love
1 hour ago
[-]
Here we go towards really smart robots. It is interesting what kinds of different model chips they can produce.
reply
varispeed
1 hour ago
[-]
There is nothing smart about current LLMs. They just regurgitate text compressed in their memory based on probability. None of the LLMs currently have actual understanding of what you ask them to do and what they respond with.
reply
bsenftner
27 minutes ago
[-]
We know that, but that does not make them useless. The opposite, in fact: they are extremely useful in the hands of non-idiots. We just happen to have an oversupply of idiots at the moment, which AI is here to eradicate. /Sort of satire.
reply
small_model
1 hour ago
[-]
That's not how they work. Pro tip: maybe don't comment until you have a good understanding?
reply
fyltr
48 minutes ago
[-]
Would you mind rectifying the wrong parts then?
reply
retsibsi
23 minutes ago
[-]
Phrases like "actual understanding", "true intelligence" etc. are not conducive to productive discussion unless you take the trouble to define what you mean by them (which ~nobody ever does). They're highly ambiguous and it's never clear what specific claims they do or don't imply when used by any given person.

But I think this specific claim is clearly wrong, if taken at face value:

> They just regurgitate text compressed in their memory

They're clearly capable of producing novel utterances, so they can't just be doing that. (Unless we're dealing with a very loose definition of "regurgitate", in which case it's probably best to use a different word if we want to understand each other.)

reply
mhl47
22 minutes ago
[-]
The fact that the outputs are probabilities is not important. What is important is how that output is computed.

You could imagine that it is possible to learn certain algorithms/heuristics that "intelligence" is comprised of, no matter what you output. Training for optimal compression of tasks / taking actions could lead to intelligence being the best solution.

This is far from a formal argument, but so is the stubborn reiteration of "it's just probabilities" or "it's just compression". Because this "just" thing is getting more and more capable of solving tasks that are surely not in the training data exactly like this.

reply
100721
39 minutes ago
[-]
Huh? Their words are an accurate, if simplified, description of how they work.
reply
beyondCritics
30 minutes ago
[-]
Just HI slop. Ask any decent model, it can explain what's wrong with this description.
reply
bsenftner
29 minutes ago
[-]
Do not overlook traditional irrational investor exuberance, we've got an abundance of that right now. With the right PR maneuvers these guys could be a tulip craze.
reply
zozbot234
1 hour ago
[-]
Low-latency inference is a huge waste of power; if you're going to the trouble of making an ASIC, it should be for dog-slow but very high throughput inference. Undervolt the devices as much as possible and use sub-threshold modes, multiple Vt and body biasing extensively to save further power and minimize leakage losses, but also keep working in fine-grained nodes to reduce areas and distances. The sensible goal is to expend the least possible energy per operation, even at increased latency.
reply
dust42
1 hour ago
[-]
Low latency inference is very useful in voice-to-voice applications. You say it is a waste of power but at least their claim is that it is 10x more efficient. We'll see but if it works out it will definitely find its applications.
reply
zozbot234
52 minutes ago
[-]
This is not voice-to-voice though, end-to-end voice chat models (the Her UX) are completely different.
reply
dust42
42 minutes ago
[-]
I haven't found any end-to-end voice chat models useful. I had much better results with a separate STT-LLM-TTS pipeline. One big problem is turn detection, and having inference with 150-200ms latency would allow a whole new level of quality. I would just use it with a prompt: "You think the user is finished talking?" and then push it to a larger model. The AI should reply within the ballpark of 600ms-1000ms. Faster is often irritating; slower will make the user start talking again.
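
A rough sketch of that pipeline shape (fast_llm() and big_llm() are placeholder calls; the latency budgets are the ones described above):

  # STT feeds partial transcripts in; a fast small model acts as the turn detector,
  # and only completed turns get pushed to the larger model.
  def on_partial_transcript(transcript, fast_llm, big_llm, speak):
      verdict = fast_llm(                      # target ~150-200 ms for this call
          "You think the user is finished talking? Answer yes or no.\n" + transcript)
      if verdict.strip().lower().startswith("yes"):
          speak(big_llm(transcript))           # aim to reply within ~600-1000 ms overall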
reply
aurareturn
1 hour ago
[-]
Don’t forget that the 8B model requires 10 of said chips to run.

And it’s a 3-bit quant, so a 3GB RAM requirement.

If they run 8B using native 16bit quant, it will use 60 H100 sized chips.

reply
dust42
1 hour ago
[-]
> Don’t forget that the 8B model requires 10 of said chips to run.

Are you sure about that? If true it would definitely make it look a lot less interesting.

reply
aurareturn
1 hour ago
[-]
Their 2.4 kW is for 10 chips it seems based on the next platform article.

I assume they need all 10 chips for their 8B q3 model. Otherwise, they would have said so or they would have put a more impressive model as the demo.

https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

reply
audunw
1 hour ago
[-]
It doesn’t make any sense to think you need the whole server to run one model. It’s much more likely that each server runs 10 instances of the model

1. It doesn’t make sense in terms of architecture. It’s one chip. You can’t split one model over 10 identical hardwired chips.

2. It doesn’t add up with their claims of better power efficiency. 2.4kW for one model would be really bad.

reply
aurareturn
25 minutes ago
[-]
We are both wrong.

First, it is likely one chip for llama 8B q3 with 1k context size. This could fit into around 3GB of SRAM which is about the theoretical maximum for TSMC N6 reticle limit.

Second, their plan is to etch larger models across multiple connected chips. It’s physically impossible to run bigger models otherwise since 3GB SRAM is about the max you can have on an 850mm2 chip.

  followed by a frontier-class large language model running inference across a collection of HC cards by year-end under its HC2 architecture
https://mlq.ai/news/taalas-secures-169m-funding-to-develop-a...
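
The weight-size arithmetic behind the fits-on-one-chip estimate above (rough, ignoring embeddings and overhead):

  # Llama 3.1 8B at 3-bit weights
  params = 8e9
  bits_per_weight = 3
  print(f"weights ~= {params * bits_per_weight / 8 / 1e9:.1f} GB")  # ~3.0 GB, near the claimed SRAM ceiling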
reply
moralestapia
57 minutes ago
[-]
Thanks for having a brain.

Not sure who started that "split into 10 chips" claim, it's just dumb.

This is Llama 3.1 8B hardcoded (literally) onto one chip. That's what the startup is about, they emphasize this multiple times.

reply
aurareturn
19 minutes ago
[-]
It’s just dumb to think that one chip per model is their plan. They stated that their plan is to chain multiple chips together.

I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with around 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.

reply
soleveloper
32 minutes ago
[-]
At $20 a die, they could sell Game Boy-style cartridges for different models.
reply
oliwary
1 hour ago
[-]
This is insane if true - could be super useful for data extraction tasks. Sounds like we could be talking in the cents per millions of tokens range.
reply
baalimago
21 minutes ago
[-]
I've never gotten incorrect answers faster than this, wow!

Jokes aside, it's very promising. For sure a lucrative market down the line, but definitely not for a model of size 8B. I think the lower bound for useful intelligence is around 80B params (but what do I know). Best of luck!

reply
freakynit
46 minutes ago
[-]
Holy cow, their chat app demo!!! For the first time I thought I had mistakenly pasted the answer. It was literally there in the blink of an eye!!

https://chatjimmy.ai/

reply
Etheryte
26 minutes ago
[-]
It is incredibly fast, on that I agree, but even simple queries I tried got very inaccurate answers. Which makes sense, it's essentially a trade off of how much time you give it to "think", but if it's fast to the point where it has no accuracy, I'm not sure I see the appeal.
reply
kaashif
25 minutes ago
[-]
If it's incredibly fast at a 2022 state of the art level of accuracy, then surely it's only a matter of time until it's incredibly fast at a 2026 level of accuracy.
reply
Gud
17 minutes ago
[-]
Why do you assume this?

I can produce total gibberish even faster; it doesn’t mean I’d produce Einstein-level thought if I slowed down

reply
PrimaryExplorer
19 minutes ago
[-]
yeah this is mindblowing speed. imagine this with opus 4.6 or gpt 5.2. probably coming soon
reply
gwd
33 minutes ago
[-]
I dunno, it pretty quickly got stuck; the "attach file" didn't seem to work, and when I asked "can you see the attachment" it replied to my first message rather than my question.
reply
scosman
16 minutes ago
[-]
It’s llama 3.1 8B. No vision, not smart. It’s just a technical demo.
reply
freakynit
29 minutes ago
[-]
Hmm.. I had tried a simple chat conversation without file attachments.
reply
amelius
30 minutes ago
[-]
OK investors, time to pull out of OpenAI and move all your money to ChatJimmy.
reply
freakynit
27 minutes ago
[-]
A related argument I raised a few days back on HN:

What's the moat with these giant data centers that are being built with hundreds of billions of dollars of Nvidia chips?

If such chips can be built so easily, and offer this insane level of performance at 10x efficiency, then one thing is 100% sure: more such startups are coming... and with that, an entire new ecosystem.

reply
codebje
11 minutes ago
[-]
RAM hoarding is, AFAICT, the moat.
reply
freakynit
1 minute ago
[-]
lol... true that for now though
reply
zwaps
36 minutes ago
[-]
I got 16,000 tokens per second ahaha
reply
bsenftner
34 minutes ago
[-]
I get nothing, no replies to anything.
reply
freakynit
30 minutes ago
[-]
Maybe hn and reddit crowd have overloaded them lol
reply
elliotbnvl
43 minutes ago
[-]
That… what…
reply
boutell
14 minutes ago
[-]
The speed is ridiunkulous. No doubt.

The quantization looks pretty severe, which could make the comparison chart misleading. But I tried a trick question suggested by Claude and got nearly identical results in regular ollama and with the chatbot. And quantization to 3 or 4 bits still would not get you that HOLY CRAP WTF speed on other hardware!

This is a very impressive proof of concept. If they can deliver that medium-sized model they're talking about... if they can mass produce these... I notice you can't order one, so far.

reply
Normal_gaussian
3 minutes ago
[-]
I doubt many of us will be able to order one for a long while. There is a significant number of existing datacentre and enterprise use-cases that will pay a premium for this.

Additionally LLMs have been tested, found valuable in benchmarks, but not used for a large number of domains due to speed and cost limitations. These spaces will eat up these chips very quickly.

reply
jjcm
41 minutes ago
[-]
A lot of naysayers in the comments, but there are so many uses for non-frontier models. The proof of this is in the openrouter activity graph for llama 3.1: https://openrouter.ai/meta-llama/llama-3.1-8b-instruct/activ...

10b daily tokens growing at an average of 22% every week.

There are plenty of times I look to Groq for narrow domain responses - these smaller models are fantastic for that and there's often no need for something heavier. Getting the latency of responses down means you can use LLM-assisted processing in a standard webpage load, not just for async processes. I'm really impressed by this, especially if this is its first showing.

reply
freakynit
30 minutes ago
[-]
Exactly. One easily relatable use-case is structured content extraction and/or conversion to Markdown for web page data. I used to use Groq for the same (the gpt-oss-20b model), but even that felt slow when doing this task at scale.

LLMs have opened up a natural language interface to machines. This chip makes it realtime. And that opens a lot of use-cases.

reply
aurareturn
1 hour ago
[-]
Edit: it seems like this is likely one chip and not 10. I assumed an 8B 16bit quant with 4K or more context. That made me think they must have chained multiple chips together, since an N6 850mm2 chip would only yield 3GB of SRAM max. Instead, they seem to have etched llama 8B q3 with 1k context, which would indeed fit the chip size.

This requires 10 chips for an 8 billion q3 param model. 2.4kW.

10 reticle sized chips on TSMC N6. Basically 10x Nvidia H100 GPUs.

Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Interesting design for niche applications.

What is a task that is extremely high value, only requires small-model intelligence, requires tremendous speed, is OK to run in the cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

reply
machiaweliczny
4 minutes ago
[-]
A lot of NLP tasks could benefit from this
reply
pjc50
1 hour ago
[-]
Where are those numbers from? It's not immediately clear to me that you can distribute one model across chips with this design.

> Model is etched onto the silicon chip. So can’t change anything about the model after the chip has been designed and manufactured.

Subtle detail here: the fastest turnaround that one could reasonably expect on that process is about six months. This might eventually be useful, but at the moment it seems like the model churn is huge and people insist you use this week's model for best results.

reply
aurareturn
1 hour ago
[-]

  > The first generation HC1 chip is implemented in the 6 nanometer N6 process from TSMC. Each HC1 chip has 53 billion transistors on the package, most of it very likely for ROM and SRAM memory. The HC1 card burns about 200 watts, says Bajic, and a two-socket X86 server with ten HC1 cards in it runs 2,500 watts.
https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
reply
darkwater
32 minutes ago
[-]
And what of that makes you assume that having a server with 10 HC1 cards is needed to run a single model version on that server?
reply
dakolli
1 hour ago
[-]
So it lights money on fire extra fast, AI focused VCs are going to really love it then!!
reply
adityashankar
1 hour ago
[-]
This depends on how much better the models get from now on. If Claude Opus 4.6 were transformed into one of these chips and ran at a hypothetical 17k tokens/second, I'm sure that would be astounding; it depends on how much better Claude Opus 5 is compared to the current generation.
reply
aurareturn
1 hour ago
[-]
I’m pretty sure they’d need a small data center to run a model the size of Opus.
reply
danpalmer
1 hour ago
[-]
Alternatively, you could run far more RAG and thinking to integrate recent knowledge, I would imagine models designed for this putting less emphasis on world knowledge and more on agentic search.
reply
freeone3000
57 minutes ago
[-]
Maybe; models with more embedded associations are also better at search. (Intuitively, this tracks; a model with no world knowledge has no awareness of synonyms or relations (a pure markov model), so the more knowledge a model has, the better it can search.) It’s not clear if it’s possible to build such a model, since there doesn’t seem to be a scaling cliff.
reply
Shaanveer
1 hour ago
[-]
ceo
reply
charcircuit
1 hour ago
[-]
No one would ever give such a weak model that much power over a company.
reply
teaearlgraycold
1 hour ago
[-]
I'm thinking the best end result would come from custom-built models. An 8 billion parameter generalized model will run really quickly while not being particularly good at anything. But the same parameter count dedicated to parsing emails, RAG summarization, or some other specialized task could be more than good enough while also running at crazy speeds.
reply
thrance
1 hour ago
[-]
> What is a task that is extremely high value, only require a small model intelligence, require tremendous speed, is ok to run on a cloud due to power requirements, AND will be used for years without change since the model is etched into silicon?

Video game NPCs?

reply
aurareturn
1 hour ago
[-]
Doesn’t pass the high value and require tremendous speed tests.
reply
ThePhysicist
5 minutes ago
[-]
This is really cool! I am trying to find a way to accelerate LLM inference for PII detection purposes, where speed is really necessary as we want to process millions of log lines per minute. I am wondering how fast we could get e.g. Llama 3.1 to run on a conventional NVIDIA card? 10k tokens per second would be fantastic, but even at 1k this would be very useful.
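
Rough sizing for that workload (line rate and tokens-per-line are assumptions, adjust to your logs):

  # What does "millions of log lines per minute" imply in tokens/sec?
  lines_per_minute = 1_000_000   # assumption
  tokens_per_line = 60           # assumption: instruction + log line + short verdict
  print(f"~{lines_per_minute * tokens_per_line / 60 / 1e6:.1f}M tok/s needed")  # ~1M tok/s across the fleet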
reply
andai
13 minutes ago
[-]
>Founded 2.5 years ago, Taalas developed a platform for transforming any AI model into custom silicon. From the moment a previously unseen model is received, it can be realized in hardware in only two months.

So this is very cool. Though I'm not sure how the economics work out? 2 months is a long time in the model space. Although for many tasks, the models are now "good enough", especially when you put them in a "keep trying until it works" loop and run them at high inference speed.

Seems like a chip would only be good for a few months though, they'd have to be upgrading them on a regular basis.

Unless model growth plateaus, or we exceed "good enough" for the relevant tasks, or both. The latter part seems quite likely, at least for certain types of work.

On that note I've shifted my focus from "best model" to "fastest/cheapest model that can do the job". For example testing Gemini Flash against Gemini Pro for simple tasks, they both complete the task fine, but Flash does it 3x cheaper and 3x faster. (Also had good results with Grok Fast in that category of bite-sized "realtime" workflows.)

reply
trentnix
1 hour ago
[-]
The speed of the chatbot's response is startling when you're used to the simulated fast typing of ChatGPT and others. But the Llama 3.1 8B model Taalas uses predictably results in incorrect answers, hallucinations, poor reliability as a chatbot.

What type of latency-sensitive applications are appropriate for a small-model, high-throughput solution like this? I presume this type of specialization is necessary for robotics, drones, or industrial automation. What else?

reply
freakynit
38 minutes ago
[-]
You could build realtime API routing and orchestration systems that rely on high quality language understanding but need near-instant responses. Examples:

1. Intent-based API gateways: convert natural language queries into structured API calls in real time (e.g., "cancel my last order and refund it to the original payment method" -> authentication, order lookup, cancellation, refund API chain; see the sketch after this list).

2. Of course, realtime voice chat.. kinda like you see in movies.

3. Security and fraud triage systems: parse logs without hardcoded regexes and issue alerts and full user reports in real time and decide which automated workflows to trigger.

4. Highly interactive what-if scenarios powered by natural language queries.

This effectively gives you database level speeds on top of natural language understanding.
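
A minimal sketch of the intent-gateway idea in (1), with classify() as a hypothetical stand-in for a small fast model that returns JSON:

  import json

  # Natural-language request -> structured call plan -> dispatch to real API handlers.
  def route(request_text, classify, apis):
      plan = json.loads(classify(
          'Return JSON: {"intent": "...", "steps": ["api names in order"]}\n' + request_text))
      for step in plan["steps"]:   # e.g. ["auth", "lookup_order", "cancel", "refund"]
          apis[step](plan)
      return plan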

reply
zardo
21 minutes ago
[-]
I'm wondering how much the output quality of a small model could be boosted by taking multiple goes at it. Generate 20 answers and feed them back through with a "rank these responses" prompt. Or doing something like MCTS.
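
A minimal best-of-n sketch of that idea (generate() is a placeholder for the fast model call; the parsing at the end is optimistic):

  # Sample n candidates, then ask the same fast model to pick the best one by index.
  def best_of_n(prompt, generate, n=20):
      candidates = [generate(prompt) for _ in range(n)]
      ballot = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
      pick = generate("Rank these responses to the prompt below and reply with only the "
                      f"number of the best one.\nPrompt: {prompt}\n\n{ballot}")
      return candidates[int(pick.strip().split()[0].strip("[]."))]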
reply
app13
1 hour ago
[-]
Routing in agent pipelines is another use. "Does user prompt A make sense with document type A?" If yes, continue, if no, escalate. That sort of thing
reply
freeone3000
55 minutes ago
[-]
Maybe summarization? I’d still worry about accuracy but smaller models do quite well.
reply
soleveloper
7 minutes ago
[-]
There are so many use cases for small and super fast models that already fit within this size capacity -

* Many top quality tts and stt models

* Image recognition, object tracking

* speculative decoding, attached to a much bigger model (big/small architecture?)

* agentic loop trying 20 different approaches / algorithms, and then picking the best one

reply
metabrew
1 hour ago
[-]
I tried the chatbot. Jarring to see a large response come back instantly at over 15k tok/sec.

I'll take one with a frontier model please, for my local coding and home ai needs..

reply
stabbles
1 hour ago
[-]
Reminds me of that solution to Fermi's paradox, that we don't detect signals from extraterrestrial civilizations because they run on a different clock speed.
reply
dintech
1 hour ago
[-]
Iain M Banks’ The Algebraist does a great job of covering that territory. If an organism had a lifespan of millions of years, they might perceive time and communication differently to say a house fly or us.
reply
xyzsparetimexyz
1 hour ago
[-]
:eyeroll:
reply
grzracz
1 hour ago
[-]
Absolute insanity to see a coherent text block that takes at least 2 minutes to read generated in a fraction of a second. Crazy stuff...
reply
VMG
35 minutes ago
[-]
Not at all if you consider the internet pre-LLM. That is the standard expectation when you load a website.

The slow word-by-word typing was what we started to get used to with LLMs.

If these techniques get widespread, we may grow accustomed to the "old" speed again where content loads ~instantly.

Imagine a content forest like Wikipedia instantly generated like a Minecraft world...

reply
pjc50
1 hour ago
[-]
Accelerating the end of the usable text-based internet one chip at a time.
reply
kleiba
1 hour ago
[-]
Yes, but the quality of the output leaves something to be desired. I just asked about some sports history and got a mix of correct information and totally made up nonsense. Not unexpected for an 8B model, but it raises the question of what the use case is for such small models.
reply
kgeist
23 minutes ago
[-]
8b models are great at converting unstructured data to a structured format. Say, you want to transcribe all your customer calls and get a list of issues they discussed most often. Currently with the larger models it takes me hours.

A chatbot which tells you various fun facts is not the only use case for LLMs. They're language models first and foremost, so they're good at language processing tasks (where they don't "hallucinate" as much).

Their ability to memorize various facts (with some "hallucinations") is an interesting side effect which is now abused to make them into "AI agents" and what not but they're just general-purpose language processing machines at their core.

reply
djb_hackernews
1 hour ago
[-]
You have a misunderstanding of what LLMs are good at.
reply
cap11235
1 hour ago
[-]
Poster wants it to play Jeopardy, not process text.
reply
kleiba
49 minutes ago
[-]
Care to enlighten me?
reply
vntok
32 minutes ago
[-]
Don't ask a small LLM about precise factual minutiae.

Alternatively, ask yourself how plausible it sounds that all the facts in the world could be compressed into 8B parameters while remaining intact and fine-grained. If your answer is that it sounds pretty impossible... well it is.

reply
paganel
1 hour ago
[-]
Not sure if you're correct, as the market is betting trillions of dollars on these LLMs, hoping that they'll be close to what the OP had expected to happen in this case.
reply
IshKebab
56 minutes ago
[-]
I don't think he does. Larger models are definitely better at not hallucinating. Enough that they are good at answering questions on popular topics.

Smaller models, not so much.

reply
gchadwick
24 minutes ago
[-]
This is an interesting piece of hardware, though when they go multi-chip for larger models the speed will no doubt suffer.

They'll also be severely limited on context length as it needs to sit in SRAM. Looks like the current one tops out at 6144 tokens, which I presume is a whole chip's worth. You'd also have to dedicate a chip to a whole user as there's likely only enough SRAM for one user's worth of context. I wonder how much time it takes them to swap users in/out? I wouldn't be surprised if this chip is severely underutilized (can't use it all when running decode, as you have to run token by token with one user and then have idle time as you swap users in/out).

Maybe a more realistic deployment would have chips for linear layers and chips for attention? You could batch users through the shared weight chips and then provision more or less attention chips as you want which would be per user (or shared amongst a small group 2-4 users).
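
Back of the envelope on the context-in-SRAM point, assuming Llama 3.1 8B's usual GQA layout (32 layers, 8 KV heads, head dim 128) and an fp16 cache:

  # KV-cache footprint per token and at the reported 6144-token ceiling
  layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 2
  kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val   # K and V
  print(kv_per_token // 1024, "KB per token")                       # ~128 KB
  print(round(6144 * kv_per_token / 1e6), "MB at 6144 tokens")      # ~800 MB for one user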

reply
pelasaco
2 minutes ago
[-]
Is this already ready to sell, or is it the new "ASIC miner": pay now and get it later? Sorry to be that skeptical, but AI is the new "crypto coin" and the crypto bros are still around...
reply
saivishwak
12 minutes ago
[-]
But as models are changing rapidly and new architectures are coming up, how do they scale? We also don't yet know whether the current transformer architecture will scale much beyond where it already is. So many open questions, but VCs seem to be pouring in money.
reply
33a
16 minutes ago
[-]
If they made a low power/mobile version, this could be really huge for embedded electronics. Mass produced, highly efficient, "good enough" but still sort of dumb AIs could put intelligence in household devices like toasters, light switches, and toilets. Truly we could be entering the golden age of curses.
reply
grzracz
1 hour ago
[-]
This would be killer for exploring simultaneous thinking paths and council-style decision taking. Even with Qwen3-Coder-Next 80B if you could achieve a 10x speed, I'd buy one of those today. Can't wait to see if this is still possible with larger models than 8B.
reply
aurareturn
1 hour ago
[-]
It uses 10 chips for 8B model. It’d need 80 chips for an 80b model.

Each chip is the size of an H100.

So 80 H100 to run at this speed. Can’t change the model after you manufacture the chips since it’s etched into silicon.

reply
9cb14c1ec0
37 minutes ago
[-]
As many others in this conversation have asked, can we have some sources on the idea that the model is spread across chips? You keep making the claim, but no one (myself included) else has any idea where that information comes from or if it is correct.
reply
aurareturn
18 minutes ago
[-]
I was indeed wrong about 10 chips. I thought they would use llama 8B 16bit and a few thousand context size. It turns out, they used llama 8B 3bit with only 1k context size. That made me assume they must have chained multiple chips together since the max SRAM on TSMC n6 for reticle sized chip is only around 3GB.
reply
grzracz
1 hour ago
[-]
I'm sure there is plenty of optimization paths left for them if they're a startup. And imho smaller models will keep getting better. And a great business model for people having to buy your chips for each new LLM release :)
reply
aurareturn
1 hour ago
[-]
One more thing. It seems like this is a Q3 quant. So only 3GB RAM requirement.

10 H100 chips for 3GB model.

I think it’s a niche of a niche at this point.

I’m not sure what optimization they can do since a transistor is a transistor.

reply
ubercore
1 hour ago
[-]
Do we know that it needs 10 chips to run the model? Or are the servers for the API and chatbot just specced with 10 boards to distribute user load?
reply
FieryTransition
1 hour ago
[-]
If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is set by the transistor density of whatever modern process they use. This gives you a lower bound for the die size.
reply
aetherspawn
1 hour ago
[-]
This is what’s gonna be in the brain of the robot that ends the world.

The sheer speed of how fast this thing can “think” is insanity.

reply
FieryTransition
1 hour ago
[-]
If it's not reprogrammable, it's just expensive glass.

If you etch the bits into silicon, you then have to accommodate the bits by physical area, which is set by the transistor density of whatever modern process they use. This gives you a lower bound for the die size.

This can mean huge dies for a fixed model that is already old by the time it is finalized.

Etching generic functions used in ML and common fused kernels would seem much more viable as they could be used as building blocks.

reply
audunw
15 minutes ago
[-]
Models don’t get old as fast as they used to. A lot of the improvements seem to go into making the models more efficient, or the infrastructure around the models. If newer models mainly compete on efficiency it means you can run older models for longer on more efficient hardware while staying competitive.

If power costs are significantly lower, they can pay for themselves by the time they are outdated. It also means you can run more instances of a model in one datacenter, and that seems to be a big challenge these days: simply building enough data centres and getting power to them. (See the ridiculous plans for building data centres in space.)

A huge part of the cost with making chips is the masks. The transistor masks are expensive. Metal masks less so.

I figure they will eventually freeze the transistor layer and use metal masks to reconfigure the chips when the new models come out. That should further lower costs.

I don’t really know if this makes sense. Depends on whether we get new breakthroughs in LLM architecture or not. It’s a gamble essentially. But honestly, so is buying Nvidia Blackwell chips for inference. I could see them getting uneconomical very quickly if any of the alternative inference-optimised hardware pans out.

reply
MagicMoonlight
24 minutes ago
[-]
You don’t need it to be reprogrammable if it can use tools and RAG.
reply
est31
1 hour ago
[-]
I wonder if this makes the frontier labs abandon the SAAS per-token pricing concept for their newest models, and we'll be seeing non-open-but-on-chip-only models instead, sold by the chip and not by the token.

It could give a boost to the industry of electron microscopy analysis as the frontier model creators could be interested in extracting the weights of their competitors.

The high speed of model evolution has interesting consequences on how often batches and masks are cycled. Probably we'll see some pressure on chip manufacturers to create masks more quickly, which can lead to faster hardware cycles. Probably with some compromises, i.e. all of the util stuff around the chip would be static, only the weights part would change. They might in fact pre-make masks that only have the weights missing, for even faster iteration speed.

reply
mips_avatar
1 hour ago
[-]
I think the thing that makes 8b sized models interesting is the ability to train unique custom domain knowledge intelligence and this is the opposite of that. Like if you could deploy any 8b sized model on it and be this fast that would be super interesting, but being stuck with llama3 8b isn't that interesting.
reply
ACCount37
1 hour ago
[-]
The "small model with unique custom domain knowledge" approach has a very low capability ceiling.

Model intelligence is, in many ways, a function of model size. A small model well fit for a given domain is still crippled by being small.

Some things don't benefit from general intelligence much. Sometimes a dumb narrow specialist really is all you need for your tasks. But building that small specialized model isn't easy or cheap.

Engineering isn't free, models tend to grow obsolete as the price/capability frontier advances, and AI specialists are less of a commodity than AI inference is. I'm inclined to bet against approaches like this on a principle.

reply
8cvor6j844qw_d6
32 minutes ago
[-]
Amazing speed. Imagine if it's standardised like the GPU card equivalent in the future.

New models come out, time to upgrade your AI card, etc.

reply
japoneris
57 minutes ago
[-]
I am super happy to see people working on hardware for local LLMs. Yet, isn't it premature? The space is still evolving. Today, I refuse to buy a GPU because I do not know what the best model will be tomorrow. Waiting to get an off-the-shelf device to run an Opus-like model.
reply
shevy-java
54 minutes ago
[-]
"Many believe AI is the real deal. In narrow domains, it already surpasses human performance. Used well, it is an unprecedented amplifier of human ingenuity and productivity."

Sounds like people drinking the Kool-Aid now.

I don't reject that AI has use cases. But I do reject that it is promoted as "unprecedented amplifier" of human xyz anything. These folks would even claim how AI improves human creativity. Well, has this been the case?

reply
rrr_oh_man
53 minutes ago
[-]
> These folks would even claim how AI improves human creativity. Well, has this been the case?

Yes. Example: If you've never programmed in language X, but want to build something in it, you can focus on getting from 0 to 1 instead of being bogged down in the idiosyncrasies of said language.

reply
faeyanpiraat
53 minutes ago
[-]
For me, this is entirely true.

I'm progressing with my side projects like I've never before.

reply
small_model
37 minutes ago
[-]
Same, I would have given up on them long ago, I no longer code at all now. Why would I when the latest models can do it better, faster and without the human limitations of tiredness, emotional impacts etc.
reply
Mizza
1 hour ago
[-]
This is pretty wild! Only Llama3.1-8B, but this is only their first release so you can assume they're working on larger versions.

So what's the use case for an extremely fast small model? Structuring vast amounts of unstructured data, maybe? Put it in a little service droid so it doesn't need the cloud?

reply
btbuildem
41 minutes ago
[-]
This is impressive. If you can scale it to larger models, and somehow make the ROM writeable, wow, you win the game.
reply
GaggiX
9 minutes ago
[-]
For fun I'm imagining a future where you would be able to buy an ASIC with a hard-wired 1B LLM in it for cents, and it could be used everywhere.
reply
dakolli
1 hour ago
[-]
try here, I hate llms but this is crazy fast. https://chatjimmy.ai/
reply
bmacho
1 hour ago
[-]

  "447 / 6144 tokens"
  "Generated in 0.026s • 15,718 tok/s"
This is crazy fast. I always figured this kind of speed was ~2 years in the future, but it's here, now.
reply
Lalabadie
1 hour ago
[-]
The full answer pops in milliseconds, it's impressive and feels like a completely different technology just by foregoing the need to stream the output.
reply
FergusArgyll
1 hour ago
[-]
Because most models today generate slowish, they give the impression of someone typing on the other end. This is just <enter> -> wall of text. Wild
reply
stuxf
1 hour ago
[-]
I totally buy the thesis on specialization here, I think it makes total sense.

Aside from the obvious concern that this is a tiny 8B model, I'm also a bit skeptical of the power draw. 2.4 kW feels a little high, but someone should do the napkin math comparing the throughput-to-power ratio against the H200 and other chips.

reply
dagi3d
1 hour ago
[-]
wonder if at some point you could swap the model as if you were replacing a cpu in your pc or inserting a game cartridge
reply
hbbio
1 hour ago
[-]
Strange that they apparently raised $169M (really?) and the website looks like this. Don't get me wrong: plain HTML would do if it were done well, or you would expect something heavily designed. But script-kiddie vibe-coded seems off.

The idea is good though and could work.

reply
ACCount37
1 hour ago
[-]
Strange that they raised money at all with an idea like this.

It's a bad idea that can't work well. Not while the field is advancing the way it is.

Manufacturing silicon is a long pipeline - and in the world of AI, one year of capability gap isn't something you can afford. You build a SOTA model into your chips, and by the time you get those chips, it's outperformed at its tasks by open weights models half its size.

Now, if AI advances somehow ground to a screeching halt, with model upgrades coming out every 4 years, not every 4 months? Maybe it'll be viable. As is, it's a waste of silicon.

reply
xav_authentique
7 minutes ago
[-]
Maybe they're betting on model improvement plateauing, and that having a fairly stabilized, capable model that is orders of magnitude faster than running on GPUs will be valuable in the future?
reply
small_model
1 hour ago
[-]
Poverty of imagination here; there are plenty of uses for this, and it's a prototype at this stage.
reply
ACCount37
52 minutes ago
[-]
What uses, exactly?

The prototype is: silicon with a Llama 3.1 8B etched into it. Today's 4B models already outperform it.

Token rate in five digits is a major technical flex, but, does anyone really need to run a very dumb model at this speed?

The only things that come to mind that could reap a benefit are: asymmetric exotics like VLA action policies and voice stages for V2V models. Both of which are "small fast low latency model backed by a large smart model", and both depend on model to model comms, which this doesn't demonstrate.

In a way, it's an I/O accelerator rather than an inference engine. At best.

reply
leoedin
30 minutes ago
[-]
Even if this first generation is not useful, the learning and architecture decisions in this generation will be. You really can't think of any value to having a chip which can run LLMs at high speed and locally for 1/10 of the energy budget and (presumably) significantly lower cost than a GPU?

If you look at any development in computing, ASICs are the next step. It seems almost inevitable. Yes, it will always trail behind state of the art. But value will come quickly in a few generations.

reply
MITSardine
28 minutes ago
[-]
With LLMs this fast, you could imagine using them as any old function in programs.
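
e.g. (llm() being a placeholder for whatever client serves the model, not any particular API):

  # At this latency an LLM call starts to look like an ordinary helper function.
  def normalize_address(raw: str, llm) -> str:
      return llm(f"Rewrite this postal address as a single standardized line: {raw}")

  def looks_like_spam(message: str, llm) -> bool:
      return llm(f"Is this message spam? Answer yes or no: {message}").strip().lower().startswith("yes")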
reply
gozucito
1 hour ago
[-]
Can it scale to an 800 billion param model? 8B parameter models are too far behind the frontier to be useful to me for SWE work.

Or is that the catch? Either way I am sure there will be some niche uses for it.

reply
taneq
1 hour ago
[-]
Spam. :P
reply
Lionga
59 minutes ago
[-]
so 90% of the AI market?
reply
impossiblefork
1 hour ago
[-]
So I'm guessing this is some kind of weights as ROM type of thing? At least that's how I interpret the product page, or maybe even a sort of ROM type thing that you can only access by doing matrix multiplies.
reply
readitalready
1 hour ago
[-]
You shouldn't need any ROM. It's likely the architecture is just fixed hardware with weights loaded in via scan flip-flops. If it was me making it, I'd just design a systolic array. Just multipliers feeding into multipliers, without even going through RAM.
reply
retrac98
1 hour ago
[-]
Wow. I’m finding it hard to even conceive of what it’d be like to have one of the frontier models on hardware at this speed.
reply
Havoc
1 hour ago
[-]
That seems promising for applications that require raw speed. Wonder how much they can scale it up - 8B model quantized is very usable but still quite small compared to even bottom end cloud models.
reply
Adexintart
1 hour ago
[-]
The token throughput improvements are impressive. This has direct implications for usage-based billing in AI products — faster inference means lower cost per request, which changes the economics of credits-based pricing models significantly.
reply
Dave3of5
1 hour ago
[-]
Fast, but the output is shit due to the constrained model they used. Doubt we'll ever get something like this for the large-param, decent models.
reply
stego-tech
1 hour ago
[-]
I still believe this is the right - and inevitable - path for AI, especially as I use more premium AI tooling and evaluate its utility (I’m still a societal doomer on it, but even I gotta admit its coding abilities are incredible to behold, albeit lacking in quality).

Everyone in Capital wants the perpetual rent-extraction model of API calls and subscription fees, which makes sense given how well it worked in the SaaS boom. However, as Taalas points out, new innovations often scale in consumption closer to the point of service rather than monopolized centers, and I expect AI to be no different. When it’s being used sparsely for odd prompts or agentically to produce larger outputs, having local (or near-local) inferencing is the inevitable end goal: if a model like Qwen or Llama can output something similar to Opus or Codex running on an affordable accelerator at home or in the office server, then why bother with the subscription fees or API bills? That compounds when technical folks (hi!) point out that any process done agentically can instead just be output as software for infinite repetition in lieu of subscriptions and maintained indefinitely by existing technical talent and the same accelerator you bought with CapEx, rather than a fleet of pricey AI seats with OpEx.

The big push seems to be building processes dependent upon recurring revenue streams, but I’m gradually seeing more and more folks work the slop machines for the output they want and then put it away or cancel their sub. I think Taalas - conceptually, anyway - is on to something.

reply
Bengalilol
54 minutes ago
[-]
Does anyone have an idea how much such a component costs?
reply
kanodiaayush
1 hour ago
[-]
I'm loving summarization of articles using their chatbot! Wow!
reply
danielovichdk
55 minutes ago
[-]
Is this hardware for sale? The site doesn't say.
reply
niek_pas
1 hour ago
[-]
> Though society seems poised to build a dystopian future defined by data centers and adjacent power plants, history hints at a different direction. Past technological revolutions often started with grotesque prototypes, only to be eclipsed by breakthroughs yielding more practical outcomes.

…for a privileged minority, yes, and to the detriment of billions of people whose names the history books conveniently forget. AI, like past technological revolutions, is a force multiplier for both productivity and exploitation.

reply
clbrmbr
1 hour ago
[-]
What would it take to put Opus on a chip? Can it be done? What’s the minimum size?
reply
loufe
1 hour ago
[-]
Jarring to see these other comments so blindly positive.

Show me something at a model size 80GB+ or this feels like "positive results in mice"

reply
viraptor
1 hour ago
[-]
There are a lot of problems solved by tiny models. The huge ones are fun for large programming tasks, exploration, analysis, etc. but there's a massive amount of processing <10GB happening every day. Including on portable devices.

This is great even if it can't ever run Opus. Many people will be extremely happy about something like Phi accessible at lightning speed.

reply
hkt
1 hour ago
[-]
Positive results in mice also known as being a promising proof of concept. At this point, anything which deflates the enormous bubble around GPUs, memory, etc, is a welcome remedy. A decent amount of efficient, "good enough" AI will change the market very considerably, adding a segment for people who don't need frontier models. I'd be surprised if they didn't end up releasing something a lot bigger than they have.
reply
hkt
1 hour ago
[-]
Reminds me of when bitcoin started running on ASICs. This will always lag behind the state of the art, but incredibly fast, (presumably) power efficient LLMs will be great to see. I sincerely hope they opt for a path of selling products rather than cloud services in the long run, though.
reply
dsign
1 hour ago
[-]
This is like microcontrollers, but for AI? Awesome! I want one for my electric guitar; and please add an AI TTS module...
reply
brazzy
1 hour ago
[-]
No, it's ASICs, but for AI.
reply
MagicMoonlight
23 minutes ago
[-]
Jesus, it just generated a story in 0.039s.

Whoever doesn’t buy/replicate this in the next year is dead. Imagine OpenAI trying to sell you a platform that takes 15 minutes, when someone else can do it in 0.001s.

reply
PrimaryExplorer
20 minutes ago
[-]
this is absolutely mindblowing speed. imagine this with opus or 5.2
reply
baq
2 hours ago
[-]
one step closer to being able to purchase a box of llms on aliexpress, though 1.7ktok/s would be quite enough
reply
bloggie
1 hour ago
[-]
I wonder if this is the first step towards AI as an appliance rather than a subscription?
reply
hxugufjfjf
1 hour ago
[-]
It was so fast that I didn't realise it had sent its response. Damn.
reply
rotbart
1 hour ago
[-]
Hurrah, its dumb answer to the now classic "the car wash is 100m away, should I drive or walk?" appeared very quickly.
reply
Lalabadie
1 hour ago
[-]
It's an 8B parameter model from a good while ago, what were your expectations?
reply
moralestapia
1 hour ago
[-]
Wow, this is great.

To the authors: do not self-deprecate your work. It is true this is not a frontier model (anymore) but the tech you've built is truly impressive. Very few hardware startups have a v1 as good as this one!

Also, for many tasks I can think of, you don't really need the best of the best of the best, cheap and instant inference is a major selling point in itself.

reply
raincole
1 hour ago
[-]
It's crazily fast. But 8B model is pretty much useless.

Anyway VCs will dump money onto them, and we'll see if the approach can scale to bigger models soon.

reply
YetAnotherNick
1 hour ago
[-]
17k tokens/sec works out to $0.18/chip/hr for an H100-sized chip if they want to compete with the market rate[1]. But 17k tokens/sec could lead to some new use cases.

[1]: https://artificialanalysis.ai/models/llama-3-1-instruct-8b/p...
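
The shape of that calculation (the per-token price is a parameter; the value below is hypothetical, swap in whatever you treat as the market rate):

  # Revenue per chip-hour implied by a given per-token price
  tok_per_s = 17_000
  price_per_m_tok = 0.003          # hypothetical $/1M tokens
  print(f"~${tok_per_s * 3600 / 1e6 * price_per_m_tok:.2f} per chip-hour")  # ~$0.18 at this price point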

reply
notenlish
2 hours ago
[-]
Impressive stuff.
reply
small_model
1 hour ago
[-]
Scale this then close the loop and have fabs spit out new chips with latest weights every week that get placed in a server using a robot, how long before AGI?
reply
viftodi
1 hour ago
[-]
I tried the trick question I saw here before, about making 1000 with nine 8s and additions only.

I know it's not a reasoning model, but I kept pushing it and eventually it gave me this as part of its output:

888 + 88 + 88 + 8 + 8 = 1060, too high... 8888 + 8 = 10000, too high... 888 + 8 + 8 +ประก 8 = 1000,ประก

I googled the strange symbol, it seems to mean Set in Thai?

reply
danpalmer
1 hour ago
[-]
I don't think it's very valuable to talk about the model here, the model is just an old Llama. It's the hardware that matters.
reply
fragkakis
1 hour ago
[-]
The article doesn't say anything about the price (it will be expensive), but it doesn't look like something that the average developer would purchase.

An LLM's effective lifespan is a few months (i.e. the amount of time it is considered top-tier); it wouldn't make sense for a user to purchase something that would be superseded in a couple of months.

An LLM hosting service however, where it would operate 24/7, would be able to make up for the investment.

reply