FilterHN

thesz

4 hours ago

[-]

8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes 4 transistors and a register takes about the same. So one coefficient gets processed (multiplied, with the result added to a sum) with fewer than two two-input NAND gates' worth of transistors.

I think they used block quantization: one can enumerate all possible blocks up to permutation of their coefficients (i.e. sorted blocks) and, for each layer, place only those blocks that are needed there. For 3-bit coefficients and a block size of 4 coefficients, only 330 different blocks are needed.
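The 330 figure can be checked as a count of multisets: a sorted block of 4 coefficients drawn from 8 possible 3-bit values. A minimal sketch (the function name is made up for illustration):

```python
from itertools import combinations_with_replacement
from math import comb


def count_sorted_blocks(bits: int, block_size: int) -> int:
    """Count distinct blocks when the order of coefficients inside a block is ignored."""
    levels = 2 ** bits  # 3-bit coefficients -> 8 possible values
    # Each sorted block is one multiset of `block_size` values.
    return sum(1 for _ in combinations_with_replacement(range(levels), block_size))


n_blocks = count_sorted_blocks(bits=3, block_size=4)
# Closed form for multisets: C(levels + block_size - 1, block_size)
assert n_blocks == comb(8 + 4 - 1, 4)
print(n_blocks)  # 330
```

The count grows quickly with either parameter, so the trick only stays cheap for narrow coefficients and small blocks.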

Matrices in Llama 3.1 are 4096x4096, i.e. 16M coefficients. They can be compressed into only 330 blocks, if we assume that all coefficient permutations are present, plus a network that applies the correct permutations to inputs and outputs.

Assuming that blocks are the most area-consuming part, we have a per-block budget of about 250 thousand transistors, or 30 thousand 2-input NAND gates per block.

250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
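The same arithmetic, spelled out with the figures from this comment (the variable names are only for illustration):

```python
# Whole-chip figure: transistors available per stored coefficient.
TOTAL_TRANSISTORS = 53e9
TOTAL_COEFFICIENTS = 8e9
per_coefficient_chip = TOTAL_TRANSISTORS / TOTAL_COEFFICIENTS

# Per-matrix figure under the block-quantization guess above.
MATRIX_COEFFICIENTS = 4096 * 4096      # one 4096x4096 matrix
BLOCKS = 330                           # distinct sorted blocks
TRANSISTORS_PER_BLOCK = 250_000        # assumed per-block budget
per_coefficient_matrix = BLOCKS * TRANSISTORS_PER_BLOCK / MATRIX_COEFFICIENTS

print(round(per_coefficient_chip, 2))    # 6.62
print(round(per_coefficient_matrix, 2))  # 4.92
```

So the block estimate lands a bit under the whole-chip average, leaving some room for the permutation network and everything else.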

Looks very, very doable.

It does look doable even for FP4 - these are 3-bit coefficients in disguise.
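One way to read the "3-bit in disguise" remark, assuming FP4 here means the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1): the 16 codes contain only 8 distinct magnitudes, so the sign can be handled separately. A sketch under that assumption:

```python
def e2m1_value(code: int) -> float:
    """Decode a 4-bit E2M1 code (assumed layout: sign, 2 exponent bits, 1 mantissa bit)."""
    sign = -1.0 if (code >> 3) & 1 else 1.0
    exponent = (code >> 1) & 0b11
    mantissa = code & 1
    if exponent == 0:
        # Subnormal range: 0 or 0.5
        magnitude = 0.5 * mantissa
    else:
        # Normal range: (1 + m/2) * 2^(e - 1)
        magnitude = (1 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


# Collapse the sign: only 8 magnitudes remain, i.e. a 3-bit index.
magnitudes = sorted({abs(e2m1_value(code)) for code in range(16)})
print(magnitudes)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```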

1 hour ago

[-]

I'm looking forward to the model.toVHDL() method in PyTorch.

bsenftner

13 minutes ago

[-]

I'm surprised people are surprised. Of course this is possible, and of course this is the future. This has been demonstrated already: why do you think we even have GPUs at all?! Because we did this exact same transition from running in software to largely running in hardware for all 2D and 3D Computer Graphics. And these LLMs are practically the same math, it's all just obvious and inevitable, if you're paying attention to what we have, what we do to have what we have.

Hello9999901

6 hours ago

[-]

This would be a very interesting future. I can imagine Gemma 5 Mini running locally on hardware, or a hard-coded "AI core" like an ALU or media processor that supports particular encoding mechanisms like H.264, AV1, etc.

Other than the obvious costs (but Taalas seems to be bringing back the structured ASIC era so costs shouldn't be that low [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models but as the models further improve, I can totally see this inside fully local + ultrafast + ultra efficient processors.

[1] https://en.wikipedia.org/wiki/Structured_ASIC_platform

roncesvalles

5 hours ago

[-]

Well even programmable ASICs like Cerebras and Groq give many-multiples speedup over GPUs and the market has hardly reacted at all.

fooker

3 hours ago

[-]

> market has hardly reacted at all

Guess who acqui-hired Groq to push this into GPUs?

The name GPU has been an anachronism for a couple of years now.

3 hours ago

[-]

Seems both Nvidia (Groq) and OpenAI (Codex Spark) are now invested in the ASIC route one way or another.

owenpalmer

6 hours ago

[-]

> Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds one model and cannot be rewritten.

Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.

roncesvalles

5 hours ago

[-]

That slot is called USB-C. I can fully imagine inference ASICs coming in powerbank form factor that you'd just plug and play.

zupa-hu

3 hours ago

[-]

This would be a hell of a hot power bank. It uses about as much power as my oven. So probably more like inside a huge cooling device outside the house. Or integrated into the heating system of the house.

(Still compelling!)

fennecbutt

2 hours ago

[-]

*The whole server uses 2.2 kW or whatever, not a single board. I think that was for 8 boards or something.

1 hour ago

[-]

> USB-C

With these speeds you can run it over USB2, though maybe power is limiting.

Hendrikto

57 minutes ago

[-]

USB-C is just a form factor and has nothing to do with which protocol you run at which speeds.

ekianjo

3 hours ago

[-]

Not if you need 200 W of power to run inference.

stavros

2 hours ago

[-]

USB-C can do up to 240W. These days I power all my devices with a USB hub, even my Lipo charger.

XorNot

5 hours ago

[-]

Pretty sure it'd just be a thumbdrive. Are the Taalas chips particularly large in surface area?

dmurray

4 hours ago

[-]

The only product they've announced at the moment [0] is a PCI-e card. It's more like a small power bank than a big thumb drive.

But sure, the next generation could be much smaller. It doesn't require battery cells, (much) heat management, or ruggedization, all of which put hard limits on how much you can miniaturise power banks.

[0] https://taalas.com/the-path-to-ubiquitous-ai/

ChrisMarshallNY

2 hours ago

[-]

I’m old enough to remember your typical computer filling warehouse-sized buildings.

Nowadays, your average cellphone has more computing power than those behemoths.

I have a micro SD card with 256GB capacity, and I think they are up to 2TB. On a device the size of a fingernail.

thesz

4 hours ago

[-]

800 mm2, about 90mm per side, if imagined as a square. Also, 250 W of power consumption.

The form factor should be anything but thumbdrive.

pfortuny

4 hours ago

[-]

mmmhhhhh 800 mm² ≈ (30 mm)², which is more like a (biggish) thumb drive.

thesz

4 hours ago

[-]

Thanks!

I haven't had my coffee yet. ;)

baq

1 hour ago

[-]

the radiator wouldn't be though

6 hours ago

[-]

That's the kind of hardware I'm rooting for, since it'll encourage open-weight models and would be much more private.

In fact, I was wondering whether robots of the future could have such slots, so they can use different models depending on the task they're given. Like a hardware MoE.

NitpickLawyer

2 hours ago

[-]

> since it'll encourage open-weight models

Is this accurate? I don't know enough about hardware, but perhaps someone could clarify: how hard would it be to reverse engineer this to "leak" the model weights? Is it even possible?

There are some labs that sell access to their models (mistral, cohere, etc) without having their models open. I could see a world where more companies can do this if this turns out to be a viable way. Even to end customers, if reverse engineering is deemed impossible. You could have a device that does most of the inference locally and only "call home" when stumped (think alexa with local processing for intent detection and cloud processing for the rest, but better).

Someone

3 hours ago

[-]

Would somewhat work except for the power usage.

I doubt it would scale linearly, but for home use 170 tokens/s at 2.5 W would be cool; 17 tokens/s at 0.25 W would be awesome.

On the other hand, this may be a step towards positronic brains (https://en.wikipedia.org/wiki/Positronic_brain)

kilroy123

2 hours ago

[-]

This is what I've been wanting! Just like those eGPUs you would plug into your Mac. You would have a big model or device capable of running a top-tier model under your desk. All local, completely private.

8cvor6j844qw_d6

6 hours ago

[-]

A cartridge slot for models is a fun idea. Instead of one chip running any model, you get one model or maybe a family of models per chip at (I assume) much better perf/watt. Curious whether the economics work out for consumer use or if this stays in the embedded/edge space.

sixtyj

4 hours ago

[-]

Plug it into the skull bone. Neuralink + a slot for a model that you can buy in a grocery store instead of a prepaid Netflix card.

Onavo

5 hours ago

[-]

Yeah maybe you can call it PCIe.

MarcLore

1 hour ago

[-]

The form factor discussion is fascinating but I think the real unlock is latency. Current cloud inference adds 50-200ms of network overhead before you even start generating tokens. A dedicated ASIC sitting on PCIe could serve first token in microseconds.

For applications like real-time video generation or interactive agents that need sub-100ms response loops, that difference is everything. The cost per inference might be higher than a GPU cluster at scale, but the latency profile opens up use cases that simply aren't possible with current architectures.

Curious whether Taalas has published any latency benchmarks beyond the throughput numbers.

muyuu

55 seconds ago

[-]

latency and control, and reliability of bandwidth and associated costs

there are tasks that inherently benefit from being centralised away, like say coordination of peers across a large area - and there are tasks that strongly benefit from being as close to the user as possible, like low latency tasks and privacy/control-centred tasks

simultaneously, there's an overlapping pull to either side caused by the monetary interests of corporations vs users - corporations want as much as possible under their control, esp. when it's monetisable information but most things are at volume, and users want to be the sole controller of products esp. when they pay for them

we had dumb terminals already being pushed in the 1960s, the "cloud", "edge computing" and all forms of consolidation vs segregation periods across the industry, it's not going to stop because there's money to be made from the inherent advantages of those models and even the industry leaders cannot prevent these advantages from getting exploited by specialist incumbents

once leaders consolidate, inevitably they seek to maximise profit and in doing so they lower the barrier for new alternatives

ultimately I think the market will never stop demanding just having your own *** computer under your control and hopefully own it, and only the removal of this option will stop this demand; while businesses will never stop trying to control your computing, and providing real advantages in exchange for that, only to enter cycles of pushing for growing profitability to the point average users keep going back and forth

cpldcpu

5 hours ago

[-]

I wonder how well this works with MoE architectures?

For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.

With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.

At that point we are back to a chiplet approach...

pests

4 hours ago

[-]

For comparison I wanted to write on how Google handles MoE archs with its TPUv4 arch.

They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.

The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores which specialize in handling high-bandwidth, non-contiguous memory accesses.

Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.

*ed: SpareCores to SparseCores

3 hours ago

[-]

If each of the expert models were etched in silicon, it would still give a massive speed boost, wouldn't it?

I feel printing the ASIC is the main blocker here.

peteforde

18 minutes ago

[-]

I would appreciate some clarification on the "store 4 bits of data with one transistor" part.

This doesn't sound remotely possible, but I am here to be convinced.

ajb

9 minutes ago

[-]

They declined to say: https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

Except they say it's fully digital, so not an analog multiplier

3 hours ago

[-]

If we can print ASICs at low cost, this will change how we work with models.

Models would be available as USB plug-in devices. A dense < 20B model may be the best assistant we need for personal use. It is like graphics cards again.

I hope lots of vendors will take note. Open weight models are abundant now. Even at a few thousand tokens/second, low buying cost and low operating cost, this is massive.

kioku

1 hour ago

[-]

I’m just wondering how this translates to computer manufacturers like Apple. Could we have these kinds of chips built directly into computers within three years? With insanely fast, local on-demand performance comparable to today’s models?

xattt

1 hour ago

[-]

Is it possible to supplement the model with a diff for updates on modular memory, or would that severely impact perf?

baq

1 hour ago

[-]

this design at 7 transistors per weight is 99.9% burnt in the silicon forever.

arisAlexis

26 minutes ago

[-]

and run an outdated model for 3 years while progress is exponential? what is the point of that

r0b05

8 minutes ago

[-]

Yeah, the space moves so quickly that I would not want to couple the hardware with a model that might be outdated in a month. There are some interesting talking points but a general-purpose programmable ASIC makes more sense to me.

RobertDeNiro

9 minutes ago

[-]

It won’t stay exponential forever.

coppsilgold

2 hours ago

[-]

How feasible would it be to integrate a neural video codec into the SoC/GPU silicon?

There would be model size constraints and what quality they can achieve under those constraints.

Would be interesting if it didn't make sense to develop traditional video codecs anymore.

The current video<->latents networks (part of the generative AI model for video) don't optimize just for compression. And you probably wouldn't want variable size input in an actual video codec anyway.

rustybolt

5 hours ago

[-]

Note that this doesn't answer the question in the title, it merely asks it.

5 hours ago

[-]

Yeah, I wrote the blog post to wrap my head around the idea of 'how would someone even print weights on a chip?' or 'how would one even start to think in that direction?'.

I didn't explore the actual manufacturing process.

pixelmelt

5 hours ago

[-]

You should add an RSS feed so I can follow it!

5 hours ago

[-]

I don't post blogs often, so haven't added RSS there, but will do. I mostly post to my linkblog[1], hence have RSS there.

[1] https://www.anuragk.com/linkblog

alcasa

2 hours ago

[-]

Frankly the most critical question is if they can really take shortcuts on DV etc, which are the main reasons nobody else tapes out new chips for every model. Note that their current architecture only allows some LoRA-adapter-based fine-tuning; even a model with an updated cutoff date would require new masks etc. Which is kind of insane, but props to them if they can make it work.

From some announcements 2 years ago, it seems like they missed their initial schedule by a year, if that's indicative of anything.

For their hardware to make sense a couple of things would need to be true:

1. A model is good enough for a given usecase that there is no need to update/change it for 3-5 years. Note they need to redo their HW-Pipeline if even the weights change.

2. This application is also highly latency-sensitive and benefits from power efficiency.

3. That application is large enough in scale to warrant doing all this instead of running on last-gen hardware.

Maybe some edge-computing and non-civilian use-cases might fit that, but given the lifespan of models, I wonder if most companies wouldn't consider something like this too high-risk.

But maybe some non-text applications, like TTS, audio/video gen, might actually be a good fit.

K0balt

16 minutes ago

[-]

TTS, speech recognition, ocr/document parsing, Vision-language-action models, vehicle control, things like that do seem to be the ideal applications. Latency constraints limit the utility of larger models in many applications.

briansm

2 hours ago

[-]

I wonder if you could use the same technique (RAM models as ROM) for something like Whisper Speech-to-text, where the models are much smaller (around a Gigabyte) for a super-efficient single-chip speech recognition solution with tons of context knowledge.

m101

3 hours ago

[-]

So if we assume this is the future, the useful life of many semiconductors will fall substantially. What part of the semiconductor supply chain would have pricing power in a world of producing many more different designs?

Perhaps mask manufacturers?

ivan_gammel

2 hours ago

[-]

It might not be that bad. “Good enough” open-weight models are almost there, the focus may shift to agentic workflows and effective prompting. The lifecycle of a model chip will be comparable to smartphones, getting longer and longer, with orchestration software being responsible for faster innovation cycles.

m101

2 hours ago

[-]

If you’re running at 17k tokens / s what is the point of multiple agents?

ivan_gammel

1 hour ago

[-]

Different skills and context. Llama 3.1 8B has just 128k context length, so packing everything into it may not be a great idea. You may want one agent analyzing the requirements and designing the architecture, one writing tests, another one writing the implementation and a fourth one doing code review. With LLMs it also matters not just what you have in context, but also what is absent, so that the model will not overthink it.

EDIT: just in case, I define agent as inference unit with specific preloaded context, in this case, at this speed they don’t have to be async - they may run in sequence in multiple iterations.

rustyhancock

6 hours ago

[-]

Edit: reading the below it looks like I'm quite wrong here but I've left the comment...

The single transistor multiply is intriguing.

I'd assume they are layers of FMA operating in the log domain.

But everything tells me that would be too noisy and error prone to work.

On the other hand my mind is completely biased to the digital world.

If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.

Mulling it over, actually the noise probably doesn't matter. It'll average to 0.

It's essentially compute and memory baked together.

I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!

generuso

6 hours ago

[-]

The document referenced in the blog does not say anything about the single transistor multiply.

However, [1] provides the following description: "Taalas’ density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."

[1] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...

londons_explore

5 hours ago

[-]

It'll be different gates on the transistor for the different bits, and you power only one set depending on which bit of the result you wish to calculate.

Some would call it a multi-gate transistor, whilst others would call it multiple transistors in a row...

hagbard_c

4 hours ago

[-]

That, or a resistor ladder with 4 bit branches connected to a single gate, possibly with a capacitor in between, representing the binary state as an analogue voltage, i.e. an analogue-binary computer. If it works for flash memory it could work for this application as well.

rustyhancock

6 hours ago

[-]

That's much more informative, I think my original comment is quite off the mark then.

jsjdjrjdjdjrn

4 hours ago

[-]

I'd expect this is analog multiplication with voltage levels being ADC'd out for the bits they want. If you think about it, it makes the whole thing very analog.

jsjdjrjdjdjrn

4 hours ago

[-]

Note: reading further down, my speculation is wrong.

kinduff

5 hours ago

[-]

Very nice read, thank you for sharing this so well written.

abrichr

5 hours ago

[-]

ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:

"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]

"Mask Programmable ROM Using Shared Connections" [3]

The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.

The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.

Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815mm2 die.

If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.

Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.

[1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...

[2] https://patents.google.com/patent/WO2025147771A1/en

[3] https://patents.google.com/patent/WO2025217724A1/en

[4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
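The "multiplication by routing" reading of patent [2] above can be sketched in software. This is only an illustration of the idea as described in the comment, not Taalas' design: the value table `levels` and the function name are hypothetical, and real hardware would do the selection with wiring instead of list indexing.

```python
def matvec_by_routing(codes, inputs, levels):
    """Matrix-vector product where each weight only selects a pre-computed product.

    codes[i][j] is the 4-bit code of the weight at row i, column j.
    levels[c] is the numeric value that code c stands for (assumed table).
    """
    outputs = [0] * len(codes)
    for j, x in enumerate(inputs):
        # Shared multiplier bank: all 16 products of this input, computed once.
        products = [x * value for value in levels]
        for i in range(len(codes)):
            # Per-weight cell: pure selection of one product, no multiply.
            outputs[i] += products[codes[i][j]]
    return outputs


levels = list(range(-8, 8))          # hypothetical signed 4-bit value table
codes = [[0, 15, 8], [3, 3, 12]]     # 2x3 weight matrix stored as codes
inputs = [2, -1, 5]

# Reference: ordinary multiply-accumulate with decoded weights.
reference = [sum(levels[c] * x for c, x in zip(row, inputs)) for row in codes]
result = matvec_by_routing(codes, inputs, levels)
assert result == reference
print(result)  # [-23, 15]
```

The number of real multiplies is 16 per input regardless of how many rows there are, which is why the scheme gets more attractive as the matrix gets taller and less attractive as the bit width grows.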

generuso

4 hours ago

[-]

LSI Logic and VLSI Systems used to do such things in the 1980s -- they produced a quantity of "universal" base chips, and then relatively inexpensively and quickly customized them for different uses and customers, by adding a few interconnect layers on top. Like hardwired FPGAs. Such semi-custom ASICs were much less expensive than full custom designs, and one could order them in relatively small lots.

Taalas of course builds base chips that are already closely tailored for a particular type of model. They aim to generate the final chips with the model weights baked into ROMs within two months after the weights become available. They hope that the hardware will be profitable for at least some customers, even if the model is only good enough for a year. Assuming they do get superior speed and energy efficiency, this may be a good idea.

cpldcpu

5 hours ago

[-]

It could simply be bit-serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
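A rough software picture of what bit-serial means here, assuming unsigned 4-bit weights for simplicity (the function name is made up, and a real design would handle sign and accumulate across many weights):

```python
def bit_serial_multiply(activation: int, weight: int, weight_bits: int = 4) -> int:
    """Multiply by an unsigned weight with one conditional shifted add per weight bit."""
    assert 0 <= weight < (1 << weight_bits)
    acc = 0
    for bit in range(weight_bits):      # four steps for a 4-bit weight
        if (weight >> bit) & 1:         # weight bit read from the nearby ROM
            acc += activation << bit    # add the activation, shifted into place
    return acc


print(bit_serial_multiply(13, 11))  # 143
```

The hardware cost is an adder and a shifter instead of a full multiplier, paid for with four cycles per multiply.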

punnerud

4 hours ago

[-]

Could we all get bigger FPGAs and load the model onto it using the same technique?

generuso

4 hours ago

[-]

You could [1], but it is not very cheap -- the 32GB development board with the FPGA used in the article used to cost about $16K.

[1] https://arxiv.org/abs/2401.03868

fercircularbuf

4 hours ago

[-]

I thought about this exact question yesterday. Curious to know why we couldn't, if it isn't feasible. Would allow one to upgrade to the next model without fabricating all new hardware.

wmf

4 hours ago

[-]

FPGAs have really low density so that would be ridiculously inefficient, probably requiring ~100 FPGAs to load the model. You'd be better off with Groq.

menaerus

4 hours ago

[-]

Not sure what you're on but I think what you said is incorrect. You can use a high-density, HBM-enabled FPGA with (LP)DDR5 and a sufficient number of logic elements to implement the inference. The reason we don't see it in action is most likely that such FPGAs are insanely expensive and not as available off the shelf as GPUs are.

londons_explore

5 hours ago

[-]

So why only 30,000 tokens per second?

If the chip is designed as the article says, they should be able to do 1 token per clock cycle...

And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...

wmf

4 hours ago

[-]

You still need to do a forward pass per token. With massive batching and full pipelining you might be able to break the dependencies and output one token per cycle but clearly they aren't doing that.
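A toy model of that point, assuming a fully pipelined chip where one forward pass takes `pipeline_depth` clock cycles and a sequence cannot start its next token until the previous one has come out the far end. All numbers below are hypothetical placeholders, not Taalas figures:

```python
def tokens_per_second(clock_hz: float, pipeline_depth: int, streams: int) -> float:
    """Throughput of a pipeline shared by independent autoregressive sequences."""
    # Each sequence can have only one token in flight, so at most
    # `pipeline_depth` sequences can keep every stage busy.
    in_flight = min(streams, pipeline_depth)
    return clock_hz * in_flight / pipeline_depth


CLOCK_HZ = 1e9   # hypothetical 1 GHz clock
DEPTH = 1000     # hypothetical cycles per forward pass

single = tokens_per_second(CLOCK_HZ, DEPTH, streams=1)
saturated = tokens_per_second(CLOCK_HZ, DEPTH, streams=DEPTH)
print(single, saturated)
```

One token per cycle only shows up in the saturated case, i.e. with as many independent sequences in flight as there are pipeline stages.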

1 hour ago

[-]

More aggressive pipelining will probably be the next step.

menaerus

4 hours ago

[-]

Reading from and to memory alone takes much more than a clock cycle.

708145_

3 hours ago

[-]

Is Taalas' approach scalable to larger models?

3 hours ago

[-]

Who's going to pay for custom chips when they shit out new models every two weeks and their deluded CEOs keep promising AGI in two release cycles?

spyder

1 hour ago

[-]

It all depends on how cheap they can get. And another interesting thought: what if you could stack them? For example you have a base model module, then new ones come out that can work together with the old ones and expand their capabilities.

NinjaTrance

3 hours ago

[-]

To run Llama 3.1 8B locally, you would need a GPU with a minimum of 16 GB of VRAM, such as an NVIDIA RTX 3090.

Taalas promises 10x higher throughput, at 10x lower cost and using 10x less electricity.

Looks like a good value proposition.
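The 16 GB figure above matches 16-bit weights for 8B parameters. A quick back-of-the-envelope for other bit widths, counting weights only (no KV cache or activations; the helper name is just for illustration):

```python
def weight_bytes(params: float, bits_per_weight: int) -> float:
    """Storage needed for the weights alone."""
    return params * bits_per_weight / 8


PARAMS = 8e9  # Llama 3.1 8B, rounded
for bits in (16, 8, 4):
    print(bits, "bit:", weight_bytes(PARAMS, bits) / 1e9, "GB")
```

At 4 bits per weight the same model needs about a quarter of the storage, which is the regime the chip discussed here works in.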

2 hours ago

[-]

What do you do with 8B models? They can't even reliably create a .txt file or do any kind of tool calling.

lancebeet

2 hours ago

[-]

You obviously don't believe that AGI is coming in two release cycles, and you also don't seem to have much faith in the new models containing massive improvements over the last ones. So the answer to who is going to pay for these custom chips seems to be you.

2 hours ago

[-]

Why would I buy chips to run handicapped models when the 10+ LLM players all offer free-tier access to their 1T+ parameter models?

K0balt

8 minutes ago

[-]

Not all applications are chatbots. Many potential uses for LLMs/VLAMs are latency constrained.

3 hours ago

[-]

New GPUs come out all the time. New phones come out (if you count all the manufacturers) all the time. We do not need to always buy the new one.

Current open weight models < 20B are already capable of being useful. With even 1K tokens/second, they would change what it means to interact with them or for models to interact with the computer.

3 hours ago

[-]

hm yeah I guess if they stick to shitty models it works out, I was talking about the models people use to actually do things instead of shitposting from openclaw and getting reminders about their next dentist appointment.

2 hours ago

[-]

The trick with small models is what you ask them to do. I am working on a data extraction app (from emails and files) that works entirely locally. I applied for the Taalas API because it would be an awesome fit.

dwata: Entirely Local Financial Data Extraction from Emails Using Ministral 3 3B with Ollama: https://youtu.be/LVT-jYlvM18

https://github.com/brainless/dwata

imtringued

2 hours ago

[-]

Considering that enamel regrowth is still experimental (only curodont exists as a commercial product), those dentist appointments are probably the most important routine healthcare appointments in your life. Pick something that is actually useless.

1 hour ago

[-]

I'm guessing this development will make the fabrication of custom chips cheaper.

Exciting times.

imtringued

2 hours ago

[-]

Almost all LLM companies have some sort of free tier that does nothing but lose them money.

moralestapia

5 hours ago

[-]

>HOW NVIDIA GPUs process stuff? (Inefficiency 101)

Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly in making computation more efficient (low power/high throughput).

Then proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there ...

5 hours ago

[-]

Hey, can you please point out the inaccuracies in the article?

I wrote this post to get a higher-level understanding of traditional vs Taalas's inference. So it does abstract away lots of things.

wmf

4 hours ago

[-]

Arguably DRAM-based GPUs/TPUs are quite inefficient for inference compared to SRAM-based Groq/Cerebras. GPUs are highly optimized but they still lose to different architectures that are better suited for inference.

imtringued

1 hour ago

[-]

Modern Nvidia GPUs perform inference with the help of a dedicated unit (the tensor memory accelerator) that directly performs tensor memory operations, which in itself concedes that GPGPU as a paradigm is too inefficient for matrix multiplication.

villgax

5 hours ago

[-]

This read itself is slop lol, literally dances around the term printing as if it's some inkjet printer