For a long time, too. Programming languages rarely change much, techniques rarely change, so I should be able to use said model for I hope at least five years; and if at any time they optimize local models to cram even more intelligence into the same amount of VRAM, I can upgrade to that.
I like this path.
I run my word processing software on my apple 2 (a total joke of a computer) instead of running it on the WANG.
I run my book keeping software on visicalc instead of the IBM.
I run my simulation software on my IBM PC (I even paid for the 8087!) instead of the VAX.
Moore's law has, at least so far, allowed the pioneers with toy computers to grow their toys big enough to solve "big boy" problems after some time has allowed the toy computers to be faster and the pioneers have scaled their crappy home-grown solution to solve their 60% of the problem that was originally solved by some enormous complex system.
Eventually the toy infrastructure gets expensive and solves 90-120% of the "big iron" problem space, but it also grows to cost as much as the big iron solution.
See also http://www.catb.org/jargon/html/W/wheel-of-reincarnation.htm...
People -- WANT -- this technology on their home devices and (apparently?) the providers of this tech don't seem to be running a profit so they probably don't want the maintenance tail on their side either.
I think it's a bit different. Inevitable that this becomes a household-run thing? Not likely.
Paradoxically, the better results we get from general harness of coding agents, the less moat Claude and co. get. It's unbelievably how fast some open models outpaced frontier models of just a few months ago.
I think you've misunderstood what good enough means in the context - which is a model capable of completing the tasks assigned to it without having the breadth of full generalization. Your analogy breaks down because of this - we did get 'good enough' spec profiles for different hardware. That thing you're wearing on your wrist won't have the same specifications as the box you use to play games.
> a model capable of completing the tasks assigned to it
The thing is, the "task assigned to it" is changing with improved capabilities. If everyone around you in 2036 is using general AI to do amazing stuff, you will probably have little interest in vibe coding slop like it's 2026.
That's correct. The problem is they have smart people, tons of money, and several years to figure that out, and the best thing they can come up is a coding agent.
Sadly - it's going to be ads. Advertising is going to get in there and enshittify the whole thing because as always, advertising income is too easy and too plentiful for any company to resist.
Right now the models are fairly agnostic, but we are a hair-breadth away from ChatGPT responding with, "the right tool for this job is a circular saw - something like the Milwaulkee M18, which happens to be on sale at Home Depot this weekend."
I ended up getting a modern 26B MoE model (Gemma 4) running at reading speed on an old recycled server with a single Xeon E5-2620 v4 and 128GB of DDR3 RAM (and no GPU). It took a lot of work, but it actually worked out somehow.
I've also linked the quants at the end, but they're not gonna run unless you use the ik_llama-cpp fork I mention, see other posts for more details.
I'm not an ML engineer, so I'm by no means an expert, and the server is busy acting as a Nix cache, but if you have any question, I can try to answer, but best effort.
But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?
I also dont understand the explanation of "--cpu-moe". If an expert has ~ 4.0 GiB of Parameters, why does optimizing the sequence of experts minimize cash trashing? With 20 MiB of L3 Cash vs 4.0 GiB of Parameters, it wont cash any noticeable amount of the Parameters, will it?
As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.
Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)
Everything else doesn't work
https://www.intel.com/content/www/us/en/products/sku/92986/i...
Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.
Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).
You say it runs "at reading speed". Have you benchmarked it?
Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.
> You say it runs "at reading speed". Have you benchmarked it?
At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
Gives:
llama_print_timings: load time = 83911.65 ms
llama_print_timings: sample time = 26.99 ms / 128 runs ( 0.21 ms per token, 4742.15 tokens per second)
llama_print_timings: prompt eval time = 343.41 ms / 7 tokens ( 49.06 ms per token, 20.38 tokens per second)
llama_print_timings: eval time = 10639.36 ms / 127 runs ( 83.77 ms per token, 11.94 tokens per second)
llama_print_timings: total time = 11114.98 ms / 134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.When I do it properly, I'll add it to the blog as well!
A GPU typically processes close to 1000 tokens/s during eval.
Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.
The maximum parallel width becomes equal to the number of layers in the model. Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).
In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.
Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.
EDIT: I stand corrected, 200W is apparently way too high of an estimate. I used to run a bunch of old Xeon servers and they slurped watts like crazy, but I can't remember which ones exactly those were.
There's a lot of budget hosting built around chips like these, and they're suprisingly power efficient.
[1] https://www.intel.com/content/www/us/en/products/sku/92986/i...
You likely need to replace the flow-through server chassis system with an active "normal" cooler to achieve a bit of silence.
85W might be about right. My old server CPU is in the same ballpark and compiling kernels it reached about 90w in power usage. If you want to keep it running: idle is not very low power unless you have one of the "low power" L versions, keep that in mind.
Well, you can use it for lots of other things as well.
Compared to the cloud you can probably save up to buy a new server every month. And don't underestimate the gains of having something to experiment on and play with.
An impressive effort, and better than I would have thought possible on this hardware -- but still pretty far short of what one needs for an satisfactory interactive session.
Here's my setup. You may want to figure out what the best optimizations are for your specific CPU like AVX2 because mine didn't have most of them. I did try MTP briefly but I wasn't getting performance improvements. You could play around with the batch sizes for cache or context or go even lower for Q2 and don't overcommit on threads either, but I would suggest either defaults or trying out llama-bench. This isn't by any means the best I assume but it worked decently for me and I sometimes swap out Gemma for Qwen. You could also lower q8_0 to q4_0 for more context but it could hurt quality some say, altough I have noticed it too on some models.
# Building
cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_NATIVE=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_OPENMP=ON
# Running
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
OPENBLAS_NUM_THREADS=4 OMP_NUM_THREADS=4 \
llama.cpp/build/bin/llama-server -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL --temp 1.0 --top-p 0.95 --top-k 64 --min-p 0.00 --jinja --host 0.0.0.0 --port 8080 --cache-type-k q8_0 --cache-type-v q8_0 --threads 4 --threads-batch 4 --ctx-size 8192 -n 8192 --batch-size 2048 --ubatch-size 512 --no-mmap --mlock --chat-template-kwargs '{"enable_thinking":false}' --no-mmproj -np 1 -fa 1
Should the market react to the memory shortage, the progress of the Apple silicon continue at the same pace, and what we’ll be able to run locally in 6 years will be very exciting. or frightening.
Also I don’t know what this means for the valuation of the AI companies. I remember asking about this very idea to one of their employees at an event and instead of answering he bailed out to grab a cocktail.
- There is no "moat" (lasting, easy-to-defend technological edge) in AI model businesses. There are just short-term advantages.
- An AI business is a capital-intensive business, just like old factories. Data centers are expensive, models are energy-hungry, and the hardware inside must be replaced every 3–4 years.
- Smaller, specialized models eat margins from below. Transcription, voice, or image detection do not need large models.
There is no reason to expect high margins like you can in traditional software business. Benefits of AI go mostly to consumers.
edit: There is potential for economies of scale. Few megacorps can strive for cost advantage when they achieve scale (Microsoft, Google, Amazon and Meta)
It does seem like the structural characteristics we’ve observed so far suggest there is a kind of flywheel from short-term to long-term advantage due to the capital requirements at various levels.
If you’re Nvidia, making the best GPUs today, the expanding wavefront of demand is consuming them with volume and margins to give you a huge edge in building out the best next generation of GPUs. Similar to how the mobile wave gave TSMC sustained advantage for about a decade now.
I’m guessing this is also what we’re seeing as Anthropic and OpenAI swap spots in the token-vendor market.
If you get a not-quite-the-best gaming GPU like a 5080, you can run local models that are better than the state of the art from early 2025. Depending on what you want to do, you might have to switch models. The one size fits all huge models are still a data center thing.
> The E5-2620 v4 is great. Have been using it for 10 years now.
10 years? Damn, that is a long time. I always assumed that heat-induced damage will kill a CPU after a certain amount of time (5-7 years). Am I wrong here? I assume yes. Or are CPUs must stronger/tougher than the bad old days?Even then, if a commodity chip isn't pushed full tilt at all times, and assuming that the venting and dissipation are adequate, a commodity chip can last a long time.
Bearings in fans, caps etc. are also stuff that you need to keep an eye on.
I just replaced a i5-660 thats been powered on since 2010 24/7, heatpaste was fucked so it crashed during heavy loads :)
Which makes sense I suppose.
At the low end, I'd use old Xeons with gobs of DDR3, install some V100s, run a smaller agent for general chat inquiries, and a frontier model for the deeper stuff, with a router that passes between them depending on the complexity.
The frontier model would perform very slowly, but if it's a deep task the user can submit it in a batch in the evening e.g. "Correlate all of these cases and look for patterns" then receive the output with morning coffee.
Of course, AI helped me work out a plan for this. Haha
Also I feel like everyone leaves off prompt processing/prefill speeds in these articles. If you are using a very small prompt and asking for mostly generated tokens, sure but I'd love to know the time-to-response of asking for an analysis of an image or a few hundred lines of code.
Admittedly web browsers and it don't get along that well. Literally the only thing that drags though on my Slackware 15 system, and even then usually only when it gets to around 15 or so open tabs.
High-Performance AI on a Budget: Optimizing llama.cpp for Qwen3.5 Inference on a Dual-GPU HP Z440
What was the net effect of the optimisations? How much faster did it get?
Plus many boards also support CXL for RAM expansion over PCI 5!
Source: building a hybrid inference business for regulated industry workloads.
I'd love if anyone knows how I might fare with an old Dell R710 with 2 x Xeon 5600 (12 cores total) and 96Gb of DDR3.
So you'd change the invocation slightly here, but a lot of things you can potentially reuse.
That said, the Gemma 4 E4B models have so far in my experience been... not great when it comes to long context, but they are very passable for basic tasks, and even seem surprisingly okay at tool calls.
~/ik_llama.cpp[main]$ build/bin/llama-cli --model ~/models/Qwen3.6-35B-A3B-Claude-4.7-Opus-Reasoning-Distilled.IQ4_XS.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --spec-autotune -cnv --color --jinja --special -smgs -sas -mea 256 --temp 0.7 -t 6 --parallel 6 --cpu-moe --merge-up-gate-experts --flash-attn on --mla-use 3 --mlock --run-time-repack --no-kv-offload . works pretty fast, at about 15 t/s:
llama_print_timings: sample time = 45.28 ms / 404 runs ( 0.11 ms per token, 8921.67 tokens per second) llama_print_timings: prompt eval time = 949.42 ms / 51 tokens ( 18.62 ms per token, 53.72 tokens per second) llama_print_timings: eval time = 24067.08 ms / 400 runs ( 60.17 ms per token, 16.62 tokens per second) llama_print_timings: total time = 242192.55 ms / 451 tokens
so i wonder why the params used by the quantified qwen model use way less memory than the ones of gemma.
https://pcpartpicker.com/products/motherboard/#s=20028,20029...
Model name: Intel(R) Xeon(R) CPU E3-1265L V2 @ 2.50GHz
Mainboard Product Name: P8Z77 WS
GPU 05:00.0 VGA compatible controller: NVIDIA Corporation AD106 [GeForce RTX 4060 Ti 16GB] (rev a1)
05:00.1 Audio device: NVIDIA Corporation AD106M High Definition Audio Controller (rev a1)
Memory: 32GBThis works.
https://www.techpowerup.com/cpu-specs/ryzen-7-4800u.c2281
It is way too slow
Totally just vibes based, I think it goes up to 20+ tps when it's not under load (and that's me trying to be conservative). For context, reading speed at 250 wpm would be around 5 to 6 tokens per second.
I use LM studio and qwen3.5 35B - but never figured out if it is swapping or not.
Om am unrelated note, does anyone know a model that can help with this use case:
Would love to see the benchmarks if someone actually pulls something like that off.
numactl --membind=1
so it is constrained to one of the memory sticks which speeds up token generation a little.
As for the second part, I am not sure about how living in a five eyes state would mitigate it. What do you mean by that?
Uh. Uuuh.
No?
___
Also
> While a GPU has a massive pool of ultra-fast High-Bandwidth Memory (HBM), a CPU relies on small, lightning-fast “caches” (L1, L2, L3) built directly onto the processor chip.
What purpose does the quoting of "caches" serve there? Is this AI writing written by that model running on that host?