2. This could have been a single web page that runs in your browser and lets you enter hardware specs, like the other tools of this kind. Installing and running unknown projects like this on your computer is not a good idea these days.
3. The project is very obviously vibecoded, down to the README.
4. Every comment from this account appears to be AI generated too.
I would recommend not installing and running this on your computer. There is no advantage over other tools and everything about the account and project looks like low effort AI generated content.
[0] https://artificialanalysis.ai/?models=gpt-oss-120b%2Cgemma-4...
The estimates seem far off as well. I took https://www.canirun.ai/model/gpt-oss-120b as an example with an RTX Pro 6000: every single number is off, and it notably lacks an estimate for the most important quant for GPT-OSS, the MXFP4 variant.
I run a DGX Spark, and the results here are soooo incomplete for my platform that I can’t trust this site (for my use case).
"39d ago" in AI time is like 1 year outdated info.
Showing quality loss per quantization is nice.
I'd prefer this as a website, since I'd run the model with a dedicated inference server anyway.
It would be nice to see what's the maximum context length that can fit on top of the baseline.
I was surprised how much token generation speed tanks when using very long context. 30/s can drop down to 2/s. A single speed metric didn't prepare me for that.
I was also positively surprised that some models scale well with batch parallelism. I can get 4x speed improvement by running 8 requests in parallel. But this affects memory requirements, and doesn't apply to all models and inference engines. It would be nice to show that. Some sites fold it into "what's your workflow", but that's too opaque.
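To make the batching trade-off concrete, here is a minimal arithmetic sketch. The function names are hypothetical, the 30 tokens/s single-stream figure is only an illustrative input, and it assumes every parallel request keeps its own KV cache (no prefix sharing), which may not hold for every inference engine.

```python
def batched_throughput(single_stream_tps, n_parallel, aggregate_speedup):
    """Return (aggregate tokens/s, tokens/s seen by each request).

    aggregate_speedup is the measured total speedup over a single stream,
    e.g. 4.0 for "4x faster with 8 requests in parallel".
    """
    aggregate_tps = single_stream_tps * aggregate_speedup
    per_request_tps = aggregate_tps / n_parallel
    return aggregate_tps, per_request_tps


def total_kv_tokens(context_per_request, n_parallel):
    """Tokens of KV cache to budget for, assuming no sharing between requests."""
    return context_per_request * n_parallel


if __name__ == "__main__":
    total, each = batched_throughput(30.0, n_parallel=8, aggregate_speedup=4.0)
    print(f"aggregate: {total} tok/s, per request: {each} tok/s")
    print(f"KV cache budget: {total_kv_tokens(8192, 8)} tokens")
```

So a 4x aggregate gain across 8 requests means each individual request runs at half the single-stream speed, while the KV cache budget grows with the number of requests.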
KV cache quantization also makes a difference for speed, VRAM usage and max usable context.
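A rough sketch of how KV cache precision feeds into maximum context, using the usual estimate for standard (grouped-query) attention: two tensors (K and V) per layer, per KV head, per head dimension, per token. The model dimensions and memory figures below are made-up illustrative values, not measurements, and the estimate ignores activation buffers, sliding-window layers, and engine-specific overhead.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element):
    """Approximate KV cache bytes needed for one token of context.

    The factor 2 accounts for storing both keys and values.
    bytes_per_element is 2 for fp16, 1 for 8-bit, 0.5 for 4-bit caches.
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def max_context_tokens(vram_bytes, weights_bytes, per_token_bytes, overhead_bytes=0):
    """Tokens of context that fit in the memory left after loading the weights."""
    free = vram_bytes - weights_bytes - overhead_bytes
    if free <= 0:
        return 0
    return int(free // per_token_bytes)


if __name__ == "__main__":
    GIB = 2**30
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
    for label, width in (("fp16", 2), ("8-bit", 1)):
        per_token = kv_cache_bytes_per_token(32, 8, 128, width)
        tokens = max_context_tokens(24 * GIB, 16 * GIB, per_token)
        print(f"{label} KV cache: {per_token} bytes/token, about {tokens} tokens fit")
```

Under these assumptions, halving the cache precision doubles the context that fits in the same leftover memory, which is why a single "fits / doesn't fit" answer hides a lot.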
On Apple Silicon, MLX-compatible model builds make a difference, so I'd like the benchmarks to confirm they're based on the fastest implementation.
Multi-token-prediction is another aspect that may substantially change speed.
It seems pretty rubbish, I have to say. It's recommending me loads of Qwen 2.5 models, which are really old, and I'm easily running qwen3.5 and 3.6 models on this Mac at decent quants.
“I release software now, good luck everyone”
[0] https://github.com/AlexsJones/llmfit/tree/main/llmfit-web
Edit: I tried to deploy a snapshot of the llmfit-web files on Netlify but it seems to need/want to talk to a backend[1]
I just use llmfit.
If "biggest that fits" is the answer you want, llmfit is the simpler tool and Python won't matter to you. If you want "which fitting model is worth running," that ranking layer is the whole reason whichllm exists. Different jobs — I'd genuinely send fit-only users to llmfit.
If I ever decide to actually publish the site, is it all right if I mention you somewhere, something like "If you want a more accurate estimate, check out this project: <your repo>"? I think there is value in a simple website that estimates this information for you and gives you instructions and common flags for starting the model yourself (plus a prompt you can optionally give to an LLM to set it up for you). But I'm going off a simple "choose an OS and GPU/VRAM, here's a list of options" flow rather than actually scanning the hardware, which is a lot more accurate.
So two questions there:
(1) Is it actually possible to get good results with them? Some people said they got good results, which implies they may be hard to get running properly but are actually good once you do. That leads to the second question:
(2) are benchmarks a spook?
---
...Also, is OP Claude?
An uncensored qwen3.5/3.6 is more fun
Also, my personal rule of thumb for local AI sizing is:
max model size (GB) = ram (GB) / 1.65
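As a quick sketch of that rule of thumb in code (the function names are hypothetical, and the 1.65 divisor is just the commenter's personal heuristic, not a measured constant):

```python
def max_model_size_gb(ram_gb, divisor=1.65):
    """Largest model file size (GB) the rule of thumb suggests for a given RAM size."""
    return ram_gb / divisor


def fits(model_size_gb, ram_gb, divisor=1.65):
    """True if the model is within the rule-of-thumb limit."""
    return model_size_gb <= max_model_size_gb(ram_gb, divisor)


if __name__ == "__main__":
    for ram in (16, 32, 64, 128):
        print(f"{ram} GB RAM: up to about {max_model_size_gb(ram):.1f} GB of model")
```

The headroom left by the divisor is presumably meant to cover the KV cache, the OS, and other running software, so it will be too generous for very long contexts.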
GPU 0: STRXLGEN — 8.0 GB (ROCm 6.19.8-200.fc43.x86_64) — BW: N/A CPU: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S — 16 cores (AVX2, AVX-512)
The 8GB is the reserved memory, but it's not the total available memory to the GPU.
Linux sets the unified memory like this: https://www.jeffgeerling.com/blog/2025/increasing-vram-alloc...
Don't feel bad though, nvtop doesn't do it correctly either.
I’ve been using RapidMLX for this. The integrated speed tests matter because the quality of the backend is a moving target, and the quantization and MLX format conversion also matter. It’s not enough to say “oh, use this model family with X parameters”; you have to add the architecture-specific quantization too.