Given the size of frontier models, I would assume they can incorporate many specializations, and the most lasting thing here is the training environment.
But there is probably already some tradeoff: GPT-3.5 was awesome at chess, and current models don't seem to be trained extensively on chess anymore.
Right now, I believe we're seeing that the big general-purpose models outperform approximately everything else. Special-purpose models (essentially fine-tunes) of smaller models make sense when you want to solve a specific task at lower cost and lower latency, transferring some or most of the abilities in that domain from a bigger model to a smaller one. Usually people don't do that, because it's quite a costly process and the frontier models develop so rapidly that you're perpetually behind them (so in fact you're not providing the best possible abilities).
If/when frontier model development speed slows down, training smaller models will make more sense.
You don't believe this has already started? It seems to me that we're well into a massive slowdown.
In practice, I upgraded everything to GPT-5 and the performance was so terrible I had to rollback the update.
Depends on what you compare it to. For those of us who were using o3 / o1 Pro Mode before GPT-5, the new model isn't that huge of a leap, compared to the jump from whatever existed before Pro Mode.
So even though you have high taxes and a restrictive alcohol policy, the end result is shops that have high customer satisfaction because they have very competent staff, excellent selection and a surprisingly good price for quality products.
The downsides are the limited opening hours and the absence of cheap low-quality wine - the tax disproportionately impacts the low-quality stuff; almost nobody will buy shitty wine at $7 per bottle when the decent stuff costs $10, so the shitty wine just doesn't get imported. But for most of the population these are minor drawbacks.
Wow, I am so curious - can you provide a source?
As someone who occasionally plays chess, I'm very interested in a chess benchmark for LLMs. I've thought about building something like this; it would be very interesting to find the best model at chess that isn't Stockfish or Leela but a general-purpose large language model.
I also agree that there might be an explosion of purpose-trained LLMs. I had this idea about a year ago, when Llama was around and before DeepSeek: what if I want to write SvelteKit, and there are models like DeepSeek that know about SvelteKit, but they are so big and bloated when all I want is a SvelteKit/Svelte model? Yes, there are arguments that you need the whole network to get better quality, but right now I genuinely feel that "better quality" is debatable thanks to all this benchmark-maxxing. I would happily take a model trained on SvelteKit at preferably 4B-8B parameters, but even if an extremely good, SOTA-ish SvelteKit model were around 30-40B I'd be happy, since I could buy a GPU for my PC to run it or run it on my Mac.
I think my brother, who actually knows what he's talking about in the AI space (unlike me), also said the same thing to me a few months back.
In fact, it's funny: a few months ago I had asked him to please create a website comparing benchmarks of AI playing chess, with an option to make two LLMs play against each other while we watch, or to play against an LLM ourselves on an actual chess board on the web, and more. I gave him this idea right after that talk about small LLMs, and he said it was good but that he was busy at the time. I think he later forgot about it, and I had forgotten about it too until now.
Key memory unlocked. I had an Aha moment with this article, thanks a lot for sharing it, appreciate it.
As far as I remember, it's post-training that kills chess ability for some reason (GPT-3 wasn't post-trained).
Domain-specific models have been on the roadmap for most companies for years now, from both a competitive perspective (why give up your moat to OpenAI or Anthropic) and a financial one (why finance OpenAI's margins).
That you can individually train and improve smaller segments as necessary
only uses 1/18th of the total parameters per token. It may still use a large fraction of them over the course of a single query.
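For intuition, a toy sketch of top-k expert routing (made-up numbers and a random stand-in for the router, not this model's actual routing): per token only a few experts fire, but over a whole query most of them end up getting touched.

    import numpy as np

    # Toy mixture-of-experts routing: 128 experts, 8 active per token (made-up numbers).
    rng = np.random.default_rng(0)
    n_experts, top_k, n_tokens = 128, 8, 512

    used = set()
    for _ in range(n_tokens):
        scores = rng.normal(size=n_experts)      # stand-in for the router's logits
        active = np.argsort(scores)[-top_k:]     # top-k experts chosen for this token
        used.update(active.tolist())

    print(f"experts per token: {top_k}/{n_experts}")
    print(f"experts touched over the query: {len(used)}/{n_experts}")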
To meet this challenge, we introduce Game-TARS: a next-generation generalist game agent designed to master complex video games and interactive digital environments using human-like perception, reasoning, and action. Unlike traditional game bots or modular AI frameworks, Game-TARS integrates all core faculties—visual perception, strategic reasoning, action grounding, and long-term memory—within a single, powerful vision-language model (VLM). This unified approach enables true end-to-end autonomous gameplay, allowing the agent to learn and succeed in any game without game-specific code, scripted behaviors, or manual rules.
With Game-TARS, this work is not about achieving the highest possible score in a single game. Instead, our focus is on building a robust foundation model for both generalist game-playing and broader computer use. We aim to create an agent that can learn to operate in any interactive digital environment it encounters, following instructions just like a human.
So yeah, I think there are different levels of thinking; maybe future models will have some sort of internal models once they recognize the patterns of some level of thinking. I'm not that knowledgeable about the internal workings of LLMs, so maybe this is all nonsense.
Not too different from a lot of consulting reports, in fact, and pretty much of no value if you're actually trying to learn something.
Edit to add: even the name "deep research" feels to me like something designed to appeal to people who have never actually done or consumed research, sort of like the whole "PhD level" thing.
For sure it's probably missing stuff that a well-paid lawyer would catch, but for a project with zero budget it's a massive step up over spending hours reading through search results and trying to cobble something together myself.
Whereas with real legal advice, your lawyer will carry Professional Indemnity Insurance which will cover any costs incurred if they make a mistake when advising you.
As you say, it's a reasonable trade-off for you to have made when the alternative was sifting through the legislation in your own spare time. But it's not actually worth very much, and you might just as well have used a general model to carry out the same task and the outcome would likely have been much the same.
So it's not particularly clear that the benefits of these niche-specific models or specialised fine-tunes are worth the additional costs.
(Caveat: things might change in the future, especially if advancements in the general models really are beginning to plateau.)
ask a loaded "filter question" I more or less know the answer to, and mostly skip the prose and go straight to the links to its sources.
I wrote it back when AI web search was a paid feature and I wanted access to it.
At the time Auto-GPT was popular and using the LLM itself to slowly and unreliably do the research.
So I realized a Python program would be way faster and it would actually be deterministic in terms of doing what you expect.
This experience sort of shaped my attitude about agentic stuff, where it looks like we are still relying too heavily on the LLM and neglecting to mechanize things that could just work perfectly every time.
My point was it's silly to rely on a slow, expensive, unreliable system to do things you can do quickly and reliably with ten lines of Python.
I saw this in the Auto-GPT days. They tried to make GPT-4 (the non-agentic one with the 8k context window) use tool calls to do a bunch of tasks. And it kept getting confused and forgetting to do stuff.
Whereas if you just had
    for page in pages:
        summarize(page)
it works 100% of the time, can be parallelized etc.
And of course the best part is that the LLM itself can write that code, i.e. it already has the power to make up for its own weaknesses, and make (parts of itself) run deterministically.
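A minimal sketch of what I mean - call_llm() here is a hypothetical stand-in for whatever model API you're using, and the orchestration itself stays as plain Python:

    from concurrent.futures import ThreadPoolExecutor

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for your actual LLM API call.
        raise NotImplementedError

    def summarize(page: str) -> str:
        # The LLM is only used for the part it's actually good at.
        return call_llm(f"Summarize the following page:\n\n{page}")

    def summarize_all(pages: list[str]) -> list[str]:
        # The loop is deterministic, parallel, and never "forgets" a page.
        with ThreadPoolExecutor(max_workers=8) as pool:
            return list(pool.map(summarize, pages))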
---
On that note, do you know more about the environment they ran this thing in? I got API access (it's free on OpenRouter), but I'm not sure what to plug this into. OpenRouter provides a search tool, but the paper mentions intelligent context compression and all sorts of things.
I use it dozens of times per day, and typically follow up or ask refining questions within the thread if it’s not giving me what I need.
It typically takes between 10 seconds and 5 minutes, and mostly replicates my manual process - search, review results, another 1..N search passes, review, etc. Initially it rephrases/refines my query, then builds a plan, and this looks a lot like what I might do manually.
Then I can further interrogate the information returned with a vanilla LLM.
Besides I might give other large deep research models a try when needed.
I once had an idea of using something like a Qwen 4B or some other pre-trained model just to make a to-censor-or-not-to-censor decision, after the mecha-Hitler incidents. I thought that if there were some extremely cheap model that could detect harmful output which Grok's own models couldn't recognize, it would have been able to prevent the complete advertising disaster that happened.
What are your thoughts on it? I would love to see a Qwen 4B or something similar, or any small LLMs in general, if you or anyone is up to the challenge. I just want to know whether this idea fundamentally makes sense or not.
Another idea was to use it for routing purposes, similar to what ChatGPT does. I'm not so sure about that one now, but I still think it may be worth it. I had the routing idea before ChatGPT implemented it, so now that it has, we'll get more data and insights on whether it's good or worth it, which is nice.
You don't really need an entire LLM to do this - lightweight encoder models like BERT are great at sentiment analysis. You feed them an arbitrary string of text, and they return a confidence value from 0.0 to 1.0 that it matches the characteristics you're looking for.
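A minimal sketch with the Hugging Face transformers pipeline; the model named here is just the stock sentiment-analysis example, and you'd swap in (or fine-tune) a classifier for whatever "harmful or not" labels you actually care about:

    from transformers import pipeline

    # Small encoder-based classifier; swap in a model fine-tuned for your own labels.
    classifier = pipeline(
        "text-classification",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )

    result = classifier("This post seems fine to me.")[0]
    print(result["label"], round(result["score"], 3))  # e.g. POSITIVE 0.999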
    // Replace non-breaking spaces (U+00A0) with regular spaces in every text node.
    function replaceInTextNodes(node) {
      if (node.nodeType === Node.TEXT_NODE) {
        node.nodeValue = node.nodeValue.replace(/\u00A0/g, ' ');
      } else {
        node.childNodes.forEach(replaceInTextNodes);
      }
    }

    replaceInTextNodes(document.body);
The script is great!
Constraints are the fun part here. I know this isn't the 8x Blackwell Lamborghini, that's the point. :)
If you really do have a 2080ti with 128gb of VRAM, we'd love to hear more about how you did it!
It's slower than a rented Nvidia GPU, but usable for all the models I've tried (even gpt-oss-120b), and works well in a coffee shop on battery and with no internet connection.
I use Ollama to run the models, so can't run the latest until they are ported to the Ollama library. But I don't have much time for tinkering anyway, so I don't mind the publishing delay.
This comfortably fits FP8-quantized 30B models, which seem to be the "top of the line for hobbyists" grade across the board (rough sizing sketch below the spec list).
- Ryzen 9 9950X
- MSI MPG X670E Carbon
- 96GB RAM
- 2x RTX 3090 (24GB VRAM each)
- 1600W PSU
Of course this is in a single-user environment, with vLLM keeping the model warm.
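As a back-of-the-envelope check (my assumptions: roughly 1 byte per parameter at FP8, plus a rough guess for KV cache and runtime overhead):

    # Rough VRAM estimate for an FP8-quantized dense 30B model on 2x RTX 3090.
    params_b = 30            # billions of parameters
    bytes_per_param = 1.0    # FP8 is ~1 byte/parameter, ignoring scales and overhead

    weights_gb = params_b * bytes_per_param   # ~30 GB of weights
    kv_and_overhead_gb = 10                   # guess: KV cache, activations, CUDA overhead
    total_gb = weights_gb + kv_and_overhead_gb

    print(f"~{weights_gb:.0f} GB weights + ~{kv_and_overhead_gb} GB overhead "
          f"= ~{total_gb:.0f} GB vs 48 GB across the two 3090s")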
I think you mean RAM, not VRAM. AFAIK this is a 30B MoE model with 3B active parameters, comparable to the Qwen3 MoE model. If you don't expect 60 tps, such models should run sufficiently fast.
I run the Qwen3 MoE model (https://huggingface.co/unsloth/Qwen3-30B-A3B-GGUF/blob/main/...) in 4-bit quantization on an 11-year-old i5-6600 (32GB) and a Radeon 6600 with 8GB. According to a quick search your card is faster than that, and I get ~12 tps with 16k context on llama.cpp, which is OK for playing around.
My Radeon (ROCm) specific batch file to start this:
llama-server --ctx-size 16384 --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --device ROCm0 -ngl -1 --model /usr/local/share/gguf/Qwen3-30B-A3B-Q4_0.gguf --cache-ram 16384 --cpu-moe --numa distribute --override-tensor "\.ffn_.*_exps\.weight=CPU" --jinja --temp 0.7 --port 8080
This can end up getting you 128gb of VRAM for under $1000.
Get the biggest one that will fit in your VRAM.
(If nothing else, Tongyi is currently winning AI for cutest logo.)
The Chinese version of the link says "通义 DeepResearch" in the title, so the "agree" reading doesn't seem to be the case. Completely agree that it would be hilarious, though.
1: https://www.alibabacloud.com/en/solutions/generative-ai/qwen...
I switch between Gemini and ChatGPT whenever I feel one fails to fully grasp what I want; I do coding in Claude.
How are they supposed to become the 1 trillion dollar company they want to be, with strong competition and open source disruptions every few months?
Arguably, LLMs are both (1) far easier to switch between than it is today to switch between AWS / GCP / Azure systems, and (2) going to rapidly decrease the switching costs for porting your legacy systems to new ones - i.e., the whole business model of Oracle and the like.
Meanwhile, the whole world is building more chip fabs, data centers, AI software/hardware architectures, etc.
Feels more like we're headed to commodification of the compute layer more than a few giant AI monopolies.
And if true, that's actually even more exciting for our industry and "letting 100 flowers bloom".
The underlying architecture isn't special, and the underlying skills and tools aren't special.
There is nothing openAI brings to the table other than a willingness to lie, cheat, and steal. That only gives you an edge for so long.
The pattern is effectively long-running research tasks that drive a search tool. You give them a prompt, they churn away for 5-10 minutes running searches and they output a report (with "citations") at the end.
This Tongyi model has been fine-tuned to be really good at using its search tool in a loop to produce a report.
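A rough sketch of that loop, with hypothetical call_llm() and search_web() helpers standing in for the model API and the search tool (this is not the actual Tongyi scaffolding):

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for whatever chat-completions API you use.
        raise NotImplementedError

    def search_web(query: str) -> list[tuple[str, str]]:
        # Hypothetical stand-in for the search tool; returns (url, snippet) pairs.
        raise NotImplementedError

    def deep_research(question: str, max_rounds: int = 8) -> str:
        notes, citations = [], []
        for _ in range(max_rounds):
            query = call_llm(
                f"Question: {question}\nNotes so far: {notes}\n"
                "Reply with the next search query, or NONE if you have enough."
            )
            if query.strip() == "NONE":
                break
            results = search_web(query)
            citations += [url for url, _ in results]
            notes.append(call_llm(f"Summarize these results w.r.t. the question: {results}"))
        return call_llm(
            f"Write a report answering: {question}\nNotes: {notes}\nCite: {citations}"
        )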
So without specifying which model is being used, it's really hard to know what is better than something else, because we don't understand what the underlying model is, and if it's better because of the model itself, or the tooling, which feels like an important distinction.
https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b
https://openrouter.ai/alibaba/tongyi-deepresearch-30b-a3b:fr...
n-cpu-moe in https://github.com/ggml-org/llama.cpp/blob/master/tools/serv...
You are better off asking it to write a script that invokes itself N times across the task list.
LLMs are really bad at being comprehensive in general, and from one inference to the next their comprehensiveness varies wildly. Because LLMs are surprising the hell out of everyone with their abilities, less attention is paid to this; they can do a thing well, and for now that's good enough. As we scale usage, I expect this gap will become more obvious and problematic (unless it's solved in the model, like everything else).
A solution I’ve been toying with is something like a reasoning step, which could probably be done with mostly classical NLP, that identifies constraints up front and guides the inference to meet them. Like a structured output but at a session level.
I am currently doing what you suggest, though: I have the agent create a script which invokes … itself … until the constraints are met, but that obviously requires that I stay engaged there. I think it could be done autonomously, with at least much better consistency (at the end of the day even that guiding hand is inference-based and therefore subject to the same challenges).
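Roughly what I have in mind, as a sketch (run_agent() is a hypothetical single agent invocation; the constraint check is deliberately deterministic):

    def run_agent(task: str, feedback: str = "") -> str:
        # Hypothetical stand-in for one agent/LLM invocation.
        raise NotImplementedError

    def missing_items(output: str, checklist: list[str]) -> list[str]:
        # Deterministic check: which task-list items never show up in the output?
        return [item for item in checklist if item.lower() not in output.lower()]

    def run_until_complete(task: str, checklist: list[str], max_tries: int = 5) -> str:
        output, missing = "", checklist
        for _ in range(max_tries):
            output = run_agent(task, feedback=f"Previously missing: {missing}")
            missing = missing_items(output, checklist)
            if not missing:
                break
        return output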
For most plans, Deep Research is capped at around 20 sources, making it in many cases the least useful research agent - in particular, worse than a thinking-mode GPT-5 query.
I tied it together with Qwen3 30B thinking. It's very easy to get up and running, but lots of the default numbers are shockingly low - you need to boost iterations and context. It's especially easy if you already run SearXNG locally.
I haven't finished tuning the actual settings, but the detailed report takes ~20 minutes and so far has given pretty good results, similar to OpenAI's deep research. Mine often has ~100 sources.
But something I have noticed: it didn't seem to me that the model was important. The magic was more in the project itself - going deep with higher iterations and more results.
What is the state of AI in China? My personal feeling is that it doesn't dominate the zeitgeist there the way it does in the US, and yet, because of the massive amount of intellectual capital they have, just a small portion of their software engineering talent working on this is enough to go head to head with us, even though it takes only a fraction of their attention.
No, the reason you don't see many open source models coming from the rest-of-world (other than Mistral in France) is that you still need a ton of capital to do it. China can compete because the CCP used a combination of the Great Firewall and lax copyright/patent enforcement to implement protectionism for internet services, which is a unique policy (one that obviously came with massive costs too). This allowed China to develop home grown tech companies which then have the datacenters, capital and talent density to train models. Rest of world didn't do this and wasn't able to build up domestic tech industries competitive with the USA.
On matters of protectionism, the Great Firewall was the best thing China did. It protected them from the digital colonization that happened to the rest of the world.
Chinese labs are mostly (all?) privately funded, as far as I know. Alibaba isn't a SOE. That's why I didn't mention state subsidies, although that might be happening (and certainly is happening w.r.t. access to electricity).
I didn't mention lax copyright/patent enforcement in the context of AI, but rather, the prior years in which China was able to build up local tech firms capable of taking on the US tech firms. It's mostly in the past now, they don't need to do that stuff anymore.
China is also more willing to deploy AI apps that Americans would hesitate on, although I'm not sure I've seen much of it so far outside of Shenzhen cyberpunk clips. Let's see how this plays out in a decade.
All this to ask the question: if I host these open-source models locally, how is the user-interface layer implemented - the part that remembers and picks the right data from my previous sessions, the agentic automation, and so on? Do I have to build it myself, or are there free options for that?