Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks
185 points
11 hours ago
| 22 comments
| github.com
| HN
Hi HN, I'm Antoine Zambelli, AI Director at Texas Instruments.

I built Forge, an open-source reliability layer for self-hosted LLM tool-calling.

What it does:

- Adds domain-and-tool-agnostic guardrails (retry nudges, step enforcement, error recovery, VRAM-aware context management) to local models running on consumer hardware

- Takes an 8B model from ~53% to ~99% on multi-step agentic workflows without changing the model - just the system around it

- Ships with an eval harness and interactive dashboard so you can reproduce every number

I wanted to run a handful of always-on agentic systems for my portfolio, didn't want to pay cloud frontier costs, and immediately hit the compounding math problem on local models. 90% per-step accuracy sounds great, but with a 5-step workflow that's a 40% failure rate. No existing framework seemed to address this mechanical reliability issue - they all seemed tailor-made for cloud frontier.

Demo video: https://youtu.be/MzRgJoJAXGc (side-by-side: same model, same task, with and without Forge guardrails)

The paper (accepted to ACM CAIS '26, presenting May 26-29 in San Jose) covers the peer-reviewed findings across 97 model/backend configurations, 18 scenarios, 50 runs each. Key numbers:

- Ministral 8B with Forge: 99.3%. Claude Sonnet with Forge: 100%. The gap between a free local 8B model on a $600 GPU and a frontier API is less than 1 point.

- The same 8B local model with Forge (99.3%) outperforms Claude Sonnet without guardrails (87.2%) - an 8B model with framework support beats the best result you can get through frontier API alone.

- Error recovery scores 0% for every model tested - local and frontier - without the retry mechanism. Not a capability gap, an architectural absence.

I'm currently using this for my home assistant running on Ministral 14B-Reasoning, and for my locally hosted agentic coding harness (8B managed to contribute to the codebase!).

The guardrail stack has five layers, each independently toggleable. The two that carry the most weight (per ablation study with McNemar's test): retry nudges (24-49 point drops when disabled) and error recovery (~10 point drops, significant for every model tested). Step enforcement is situational - only fires for models with weaker sequencing discipline. Rescue parsing and context compaction showed no significance in the eval but are retained for production workloads where they activate once in a while.

One thing I really didn't expect: the serving backend matters. Same Mistral-Nemo 12B weights produce 7% accuracy on llama-server with native function calling and 83% on Llamafile in prompt mode. A 75-point swing from infrastructure alone. I don't think anyone's published this because standard benchmarks don't control for serving backend.

Another surprise: there's no distinction in current LLM tool-calling between "the tool ran successfully and returned data" and "the tool ran successfully but found nothing." Both return a value, the orchestrator marks the step complete, and bad data cascades downstream. It's the equivalent of HTTP having 200 but no 404. Forge adds this as a new exception class (ToolResolutionError) - the model sees the error and can retry instead of silently passing garbage forward.

Biggest technical challenge was context compaction for memory-constrained hardware. Both Ollama and Llamafile silently fall back to CPU when the model exceeds VRAM - no warning, no error, just 10-100x slower inference. Forge queries nvidia-smi at startup and derives a token budget to prevent this.

How to try it:

- Clone the repo, run the eval harness on a model I haven't tested. If you get interesting results I'll add them to the dashboard.

- Try the proxy server mode - point any OpenAI-compatible client at Forge and it handles guardrails transparently. It's the newest model and I'd love more eyes on it.

- Dogfooding led me to optimize model parameters in v0.6.0. The harder eval suite (26 scenarios) is designed to raise the ceiling so no one sits at 100%. Several that did on the original suite can't sweep it - including Opus 4.6. Curious if anyone finds scenarios that expose gaps I haven't thought of. Paper numbers based on pre v0.6.0 code.

Background: prior ML publication in unsupervised learning (83 citations). This paper accepted to ACM CAIS '26 - presenting May 26-29.

Repo: https://github.com/antoinezambelli/forge

Paper: https://www.caisconf.org/program/2026/demos/forge-agentic-re... https://github.com/antoinezambelli/forge/blob/main/docs/forg...

Dashboard: https://github.com/antoinezambelli/forge/docs/results/dashbo...

Escapade5160
1 hour ago
[-]
I've been saying for a while that given a proper harness, small local models can perform incredibly well. When you have a system that can try everything, it will eventually get it right as long as you can prevent it from getting it wrong in the meantime.
reply
zambelli
1 hour ago
[-]
Lol, I love that framing. Yeah, the small models have impressed me a lot during this work. The reasoning can be quite good, and definitely sufficient for a lot of cases. Just gotta nudge em back on track Every now and then and they'll figure it out.
reply
koolba
23 minutes ago
[-]
A thousand monkeys on a thousand typewriters…
reply
cornholio
54 minutes ago
[-]
If I understood correctly, the model will get it right because it knows when it isn't right.
reply
zambelli
53 minutes ago
[-]
Essentially, yes that's right! There's some subtlety in how to let it know it was wrong (returning things as tool errors because it trained on that), but that's the gist of it - sort of a self-correcting architecture.
reply
tim-projects
6 minutes ago
[-]
I've been working on the same thing and even nearly called it forge. Instead I called it hammer.

I'll be keen to look through the code on this!

reply
88j88
29 minutes ago
[-]
Something very similar I was experimenting with on, but had different results that you may be interested in, some of my findings were interesting

This was part of testing out how well a tool of mine worked (github.com/jsuppe/loom), which aims to be used to extracts requirements, specs, creates tests. At first I had no intention of using it for code generation but then tried it out with some early success. I tried splitting the work by using the tool with different frontier models, and then providing work to a local ollama instance running one of several models. Not all local models had the same outcome, not all coding languages had the same outcome. I also found in this experiment, when nailing down the coding tasks I wanted to set up positive and negative scenarios- which is where I found setting guardrails can sometimes backfire with inversion- this essentially elaborates on previous work by Khan 2025 (https://arxiv.org/abs/2510.22251); the most interesting finding to me was that if you give guardrails with a rationale, it reduces compliance and may cause the inversion

For coding tasks I found that the improvement was not only ability to use a lower cost model for these broken down tasks, but wall clock time was improved over using frontier model alone, with equivalent outcomes.

reply
zambelli
5 minutes ago
[-]
I've had a few reversions as well along the way, including in upcoming v0.7.0 patch. Some models benefitted, others regressed - overall better on harder scenarios or I wouldn't be releasing, but yeah - not intuitive.

The biggest challenge has been balancing the desire to hyper optimize for my favorite models, versus average behavior, versus consumer needs.

reply
jf
3 hours ago
[-]
Tangentially related: Since you are at Texas Instruments, I wonder if you could find out what the status is of the intellectual property for the TI Explorer lisp machines. I know who owns the IP for Genera, but wasn’t able to find out about TI’s lisp OS
reply
zambelli
3 hours ago
[-]
Very tangential! I'll try but it might take me a while.
reply
user3939382
12 minutes ago
[-]
Who owns the IP for Genera?
reply
tempoponet
21 minutes ago
[-]
Why this entire tool chain instead of building within something like pi code?

I've been exploring this area and a project like https://github.com/itayinbarr/little-coder (not my work) lets me mix and match with my current setup or any plugins built for pi.

reply
azurewraith
1 hour ago
[-]
Interestingly enough we have found the same net result -- structural guardrails are the unlock for smaller models. Our approach in particular layers three things: a parse rescue for malformed/incorrect tool calls (similar to your retry nudges), content-level intervention (diff size rejection, checkpoint forcing) and state machine enforcement on top (per-phase tool restriction, transition guards). On 13B models we saw completion of a selection of SWE-bench tasks went from ~20% to 100%. With frontier models we saw a reduction in API calls from reduced thrashing.

One of the most surprising findings was when a 9B model self-corrected through 4 tool parse failures within the guard rails. It tried to use a complex tool (patch_file), kept failing and eventually downshifted to a simpler tool (edit_line) that it could actually execute. The guardrails didn't make the model smarter, it just narrowed the execution space until it could find something that worked.

Brief: https://statewright.ai/research

reply
zambelli
1 hour ago
[-]
Nice! I'm not surprised at your findings (anymore). Mechanical reliability is the key to small models, and it's a big unlock. I've seen the same thing you just described. And the agnostic nudges forge sends at inspired by exactly that. Just show the model how it failed, gracefully, and it'll likely figure a way out of it itself.

Forge doesn't have a SWE-specific eval, but I've built a custom coding harness (not public yet but maybe soon) built on forge and saw the same behavior you seem to have seen in agentic coding.

reply
6r17
1 hour ago
[-]
Very cool work ! I'm running harness system myself and could measure improvement of token use of 2x to 10x on gsm8k only by running a math harness - i'm confident the future is bright for people who will know how to sell tech that is appropriately scaled to one's need. We absolutely do not need to run Claude 123 for most tasks and we better prepare for the rag-pull !
reply
bglusman
1 hour ago
[-]
Funny timing. I’ve been building something adjacent, though from a different angle: not primarily local-model reliability, but a control layer around agent execution, tools, routing, and operator intent. I was calling these "synthetic models", but decided yesterday "LLM middleware" is a clearer description.

Very early prototype, so I’m looking more for architectural/conceptual reactions than polish: https://wardwright.dev / https://github.com/bglusman/wardwright

The common thread I see is treating the harness around the model as first-class infrastructure. Forge seems focused on tool-call correctness and recovery; Wardwright is more about controlling what the agent is supposed to do, where work gets routed, and how the operator stays in the loop.

Curious whether you see those as complementary layers. I’m planning to try Forge and would be interested in seeing whether they fit together cleanly.

reply
bglusman
14 minutes ago
[-]
Ironically, the project this idea emerged out of for me is also called Forge, actually Calciforge… https://calciforge.org / https://github.com/bglusman/calciforge

Name was just a portmanteau of Calcifer's forge, because Howl’s moving castle seemed like a good metaphor for what I was trying to do… I had synthetic models as apiece there but I realized a) it was out of place and b) it was my favorite feature there

reply
zambelli
1 hour ago
[-]
Conceptually I think definitely! Forge has no opinion on what the agent should be trying to do, that's the "middleware"'s job, so to speak.

Forge is just trying to make sure that when the model decides to do something, thee execution is reliable.

As for software integration, let me know if you run into any issues and I'll be happy to take a look or try to patch something!

Harnesses as first class infra all the way. I'll take a look at your work and see if I spot any obvious tensions.

reply
Aleesha_hacker
2 hours ago
[-]
Impressive work, love seeing tools that boost local LLM reliability without touching the model itself
reply
zambelli
2 hours ago
[-]
Thank you! It was a really fun rabbit hole to fall into and I found a bunch of counterintuitive stuff.

I'm in the same boat, tuning models wasn't super interesting, though I might do a focused spike on behavior -focused fine tuning. But the harness matters almost more than the model in many cases.

reply
_pdp_
1 hour ago
[-]
Maybe I am reading it wrong but I don't think this does what it claim it does or at least how it sounds.

Basically this is a tool auto-complete that has a workflow element to it with certain steps that need to happen in certain order. In other words the order is defined in advance. Am I correct?

Basically execute step 1 first, then step 2 and finally step 3 and this is the schema for each step. That is effectively the guardrail and there is retry logic.

If it is the case, this is obviously useful but in a very specific set of problems where the solution is kind of known in advance. A workflow automation might work but this is kind of N8N where each step is LLM step.

Anyway, I might me wrong but I wanted to share a few thoughts.

reply
zambelli
1 hour ago
[-]
Partially correct, but an important distinction to call out.

You don't have to define the workflow steps. You can just expose the set of tools to the model and let the LLM call whatever it wants in any order, and every guardrail except the prerequisite step enforcement is still there to help.

If your workflow does have step enforcement, that can also be conditional. For example like Claude code does read required before edit. You can define a conditional enforcement where the agent must have called read before edit, and even force the same file path. That doesn't mean the model has to call edit at all...

But maybe I could have been clearer in the docs on the workflow pieces.

reply
_pdp_
1 hour ago
[-]
The docs should start with that with a very clean explanation how it works. Basically first paragraph. :)

Otherwise you should expect churn.

But also it should really go into some detail how is this different from tool calls with type enforcement on expected parameters.

reply
zambelli
1 hour ago
[-]
That's good feedback, thank you! I have an update landing shortly so I'll make sure to clarify in the docs! I appreciate it!
reply
jamesponddotco
1 hour ago
[-]
This seems pretty awesome; being able to use an 8B model for tool calling would be perfect.

Interested in using this for Home Assistant using a Mac Mini as my server. Does it run on MacOS?

How is the latency when using the proxy? I’m using Claude Haiku 4.5 for my voice assistant right now and it’s pretty fast, but if I could keep the LLM local, it’d be even better.

reply
zambelli
1 hour ago
[-]
I have an open GitHub issue for macOS hardware detection. I don't have a Mac myself to do dev on but happy to accept a fork! I did assign a buddy to that issue but she's been slacking - call her out :p.

Latency is dependent on the guardrails firing, effectively. If nothing fires, it's a passthrough, for all intents and purposes, very little overhead. But if a retry nudge fires then that's another LLM call.

As a consumer for a home assistant, a retry nudge firing is something I'd catch, and have my voice model output a pre-baked "one sec, trying again" sort of filler message or something.

reply
nzeid
1 hour ago
[-]
> # External mode — you manage llama-server, forge proxies it

> python -m forge.proxy --backend-url http://localhost:8080 --port 8081

This is a good example because I've currently stuck with llama.cpp's UI. I can read your code (or throw Gemma at it =p ) but thought I'd ask anyway.

In this example, what is it exactly that your proxy is fortifying? The HTTP SSE requests? (Those would be `/chat/completions`.)

reply
zambelli
1 hour ago
[-]
Yes that's correct !

/v1/chat/completions is the entry point.

In proxy mode, here's what forge applies on each request (handler.py builds these):

Response validation: ResponseValidator(tool_names) checks each tool call against the declared tools array. If the model emits a call to a name not in tools[], or a malformed call shape, it's caught before the response goes back.

Rescue parsing: When the model emits tool calls in the wrong format — JSON in a code fence, [TOOL_CALLS]name{args} (Mistral), <tool_call>...</tool_call> (Qwen XML) — rescue parsers extract the structured call and re-emit it in the canonical OpenAI tool_calls schema. This is the biggest practical lift, especially on Mistral-family models that ignore native FC and emit their own bracket syntax.

Retry loop with error tracking: ErrorTracker(max_retries=N) — if validation fails, forge retries inference up to N times with a corrective tool-result message on the canonical channel, rather than returning a malformed response to your caller. From your perspective the proxy looks like a single request that just took a few extra ms.

What proxy mode does NOT do (because it's single-shot, not multi-turn): prerequisite/step enforcement (those need a workflow definition spanning turns), context compaction, session memory. For that surface you wrap the WorkflowRunner class in Python — proxy mode trades that depth for "use forge with your existing setup, no Python rewrite."

So yes — the proxy is fortifying the response shape and retry behavior of /v1/chat/completions. The full agentic guardrails are at the Python class level above it.

For greenfield projects, I've been building on forge native using WorkflowRunner so I get all guardrails. But obviously as a drop-in replacement in existing systems then proxy is the way to go.

reply
cyanydeez
1 hour ago
[-]
the funniest thing I see in opencode with tool calling is the model calls 10.0 and opencode says it's an error because the spec is an integer, even though it's obvious to anyone that if a float can be coerced properly to a integer, then that should be a success.
reply
zambelli
1 hour ago
[-]
Yeah it's a delicate balance between precise and silly, and too permissive.

I'm definitely still iterating on forge, but so far sending the model a friendly and gracefully handled error message works wonders (instead of barfing a stack trace or something).

reply
__mharrison__
1 hour ago
[-]
Curious if this would help larger local models? Qwen 3.6 varieties of deepseek4?
reply
zambelli
1 hour ago
[-]
Yes it does! I haven't published those evals yet, but I'm actually running 24-35B class models on a custom coding harness built on forge (even 120B class recently).

I just need more GPU wall clock time to get more evals done. ETA is...a few weeks? Got distracted by the coding harness.

But the results are the same. Reforged models do better than bare, even at those sizes. As for published results, I ran forge on Anthropic models and reforged doe better than bare for them as well :)

reply
trollbridge
49 minutes ago
[-]
Exactly what I was thinking - even on frontier or near-frontier models I still see my agents get stuck in these pointless loops where it's very obvious to me what they need to do to get "unstuck".
reply
happycube
1 hour ago
[-]
If it's worth it to you, you could try running it on Deepseek v4 flash which is very cheap right now...
reply
lucrbvi
1 hour ago
[-]
How does this differ from dottxt's Outlines[0] on the technical level? Are you using some JSON grammar to force the LM head distribution to follow it?

[0]: https://github.com/dottxt-ai/outlines

reply
zambelli
1 hour ago
[-]
I only just skimmed it, but will try to dive deeper in a bit.

I think we share a lot on tool definitions/schemas. Forge will let a consumer define a tool, set of tools, pydantic schema for each, etc. outlines seems to be similar with their task definition.

I think where we differ is what happens when that doesn't work...and the model still doesn't get the contract right. Something like a pydantic-valid string path for glob, that points to a non-existent thing. Glob will error, forge catches, and nudges the model. Forge does very little model output manipulation (just a basic regex parse to try to find json/XML), the core of it is in the retry mechanisms.

Once I dig into it more I'll try to highlight other deltas.

reply
tommica
3 hours ago
[-]
What are "guardrails" in this context? Is it correctly understood that this would sit between my pi agent and llama-server, and it would do what exactly?
reply
zambelli
3 hours ago
[-]
It would help ensure that the model executes its tool call correctly. So if you give Pi a task like booking travel... Pi decides to book a flight, hotel, car. It gets the flight in one go, but then sends "here is the payload : [json blob]" to hotel booking API and the whole thing throws an error and the workflow dies, with partial completion. Forge would catch the error and nudge the model by injecting a message into the conversation history, with a helpful error message "You replied with text, you must call a tool", the model reads it, and submits a tool call.

Big frontier models need this less than small models.

reply
mholubowski
2 hours ago
[-]
Hey I'm really impressed and hoping to connect. I followed you on X just now, is that a decent place to shoot you a DM? I don't want anything from you, we just seem to be working on similar things (I'm working on our internal agent harness here, at a healthcare startup).
reply
zambelli
2 hours ago
[-]
Neat! Historically I've been most active on LinkedIn but the AI community seems very X-leaning so I'll make sure to pay closer attention there. Good luck with the harness, happy to connect!
reply
dpweb
3 hours ago
[-]
Hello. Interesting project! Haven't gone through it yet, but want to consider using this in my CS master's capstone. While you have benchmarks I may create my own specific scenarios and comparisons vis-a-vis hosted inference to highlight specific economic benefit. Any suggestions?
reply
zambelli
3 hours ago
[-]
Very cool! I would look at the tokens returned by each of the calls. You can map those to API costs per input/output tokens. Forge should be capturing those (or can, as passthrough from llama.cpp).

At least, if I understand your economic benefit angle correctly.

For scenarios to get inspired by I'd look at those tagged "model_quality" or "advanced_reasoning".

reply
zambelli
4 hours ago
[-]
Happy to answer questions about the eval methodology, the backend findings, or anything in the repo. I'll be around.
reply
schaefer
2 hours ago
[-]
super interesting work. It will take me a few days to dig in and really understand it. But I'm looking forward to it.

I run small models at home, so I'm very curious.

reply
zambelli
1 hour ago
[-]
That's awesome! Let me know if quick start is causing issues or anything else you'd like to dig into.

Out of curiosity, what models are you running?

reply
fabian_shipamax
3 hours ago
[-]
dashboard link is dead
reply
zambelli
3 hours ago
[-]
reply
schaefer
3 hours ago
[-]
yes, that link works for me.
reply
rebekkamikkoa
2 hours ago
[-]
Hi Antoine!

Interesting point about backend variance. Do you think serving layer should become part of standard LLM eval reporting?

reply
zambelli
1 hour ago
[-]
Hi! Yes, I definitely think so. I've seen variance across all model families I looked at. The magnitude changes, but the presence of variance is a constant.
reply
xiaod
3 hours ago
[-]
I'd be curious about the eval methodology. In production coding tasks, the gap between benchmark scores and actual workflow integration can be significant. What does the error recovery loop look like?
reply
zambelli
3 hours ago
[-]
Absolutely, benchmarks are a different breed. Forge's eval is deliberately scoped as a stress test of the recovery loop, not a measure of end-to-end agentic quality.

Scenarios range from basic 2-step workflows, to more complex ones with dead ends, breadcrumbs, misleading names.

Concrete example: Task: get, analyze and report on Q3 sales data.

Model emits: analyze_sales(quarter="Q3"). This skipped the fetch step. Forge's response validator catches it before the tool function runs. Instead of letting the bad call hit the real impl (which would error or hallucinate), forge replies on the canonical tool-result channel.

We send this to the model: tool_result: [PrereqError] analyze_sales requires fetch_sales_data to be called first. Available next steps: fetch_sales_data

Model emits a corrected fetch_sales_data(...) on the next turn.

Three enforcement paths use this same channel: prerequisite violations, premature terminal calls, unknown-tool retries.

We also have rescue parsing for known templates (Jason OpenAI style, XML like granite, etc) where we try to parse tool calls that might be malformed.

And lastly bare text response nudges. Small models love to chat, we need them to call tools!

reply
k__
3 hours ago
[-]
So, this basically ensures that models call the right tools with the correct format?
reply
zambelli
3 hours ago
[-]
In a nutshell, yes. It tries to anyways, but at the end of the day, some models get stuck and you hit a max iterations error that forge will raise, with some context, and the consumer can choose what it wants to do at that point.
reply
k__
3 hours ago
[-]
Ah, so it a "smart" retry mechanism?
reply
zambelli
3 hours ago
[-]
I'd like to think so! ;). It has some brains, but the key insight was to send the model domain-agnostic nudges. I don't need to know what you're trying to do, the LLM already knows, I just need to nudge it back on the structural side: text response vs tool call, arg mismatch, etc. and let its knowledge of the context fill in the blanks (otherwise I'd need a massive library of every possible failure mode).

The other insight was doing it at tool call level and not workflow level, which addresses the compounding math problem more directly.

reply
jimmySixDOF
2 hours ago
[-]
Maybe similar to Instructor [1] which was a cool tool for json and structured output enforcement combining pydandic with ai retry loops very handy for when models don't have that covered

[1] https://github.com/567-labs/instructor

reply
zambelli
1 hour ago
[-]
Interesting! I'll look into that. Would mean another dep/integration but might be more robust.
reply
snovv_crash
2 hours ago
[-]
I get a strong LLM smell in your description. If you couldn't bother to write it, why should I bother to read it?
reply
zambelli
2 hours ago
[-]
I definitely use LLMs to help write things - but this is my draft!

Maybe I've been spending too much time reading the evals and I now sound like an LLM...

Either way, here I am - happy to answer any questions!

reply
snovv_crash
2 hours ago
[-]
I guess it's that, and yes, much as they learned speech patterns from us, now we start to learn from them.

I play with local models a lot but also have limited time and the conciseness, polish and human indication in presentation has become a major quality indicator. I've wasted too much time with slop projects or people's LLM-induced delusions and now take a pretty strict line on what I'm willing to spend my time on. Even if this ends up with some false positives, there's just so much happening these days it doesn't really matter...

Best of luck with Forge!

reply
throwaway20222
2 hours ago
[-]
If you are so outright against using AI, why would he care if you read his article about AI?
reply
snovv_crash
2 hours ago
[-]
AI usage is great. The problem is the asymmetry in effort between generating text automatically, and then further amplifying this via posting it, while then expecting human eyeballs to spend the time reading it. It is antisocial.

If you're generating AI text you shouldn't expect humans that you aren't paying to bother reading it, purely out of politeness. Brian Cantrill has a great piece on this: https://rfd.shared.oxide.computer/rfd/0576

reply