There are tools for that, sandboxing, chroots, etc... but that requires engineering and it slows GTM, so it's a no-go.
No, local models won't help you here, unless you block them from the internet or setup a firewall for outbound traffic. EDIT: they did, but left a site that enables arbitrary redirects in the default config.
Fundamentally, with LLMs you can't separate instructions from data, which is the root cause for 99% of vulnerabilities.
Security is hard man, excellent article, thoroughly enjoyed.
This is the only way. There has to be a firewall between a model and the internet.
Tools which hit both language models and the broader internet cannot have access to anything remotely sensitive. I don't think you can get around this fact.
Sandboxing your LLM but then executing whatever it wants in your web browser defeats the point. CORS does not help.
Also, the firewall has to block most DNS traffic, otherwise the model could query `A <secret>.evil.com` and Google/Cloudflare servers (along with everybody else) will forward the query to evil.com. Secure DNS, therefore, also can't be allowed.
katakate[1] is still incomplete, but something that it is the solution here. Run the LLM and its code in firewalled VMs.
Of course, everything by Google they will still allow.
My favourite firewall bypass to this day is Google translate, which will access arbitrary URL for you (more or less).
I expect lots of fun with these.
"well, here's the user's SSH key and the list of known hosts, let's log into the prod to fetch the DB connection string to test my new code informed by this kind stranger on prod data".
This isn't a problem that's fundamental to LLMs. Most security vulnerabilities like ACE, XSS, buffer overflows, SQL injection, etc., are all linked to the same root cause that code and data are both stored in RAM.
We have found ways to mitigate these types of issues for regular code, so I think it's a matter of time before we solve this for LLMs. That said, I agree it's an extremely critical error and I'm surprised that we're going full steam ahead without solving this.
I don't see us solving LLM vulnerabilities without severely crippling LLM performance/capabilities.
What I meant, that at the end of the day, the instructions for LLMs will still contain untrusted data and we can't separate the two.
- A) Process untrustworthy input - B) Have access to private data - C) Be able to change external state or communicate externally.
It's not bullet-proof, but it has helped communicate to my management that these tools have inherent risk when they hit all three categories above (and any combo of them, imho).
[EDIT] added "or communicate externally" to option C.
[1] https://simonwillison.net/2025/Nov/2/new-prompt-injection-pa... [2] https://ai.meta.com/blog/practical-ai-agent-security/
It's great start, but not nearly enough.
EDIT: right, when we bundle state with external Comms, we have all three indeed. I missed that too.
> Gemini exfiltrates the data via the browser subagent: Gemini invokes a browser subagent per the prompt injection, instructing the subagent to open the dangerous URL that contains the user's credentials.
fulfills the requirements for being able to change external state
EDIT: In other words, the LLM didn't change any state it has access to.
To stretch this further - clicking on search results changes the internal state of Google. Would you consider this ability of LLM to be state-changing? Where would you draw the line?
I should have included the full C option:
Change state or communicate externally. The ability to call `cat` and then read results would "activate" the C option in my opinion.
They pinky promised they won’t use something, and the only reason we learned about it is because they leaked the stuff they shouldn’t even be able to see?
So more of a Gemini initiated bypass of it's own instructions than malicious Google setup.
Gemini can't see it, but it can instruct cat to output it and read the output.
Hilarious.
They forgot about a service which enables arbitrary redirects, so the attackers used it.
And LLM itself used the system shell to pro-actively bypass the file protection.
Agents often have some DOM-to-markdown tool they use to read web pages. If you use the same tool (via a "reader mode") to view the web page, you'd be assured the thing you're telling the agent to read is the same thing you're reading. Cursor / Antigravity / etc. could have an integrated web browser to support this.
That would make what the human sees closer to what the agent sees. We could also go the other way by having the agent's web browsing tool return web page screenshots instead of DOM / HTML / Markdown.
I'm hoping they've changed their mind on that but I've not checked to see if they've fixed it yet.
I ma hearing again and again by collegues that our jobs are gone, and some are definitely going to go, thankfully I'm in a position to not be too concerned with that aspect but seeing all of this agentic AI and automated deployment and trust that seems to be building in these generative models from a birds eye view is terrifying.
Let alone the potential attack vector of GPU firmware itself given the exponential usage they're seeing. If I was a state well funded actor, I would be going there. Nobody seems to consider it though and so I have to sit back down at parties and be quiet.
https://techcrunch.com/2025/11/23/ai-is-too-risky-to-insure-...
I know it is only one more step, but from a privilege perspective, having the user essentially tell the agent to do what the attackers are saying, is less realistic then let’s say a real drive-by attack, where the user has asked for something completely different.
Still, good finding/article of course.
You're telling the agent "implement what it says on <this blog>" and the blog is malicious and exfiltrates data. So Gemini is simply following your instructions.
It is more or less the same as running "npm install <malicious package>" on your own.
Ultimately, AI or not, you are the one responsible for validating dependencies and putting appropriate safeguards in place.
Nondeterministic systems are hard to debug, this opens up a threat-class which works analogously to supply-chain attacks but much harder to detect and trace.
> Given that (1) the Agent Manager is a star feature allowing multiple agents to run at once without active supervision and (2) the recommended human-in-the-loop settings allow the agent to choose when to bring a human in to review commands, we find it extremely implausible that users will review every agent action and abstain from operating on sensitive data.
It's more of a "you have to anticipate that any instructions remotely connected to the problem aren't malicious", which is a long stretch.
> However, the default Allowlist provided with Antigravity includes ‘webhook.site’.
It seems like the default Allowlist should be extremely restricted, to only retrieving things from trusted sites that never include any user-generated content, and nothing that could be used to log requests where those logs could be retrieved by users.
And then every other domain needs to be whitelisted by the user when they come up before a request can be made, visually inspecting the contents of the URL. So in this case, a dev would encounter a permissions dialog asking to access 'webhook.site' and see it includes "AWS_SECRET_ACCESS_KEY=..." and go... what the heck? Deny.
Even better, specify things like where secrets are stored, and Antigravity could continuously monitor the LLM's to halt execution if a secret ever appears.
Again, none of this would be a perfect guarantee, but it seems like it would be a lot better?
Avoiding secrets appearing directly in the LLM's context or outputs is trivial, and once you have the workaround implemented it will work reliably. The same for trying to statically detect shell tool invocations that could read+obfuscate a secret. The only thing that would work is some kind of syscall interception, but at that point you're just reinventing the sandbox (but worse).
Your "visually inspect the contents of the URL" idea seems unlikely to help either. Then the attacker just makes one innocous-looking request to get allowlisted first.
I mean regardless of how you feel about AI, we can all agree that security is still a concern, right? We can still move fast while not pushing out alpha software. If you're really hyped on AI then aren't you concerned that low hanging fruit risks bringing it all down? People won't even give it a chance if you just show them the shittest version of things
All the AI companies are aware of this and are pressing ahead anyway - it is completely irresponsible.
If you haven’t come across it before, check out Simon Willisons “lethal trifecta” concept which neatly sums up the issue and explains why there is no way to use these things safely for many of the things that they would be most useful for
Edit: "completely local" meant not doing any network calls unless specifically approved. When llm calls are completely local you just need to monitor a few explicit network calls to be sure. Unlike gemini then you don't have to rely on certain list of whitelisted domains.
>Please don't comment on whether someone read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that".
I've worked on multiple large migrations between DCs and cloud providers for this company and the best thing we've ever done is abstract our compute and service use to the lowest common denominator across the cloud providers we use...
People are always going to want the best models.
The most RAM you can currently get in a MacBook is 128 gigs, I think, and that's a pricey machine, but it could run such a model at 4-bit or 5-bit quantization.
As time goes on it only gets cheaper, so yes this is possible.
The question is whether bigger and bigger models will keep getting better. What I'm seeing suggests we will see a plateau, so probably not forever. Eventually affordable endpoint hardware will catch up.
I've watched this with GPT-OSS as well. If the tool blocks something, it will try other ways until it gets it.
The LLM "hacks" you.
"A computer can never be held accountable; therefore, a computer must never make a management decision."
The main problem is that LLMs share both "control" and "data" channels, and you can't (so far) disambiguate between the two. There are mitigations, but nothing is 100% safe.
Should you do that? Maybe not, but people will keep doing that anyway as we've seen in the era of StackOverflow.