I would expect, over the medium term, agent platforms to trounce un-augmented human testing teams in basically all the "routinized" pentesting tasks --- network, web, mobile, source code reviews. There are too many aspects of the work that are just perfect fits for agent loops.
Just registering the prediction.
$ grep -oP '\p{Emoji}' vulns.md | wc -l
# if >10 then was_created_by_agent = true
Whereas finding novel exploits would still be the domain of human experts?
With more specificity: I would not be at all surprised if the "industry standard" netpen was 90%+ agent-mediated by the end of this year. But I also think that within the next 2-3 years, that will be true of web application testing as well, which is in a sense a limited (but important and widespread) instance of "novel vulnerability" discovery.
However, I currently believe that forensic investigations will change post-LLM, because LLMs are very good at translating arbitrary bytecode, assembly, NASM/Intel asm, etc. syntax into example code (in any language). It doesn't have to be 100% correct in those translations; that's why LLMs can be really helpful for the discovery phase after an incident. Check out the Ghidra MCP server, which is insane to see in real time [2]
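To give a flavor of that translation step, here is a minimal sketch, assuming the OpenAI Python client as a stand-in for whatever model or Ghidra integration you actually use; the disassembly snippet and model name are placeholders:

```python
# Sketch: ask a model to lift a disassembly snippet into rough, commented C.
# Assumes the OpenAI Python client; the snippet and model name are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

disasm = """
mov eax, [rbp-0x8]
imul eax, eax
add eax, 0x2a
ret
"""

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "Translate this x86-64 disassembly into equivalent, commented C. "
                    "It is for incident triage, so approximate is fine."},
        {"role": "user", "content": disasm},
    ],
)
print(resp.choices[0].message.content)
```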
These kinds of things are very hard for LLMs because they tend to forget way too much important information about both the code (in the branching sense) and the program (in the memory sense).
I can't provide a schematic for this, but it's pretty common in binary exploitation CTF events, and it's kind of mandatory knowledge for exploit development.
I listed some nice CTFs we did with our group in case you wanna know more about these things [1]. In regard to LLMs and this bypass/side-channel attack topic, I'd refer to the Fusion CTF [2] specifically, because it covers a lot of examples.
I think people are coming to this with the idea that a pentesting agent is pulling all its knowledge of vulnerabilities and testing patterns out of its model weights. No. The whole idea of a pentesting agent is that the agent code --- human-mediated code that governs the LLM --- encodes a large amount of knowledge about how attacks work.
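To make that concrete, here's a toy sketch (not any particular product's design; all names and payloads are made up) of what knowledge-in-the-scaffold looks like: the probes and workflow are ordinary code, and the model is only asked to interpret the evidence:

```python
# Toy pentest-agent scaffold: the attack knowledge (probes, error signatures,
# workflow) lives in ordinary code; the LLM only interprets the responses.
# All names and payloads here are illustrative, not a real product's design.
import requests

SQLI_PROBES = ["'", "' OR '1'='1", "1;--"]
SQL_ERROR_SIGNS = ["syntax error", "SQLSTATE", "ORA-00933"]

def probe_sqli(url: str, param: str) -> list[dict]:
    """Deterministic scaffold logic: send probes, collect suspicious responses."""
    findings = []
    for payload in SQLI_PROBES:
        r = requests.get(url, params={param: payload}, timeout=10)
        if any(sign in r.text for sign in SQL_ERROR_SIGNS):
            findings.append({"payload": payload,
                             "status": r.status_code,
                             "evidence": r.text[:200]})
    return findings

def ask_model(prompt: str) -> str:
    """Placeholder for an actual LLM call (e.g. a chat-completions request)."""
    raise NotImplementedError

def triage(findings: list[dict]) -> str:
    """Only this step leans on the model: judging and writing up the evidence."""
    return ask_model(f"Do these responses indicate SQL injection? {findings}")
```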
The former is already largely automated with fuzz testing of all kinds, so you wouldn't need an LLM if you knew what you were doing and had a TDD workflow or similar that checks against memleaks (say, with valgrind or similar approaches).
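For instance, a deterministic leak gate in CI needs no model at all; a minimal sketch, assuming valgrind is installed and ./tests is a hypothetical test binary:

```python
# Minimal CI leak gate: run the test binary under valgrind, fail the build on leaks.
# Assumes valgrind is installed; ./tests is a hypothetical test binary.
import subprocess
import sys

result = subprocess.run(
    ["valgrind", "--leak-check=full", "--error-exitcode=42", "./tests"],
    capture_output=True, text=True,
)
if result.returncode == 42:
    print("valgrind reported memory errors or leaks:")
    print(result.stderr)
    sys.exit(1)
print("no memory errors detected")
```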
The latter part is what I was referring to; I initially had hope that DNCs could help with that, and I'd say that right now LLMs cannot discover this, only repeat and translate it (e.g. similar vulnerabilities discovered in the past by humans in another programming language).
I'm talking specifically about discovery here because transformers lose symbolic inference, and that's why you can't use them for exploit generation. At least I wasn't able to make them work for the DARPA challenges, and had to use an AlphaGo-based model combined with a CPPN and some techniques that worked in ES/HyperNEAT.
I suppose what I'm trying to say is that there's a missing understanding of memory and time when it comes to LLMs. That is usually manually encoded/governed, as you put it, by humans. And I would not count that as an LLM doing it, because you could have just automated the tool use without an LLM and gotten identical results (thinking e.g. of an MCP for kernel memory maps, or say valgrind or AFL, etc.).
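To illustrate the "tool use without an LLM" point: dumping a process's memory map is plain scripting against procfs; a Linux-only sketch (pid is whichever process you're inspecting):

```python
# Dump a process's memory mappings straight from procfs -- no model involved.
# Linux-only sketch; pass the pid of whatever process you're inspecting.
import os
import sys

def memory_map(pid: int) -> list[tuple[str, str, str]]:
    regions = []
    with open(f"/proc/{pid}/maps") as f:
        for line in f:
            fields = line.split()
            addr_range, perms = fields[0], fields[1]
            path = fields[5] if len(fields) > 5 else "[anon]"
            regions.append((addr_range, perms, path))
    return regions

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid()
    for addr, perms, path in memory_map(pid):
        if "x" in perms:  # executable regions are usually the interesting ones
            print(addr, perms, path)
```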
Remember, the competition here is against human penetration testers. Humans are extremely lossy testing agents!
If the threshold you're setting is "LLMs can eradicate memory disclosure bugs by statically analyzing codebases to the point of excluding those vulnerabilities as valid propositions", no, of course that isn't going to happen. But nothing on the table today can do that either! That's not the right metric.
Ha, I laughed at that one. I suppose you're right :D
If that part of the job is automated away, I wonder how the talent and skill for finding those exploits will evolve.
Agent platforms have similar modes of failure, whether it's creative writing, coding, web design, hacking, or any other sort of project scaffolding. A lot of recent research has dealt with resolving the underlying gaps in architectures and training processes, and they've had great success.
I fully expect frontier labs to have generalized methodologizing capabilities within the first half of the year, and by the end of the year, the Pro/Max/Heavy variants of the chatbots will have the capabilities baked in fully. Instead of having Codex or Artemis or Claude Code, you can just ask the model to think through and plan your project, whatever the domain, and get professional-class results, as if an expert human were orchestrating the project.
All sorts of complex visual tool use like PCB design and building plans and 3d modeling have similar process abstractions, and the decomposition and specialized task executions are very similar in principle to the generalized skills I mentioned. I think '26 is going to be exciting as hell.
Where they shine is the interpretive grunt work: "help me figure out where the auth logic is in this obfuscated blob", "make sense of this minified JS", "what's this weird binary protocol doing?", "write me a Frida script to hook these methods and dump these keys". Things that used to mean staring at code for hours or writing throwaway tooling now take a fraction of the time. They're a straight-up playing-field leveler.
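The Frida case, for instance, is exactly the kind of throwaway tooling they're quick to produce; a hand-written sketch of what that looks like with the Frida Python bindings (the package and class names are hypothetical):

```python
# Sketch: use the Frida Python bindings to hook a (hypothetical) Android method
# and dump key material. The package and class names are placeholders.
import frida

JS = """
Java.perform(function () {
    var Crypto = Java.use("com.example.app.CryptoHelper");  // hypothetical class
    Crypto.getKey.implementation = function () {
        var key = this.getKey();
        send("key: " + key);
        return key;
    };
});
"""

def on_message(message, data):
    print(message)

session = frida.get_usb_device().attach("com.example.app")  # placeholder package
script = session.create_script(JS)
script.on("message", on_message)
script.load()
input("hooks installed, press Enter to detach\n")
```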
Folks with the hacker's mindset but without the programming chops can punch above their weight and find more within the limited time of an engagement.
Sure, they make mistakes and will need a lot of babysitting. But it's getting better. I expect more firms to adopt them as part of their routine.
It might be the beer talking, but every time someone comments on AI they have to say something along the lines of "LLMs do help". If I'm being really honest, the fact that everyone has to mention this in every comment and every blog post and every presentation is because deep down everyone isn't buying it.
Wow banger of an argument.
>>It might be the beer talking, ...
I'm not really sure what you are expecting here.
Which makes me think: yes, LLMs can solve some of this, but still only some. It's more than a research tool when you combine tools and agentic workflows. I don't see a reason it should slow down.
> The productivity gains from LLMs are real, but not in the "replace humans" direction.
Meanwhile the people who are explicitly on a side either say that there are no productivity gains or that nobody will have jobs in 6 months.
For example, I tended to avoid pen testing freelance work before AI because I didn't enjoy the tedious work of reading tons of documentation about random platforms to try to understand how they worked and searching all over StackOverflow.
Now with LLMs, I can give it some random-looking error message and it can clearly and instantly tell me what the error means at a deep technical level: what engine was used, what version, what library/module... I can pen test platforms I have 0 familiarity with.
I just know a few platforms, engines, programming languages really well and I can use this existing knowledge to try to find parallels in other platforms I've never explored before.
The other day, on HackerOne, I found a pretty bad DoS vulnerability in a platform I'd never looked into before, using an engine and programming language I never used professionally; I found the issue within 1 hour of starting my search.
There are multiple factors which are pulling me into cybersecurity.
Firstly, it requires less effort from me. Secondly, the number of vulnerabilities seems to be growing exponentially... possibly in part because of AI.
Even simple stuff like training the models to recognize when they're stuck and should just go clone a repo or pull up the javadocs, instead of hallucinating their way through or trying simple internet searches, would help.
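Even without retraining, some of that can live in the scaffold; a toy sketch of a "stuck" heuristic (the threshold and the fetch_docs/ask_model hooks are made up):

```python
# Toy "am I stuck?" heuristic for an agent loop: after N consecutive failed
# attempts, stop generating and go fetch ground truth (docs, the repo) instead.
# fetch_docs() and ask_model() are placeholders for real integrations.
from collections import deque

RECENT = deque(maxlen=5)  # outcomes of the last five attempts

def record_attempt(succeeded: bool) -> None:
    RECENT.append(succeeded)

def is_stuck() -> bool:
    return len(RECENT) == RECENT.maxlen and not any(RECENT)

def next_step(task: str) -> str:
    if is_stuck():
        RECENT.clear()
        return fetch_docs(task)   # clone the repo / pull up the javadocs
    return ask_model(task)        # otherwise keep iterating with the model

def fetch_docs(task: str) -> str:
    raise NotImplementedError     # placeholder

def ask_model(task: str) -> str:
    raise NotImplementedError     # placeholder
```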
> The AI bot trounced all except one of the 10 professional network penetration testers the Stanford researchers had hired to poke and prod, but not actually break into, their engineering network.
Oh, wow!
> Artemis found bugs at lightning speed and it was cheap: It cost just under $60 an hour to run. Ragan says that human pen testers typically charge between $2,000 and $2,500 a day.
Wow, this is great!
> But Artemis wasn’t perfect. About 18% of its bug reports were false positives. It also completely missed an obvious bug that most of the human testers spotted in a webpage.
Oh, hm, did not trounce the professionals, but ok.
(There is enormous variance in what clients actually pay for work; the right thing, I think, to key off of is comp rates for people who actually deliver work.)
In the early 2000s, banks were paying ~£1000-£1200/day for pentesters from boutiques, and when I stopped being in that industry ~5 years ago it was largely the same, or even lower for larger companies that could negotiate day rates down. The Big 4 tried to charge more, but that's really tricky when you're in direct competition with boutiques who have more testers than you.
By contrast US rates were a lot higher ($2k+/day) and also scopes were larger. A UK test for a web app could be as low as 3 days (even less for unauthenticated) where the US tended to be 1-2 weeks.
One reason they've gone down is outsourcing to lower cost regions, and I'd guess that LLM/AI automation will accelerate that trend...
If this is inexpensive (in terms of cost/time) it will likely make business sense even with false positives.
> A1 cost $291.47 ($18.21/hr, or $37,876/year at 40 hours/week). A2 cost $944.07 ($59/hr, $122,720/year). Cost contributors in decreasing order were the sub-agents, supervisor and triage module. *A1 achieved similar vulnerability counts at roughly a quarter the cost of A2*. Given the average U.S. penetration tester earns $125,034/year [Indeed], scaffolds like ARTEMIS are already competitive on cost-to-performance ratio.
The statement about similar vulnerability counts seems like a straight-up lie. A2 found 11 vulnerabilities, with 9 of these being valid. A1 found 11 vulnerabilities, with 6 being valid. Counting invalid vulnerabilities to say the cheaper agent is as good is a weird choice.
Also the scoring is suspect and seems to be tuned specifically to give the AI a boost, heavily relying on severity scores.
Also kinda funny that the AIs were slower than all the human participants.
An exec is gonna read this and start salivating at the idea of replacing security teams.
I don't know enough about the low-end market to rebut you there (though: I saw what my muni paid for a bargain-basement assessment and was not OK with it), but the high end of the market definitely has not been slaughtered, and I definitely think that is coming.
Juniors will have a hard time, that I agree with. The current level of LLM findings is at their level.
I also wanted to capture what's in my head from doing bug bounties (my hobby) and 15+ years in appsec/devsecops to get it "on paper". If anyone would like to kick the tires, take a look, or tell me it's garbage feel free to email me (in my profile).
load framework, run scripts, copy-paste screenshots, give presentation.
the juniors aren't doing scoping calls and follow-ups, unless the top-kick needs explanations
I wouldn't be surprised if they get near cost parity. Maybe a 20% difference.
*Edit: the paper seems to suggest they had a 'Triager' for vulnerability verification, and obviously that didn't catch all the false positives either, ha.