Continuous Offensive Security: The Line We've Been Walking

May 27, 2026

AI Pentesting is having a moment.

Well, several moments, actually. Every other week, another vendor announces something, another LLM-driven pentesting tool tops some benchmark on a target nobody's heard of, or another deck claims that a "gold standard" is being disrupted at long last. It's been busy.

Underneath all the noise, though, there's a real reason this is happening everywhere, all at once: the reasoning capability that just made AI pentesting commercially viable is the same one attackers now have in their hands. Autonomous attackers are already probing application surfaces continuously, at machine speed, on a schedule defenders can't keep up with.

The race now is whether your offensive security testing finds the flaws before an attacker's offensive AI does. The vendor announcements aren't the real story here: they're just the market catching up to a problem that has already arrived.

Having worked for several years on Snyk API & Web, our Dynamic Security Testing product, I feel vindicated as Snyk announces Continuous Offensive Security. I want to use this post to do more than announce it: I want to explain why the line that runs from Dynamic Security Testing to AI Pentesting is one we've been walking for years, and why anyone trying to build the second without a foundation in the first is, in my honest opinion, going to hit a ceiling quickly.

Heuristic-Detectable vs Context-Dependent

Here's the distinction the whole argument rests on.

Heuristic-Detectable vulnerabilities are the ones that show themselves to deterministic tools. SAST matches patterns in the source code itself, whereas DAST takes the other approach: it throws payloads at a running application or API and observes the responses (error messages, time delays, behavioral differences, that kind of thing). Probe, observe, infer. SQL injection, cross-site scripting, misused APIs, classic injection patterns: all of these surface reliably through behavior that heuristics can recognize. Scanners and DAST tools have become very good at this over the years. Hundreds of vulnerability classes in this category are now reliably caught across the SDLC, and that's a real win.

Context-Dependent vulnerabilities are something else entirely: BOLA or IDOR, cross-tenant data leakage, authentication bypasses, and especially chained vulnerabilities, where a couple of mediums and highs combine into a critical exploit path. None of these have a heuristic signature you can probe for. You cannot write a generic rule, in SAST or in DAST, for "User A should not be able to read User B's invoice", because that rule depends on what your application is SUPPOSED to do. The vulnerability lives in the gap between intended behavior and actual behavior, and there is no probe, no payload, no signature that can infer intent from the outside.
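To make the invoice example concrete, here is a minimal Python sketch. Everything in it is hypothetical: the `OWNERS` map, the deliberately broken `fetch_invoice` stub (standing in for a real HTTP call), and the `find_bola` helper are illustrations, not anything from a real product. The point it shows is that a `200` response looks identical whether the caller owns the invoice or not; the check only becomes possible once the ownership model, the intended behavior, is known.

```python
# Hypothetical ownership model: the "intended behavior" a generic probe cannot see.
OWNERS = {"inv-1": "alice", "inv-2": "bob"}


def fetch_invoice(requesting_user: str, invoice_id: str) -> int:
    """Stand-in for a real HTTP call; returns a status code.

    This toy API is deliberately broken: it checks that the caller is
    logged in, but never checks that the caller owns the invoice.
    """
    if not requesting_user:
        return 401
    return 200 if invoice_id in OWNERS else 404


def find_bola(users: list[str]) -> list[tuple[str, str]]:
    """Return (user, invoice) pairs where a non-owner received a 200."""
    violations = []
    for user in users:
        for invoice_id, owner in OWNERS.items():
            if user != owner and fetch_invoice(user, invoice_id) == 200:
                violations.append((user, invoice_id))
    return violations


# Each user can read the other's invoice, so both pairs are reported.
violations = find_bola(["alice", "bob"])
```

Without `OWNERS`, the same requests and responses carry no signal at all, which is the gap between intended and actual behavior described above.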

This is why pentesting has always been human-led. Until recently, only humans could acquire the contextual understanding required to find this second class. Heuristics find signatures or behaviors; pentesters find what's only revealed through context. As cool as that sounds, it's not a slogan; it's how the discipline has worked for as long as it's been a discipline.

And that's the line AI just crossed.

The lineage that actually matters

Here's the trade secret that isn't actually very secret: every credible DAST engine in the last decade was built by people who came from pentesting. The Snyk API & Web engine was no exception. The team that built it had spent years finding flaws by hand, and they designed it around what pentesters actually do: recon, probe, observe, reason, escalate, validate... the whole motion. At the time, though, we could not mimic everything pentesters do. The missing piece was reasoning, and that is exactly what is now within reach, because LLMs can understand context and reason about it.

That heritage is why our BOLA detection, which we shipped last year, works the way it does. It's not pattern-matching against a signature list, as there is no signature for BOLA: it's a chain of authorization probes guided by structural reasoning about how API objects relate to identities. It's automated flaw-hunting that crosses the threshold from "What does the code do?" into "What is it meant to do, and can I subvert that intent?"

It's been working in production for our customers for months. It's the proof that the line from DAST into reasoning-based testing is walkable. It's also, not coincidentally, why we'd been thinking about building the AI version of this long before "AI Pentesting" became a category anyone was raising money for.

What changed, and why now

What changed isn't the goal, but the cost.

Reasoning at scale used to require a human pentester. $20K to $50K per engagement. Two weeks of calendar time, on average. A coverage window that closed the moment the report shipped, by which point the application had already shipped three more releases.

That was the math of manual pentesting: irreplaceable, but constrained by what constrains every artisanal craft, namely human time. Your pentest covers fifteen days a year. What's happening to the other three hundred and fifty?

AI changes the math, but not the discipline. The reasoning step that only a pentester could perform is now also something a sufficiently capable model can perform, repeatably, at a fraction of the cost.

And there's a third attack surface that AI itself created

Everything above is about an attack surface that has existed as long as web applications have: the heuristic-detectable and context-dependent vulnerabilities we described in traditional code, traditional APIs, and traditional architectures. AI changes the testing model, but it does not change the targets: these have been pentest material for two decades.

There is also an entirely new attack surface that AI itself created, and that did not exist just five years ago.

LLM-integrated applications, AI Agents calling tools, chatbots wired into customer data. Retrieval pipelines pulling from sources that an attacker can poison, inject prompts into, or misuse. Data exfiltration through model outputs, or jailbreaks that turn a customer service Agent into a privileged actor with access it was never supposed to have. Manoj's piece walked through one version of this: an AI Agent calling an API nobody had stress-tested, triggered by nothing more exotic than an email address.

These attacks are not something you can scan for, and they are not flaws in the traditional architectural sense. They live in the gap between what an LLM was prompted to do and what an attacker can convince it to do. The only way to find them is to do, against the LLM-integrated layer, what an attacker would: probe, escalate, exfiltrate, abuse.

That's the third capability inside Continuous Offensive Security: Agent Red Teaming. Multi-step adversarial simulation against LLMs, AI Agents, and the tools they call. A tool purpose-built for the attack surface AI itself created.

What I like most about how it's wired is that it isn't a separate scan you have to schedule, or a different product you have to buy. During an assessment, the recon agent detects whether the target includes LLM-integrated components, and if it does, the Red Teaming module triggers automatically. You don't have to know in advance what kind of attack surface your application presents; the system figures it out and runs the right tests against the right layer.

That matters more than it sounds at first. Most organizations now have AI somewhere in their app portfolios, but their security teams don't have a clear inventory of where. Recon-first, test-what-you-find, is the only model that scales when "Where is AI running in production?" is a continuously moving target.
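The recon-first idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the product's wiring: `ReconResult`, `plan_modules`, and the module names are all hypothetical. What it shows is that the decision to red-team is derived from what recon observed rather than from anything the user configured up front.

```python
from dataclasses import dataclass, field


@dataclass
class ReconResult:
    """Hypothetical summary of what recon observed about a target."""
    endpoints: list[str] = field(default_factory=list)
    llm_components: list[str] = field(default_factory=list)


def plan_modules(recon: ReconResult) -> list[str]:
    """Pick test modules from what recon found, not from user configuration."""
    modules = ["dynamic_testing", "ai_pentest"]
    if recon.llm_components:  # only red-team when an LLM layer was detected
        modules.append("agent_red_teaming")
    return modules


# A target with a chatbot gets the extra module; a plain API does not.
with_llm = plan_modules(ReconResult(["/api/orders"], ["support-chatbot"]))
without_llm = plan_modules(ReconResult(["/api/orders"]))
```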

So, that's the attack surface: flaws in traditional architectures and the new ground AI just created on top of them. The harder question is what it actually takes to do offensive testing well against both, continuously, at enterprise scale. Pointing an LLM at a target URL and letting it figure things out from cold is not the answer. Four things separate this from running blind.

Platform context

The naïve approach to AI pentesting is to start from scratch. Point the LLM at a URL, let it crawl, let it guess, let it burn compute on enumeration that doesn't matter, hope it eventually finds something. And worse: it has no way to distinguish a theoretical finding from one that’s actually exploitable in your stack, because it can’t see your code, your dependencies, your prior scans, your deployment environment, or your trust boundaries. That's the demo loop, but it's not how production-grade security testing should work.

The Snyk version of this looks different, though. Continuous Offensive Security starts with everything the platform already knows about your application: SAST findings, SCA dependencies, asset inventories, prior DAST scans, and risk signals from across the platform. All of it feeds the AI Pentester before it sends a single request.

That changes what the AI does on day one. Instead of "Figure out what this application is and find vulnerabilities," the instruction set becomes "This application has these components, these dependencies, these prior findings, these reachable endpoints, this risk profile, now go after what isn't already covered." The LLM stops guessing and starts working.
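As a rough illustration of that shift in instructions, the Python sketch below turns prior platform signals into a starting brief. The `PlatformContext` shape, its field names, and `build_brief` are assumptions made up for this example; the real inputs and their format are not described here. The only technique shown is subtracting what earlier scans already covered from what is reachable, so the assessment starts on the remainder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformContext:
    """Hypothetical bundle of signals gathered before any request is sent."""
    reachable_endpoints: frozenset[str]
    covered_by_prior_scans: frozenset[str]
    vulnerable_dependencies: tuple[str, ...]


def build_brief(ctx: PlatformContext) -> dict:
    """Focus the assessment on reachable endpoints not already covered."""
    focus = sorted(ctx.reachable_endpoints - ctx.covered_by_prior_scans)
    return {
        "focus_endpoints": focus,
        "known_weak_dependencies": list(ctx.vulnerable_dependencies),
    }


ctx = PlatformContext(
    reachable_endpoints=frozenset({"/login", "/invoices", "/admin/export"}),
    covered_by_prior_scans=frozenset({"/login"}),
    vulnerable_dependencies=("example-lib 1.2.3",),  # placeholder name
)
brief = build_brief(ctx)
```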

Straight from our announcement this week: "Snyk is different because we already know your code." That's the platform argument in nine words. Every other AI pentest tool starts from zero; we start from where your existing tooling left off.

This is also why I don't think pure point solutions in this category have a long runway, as you can't bolt this layer onto a fresh product. It requires a decade of accumulated context flowing in from engines that have been finding real vulnerabilities at real customers.

Hybrid Dynamic Testing and LLM detection

This is the engineering point that most new entrants gloss over, and it's where I think a lot of them are going to struggle when their funding runways and token bills meet in the middle.

The naïve way to build AI Pentesting, once again, is a pure LLM: point it at the target and let it figure everything out. That works in demos, but the economics are not sustainable.

Every payload an LLM tries against an XSS endpoint costs tokens. Every brute-forced parameter, every variation, every retry? Tokens, tokens, tokens. Dynamic Testing does all of that exhaustively, deterministically, for pennies. Burning frontier-model tokens on the bug catalog is what "subsidized at negative margins" looks like, and the vendors running that play today are going to discover what unit economics mean the day they have to be profitable.

The architecturally correct model isn't two layers working in parallel, but an LLM acting as the brain behind the assessment, with Dynamic Testing as one of the tools at its disposal. When XSS, SQL injection, or misconfigurations need to be checked, the LLM doesn't enumerate payloads itself; it calls Dynamic Testing, which has been doing exactly that for years.

That leaves the LLM free to spend its tokens where only an LLM can: reasoning about business logic, finding authorization flaws, and chaining individual findings together into actual exploit paths. That's where the majority of the compute goes. The well-trodden ground stays well-trodden, deterministically, and the reasoning surface gets every dollar of compute it deserves.

That's the architecture from day one, not a roadmap we're working toward. It's why building this with no Dynamic Security Testing engine underneath it is, in my view, a much harder problem than it looks from the outside.
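A minimal Python sketch of that division of labor follows. It is a simplification under stated assumptions: in the architecture described, the LLM itself decides when to call the tool, whereas here a static set (`HEURISTIC_CLASSES`) stands in for that decision, and both `run_dynamic_testing` and `reason_with_llm` are hypothetical stubs that only return labels. The routing is the part worth looking at: enumerable checks never reach the token-consuming path.

```python
# Hypothetical set of checks a deterministic scanner handles well.
HEURISTIC_CLASSES = {"xss", "sql_injection", "misconfiguration"}


def run_dynamic_testing(check: str, endpoint: str) -> str:
    """Stub for a deterministic scanner invoked as a tool (no tokens spent)."""
    return f"dynamic_testing:{check}:{endpoint}"


def reason_with_llm(check: str, endpoint: str) -> str:
    """Stub for an LLM reasoning step (this is where tokens would be spent)."""
    return f"llm_reasoning:{check}:{endpoint}"


def dispatch(check: str, endpoint: str) -> str:
    """Route heuristic-detectable checks to the tool, the rest to reasoning."""
    if check in HEURISTIC_CLASSES:
        return run_dynamic_testing(check, endpoint)
    return reason_with_llm(check, endpoint)


xss_result = dispatch("xss", "/search")
bola_result = dispatch("bola", "/invoices/{id}")
```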

Attack narratives, not alert lists

The output of a traditional scanner is a list of vulnerabilities, ranked by severity score, with stack traces or HTTP requests attached. Security teams have been drowning in these lists for several years now. CVSS 9.8 next to CVSS 9.6 next to CVSS 9.4... and somewhere in there, an actual exploitable path that combines two of them with a third low-severity finding the team triaged out months ago.

The context-dependent vulnerabilities I described earlier almost never appear as standalone findings; they appear as combinations: an authorization gap on one endpoint, a logic flaw in how the API issues identifiers, and a stale session that the cleanup job missed. Those are three findings that look mundane in isolation but that add up to a data breach in combination.

Continuous Offensive Security doesn't ship findings as a list. It ships them as exploit chains, the actual path from initial access to impact, with the request/response trail attached as proof. Everything reproducible and auditable. A story about what an attacker could actually do to this application, not a 400-row spreadsheet someone has to triage on a Friday afternoon.
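One way to picture an exploit chain is as a path through attacker states, where each finding moves the attacker from one state to the next. The Python sketch below is an assumption-laden toy, not the product's data model: the `Finding` fields, the state names, and the severities are all invented for illustration, loosely following the stale session, predictable identifier, and authorization gap example. It uses a plain breadth-first search to find the shortest path from initial access to impact.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    name: str
    requires: str  # attacker state needed before this finding is usable
    grants: str    # attacker state reached after exploiting it
    severity: str  # hypothetical standalone rating


FINDINGS = [
    Finding("stale session reuse", "anonymous", "valid_session", "medium"),
    Finding("predictable identifiers", "valid_session", "victim_object_id", "low"),
    Finding("missing ownership check", "victim_object_id", "victim_data", "medium"),
]


def find_chain(findings: list[Finding], start: str, goal: str) -> Optional[list[str]]:
    """Breadth-first search for the shortest sequence of findings from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == goal:
            return path
        for f in findings:
            if f.requires == state and f.grants not in seen:
                seen.add(f.grants)
                queue.append((f.grants, path + [f.name]))
    return None  # no combination of findings reaches the goal


chain = find_chain(FINDINGS, "anonymous", "victim_data")
```

None of the three toy findings is rated above medium on its own, yet together they form a complete path to another user's data, which is why the ordered path is more useful to triage than the three rows.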

Colleen Carroll, Senior Director and Information Security Officer at Emburse, made the demand-side version of this point in our announcement:

“Security teams are looking for solutions that help them prioritize real risk, not just manage more alerts. Snyk’s Continuous Offensive Security gives teams clearer visibility into exploitable vulnerabilities and how they chain together, enabling them to move faster, reduce exposure, and support innovation with confidence.”

That's the difference between "Here is what we found" and "Here is how this can be used against you."

Enterprise AI Harness

The piece that requires serious engineering work and almost never makes it into the headlines is running AI agents responsibly against systems that are one network hop from production.

Governance, scope enforcement, persistent context across long-running operations, and reproducibility, so a finding can be validated and re-examined. Reasoning traces, so a security team can audit how a conclusion was reached. The whole layer that turns "an LLM trying things" into something an enterprise security org will actually run against an application.

We call it the AI Security Harness. It's the part of Continuous Offensive Security that sits beneath that headline. It also happens to be where most of the actual difficulty lives. And where, again, the lineage matters. Building this layer is harder than building the AI on top of it, and the orgs that have been running pentesting and Dynamic Security Testing at enterprise scale for years have a non-trivial head start on the people just now arriving.
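Scope enforcement is the easiest of those properties to illustrate. The Python sketch below is a minimal example under assumptions: the `ALLOWED_HOSTS` set and the `in_scope` helper are hypothetical, and a real harness would need far more (ports, paths, rate limits, audit logging). It shows one defensible choice, which is matching the parsed hostname exactly rather than searching the URL string, so look-alike hosts and userinfo tricks do not slip through.

```python
from urllib.parse import urlsplit

# Hypothetical in-scope hosts for one engagement.
ALLOWED_HOSTS = {"staging.example.com"}


def in_scope(url: str) -> bool:
    """Allow a request only if it targets an approved host over HTTPS."""
    parts = urlsplit(url)
    # Exact hostname match: substring or prefix checks are easy to bypass.
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


allowed = in_scope("https://staging.example.com/api/invoices")
blocked = in_scope("https://staging.example.com@evil.test/")
```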

It's also where we made a deliberate architectural choice that other solutions can't easily replicate: Continuous Offensive Security doesn't run on a single frontier model. The AI Security Harness orchestrates multiple leading-edge models, defender-class models, and other open and proprietary models, tuned over time. For us, this is a decision about how an enterprise-grade AI system should be built.

Continuous Offensive Security runs a multi-model offensive security system that is purpose-built for enterprise pentesting. Frontier models execute the assessment under Snyk's offensive harness; a dedicated validation model acts as an independent judge, confirming exploitability before any finding is surfaced; and Snyk's platform intelligence grounds every attack in real application context. The result is a system tailored for precision over noise.
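The judge step can be pictured as a gate between candidate findings and what a team actually sees. In the Python sketch below, the `Candidate` shape, the `replay_succeeded` flag, and the `replay_judge` stub are assumptions for illustration only; in the system described, the judge is a dedicated validation model, not a boolean check. The structure is the point: the component that proposes a finding is not the one that decides whether it is surfaced.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """A possible finding produced by the assessment step (hypothetical shape)."""
    title: str
    replay_succeeded: bool  # did re-running the recorded exploit work again?


def replay_judge(candidate: Candidate) -> bool:
    """Stub validator: confirm only what could be reproduced."""
    return candidate.replay_succeeded


def surface(candidates: Iterable[Candidate],
            judge: Callable[[Candidate], bool]) -> list[Candidate]:
    """Drop every candidate the independent judge does not confirm."""
    return [c for c in candidates if judge(c)]


candidates = [
    Candidate("cross-tenant invoice read", replay_succeeded=True),
    Candidate("suspected admin bypass", replay_succeeded=False),
]
confirmed = surface(candidates, replay_judge)
```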

Where this lands

Gabriel Brolo, Staff Security Engineer at Yalo, put it more clearly than I could recently:

"The volume and pace of AI-generated code has fundamentally outpaced the pentesting model most of us have been running for years. We can't schedule our way out of a continuous risk surface. What we need is offensive testing that keeps up with how we actually build software today — with enough context to focus on what's genuinely exploitable, not just what's theoretically possible. This kind of capability will enable teams to have the power to better understand their threat landscape and actual risks for better mitigation."

For Snyk API & Web customers, Continuous Offensive Security is not a separate buy; rather, it's the next layer on the same outside-in testing path your Dynamic Security Testing already runs. Heuristic-detectable issues covered by the scanner, context-dependent covered by the AI, and the AI-specific attack surface covered by Red Teaming the moment recon spots an LLM in the stack. The same posture, scaled across what your applications are actually made of now.

For the market, the question stops being "Do you need AI Pentesting?" (and, well, we believe you do) and becomes "Whose AI Pentesting should I choose?"

As we announce Continuous Offensive Security this week, our hope is that anyone evaluating AI Pentesting solutions today (weighing options, sitting through demos, and trying to figure out who to trust) knows that one of the answers comes from the same people who have been in this room since well before it was even a room.

If you've followed Manoj's post on the attack surface AI is building and my previous article on what dynamic testing has to become to keep up with it, this is the third beat, and you can clearly see the line connecting them. It was always there, and we had been hinting at it all along.

We'll have more to say at Black Hat. See you all there!
