The Exploit Race

January 14, 2026
Quantstamp Announcements

Web3 is different from “normal software” for one brutal reason: bugs turn directly into money. In 2025 alone, an estimated $3.4B was stolen through crypto exploits [1]. That incentive creates a uniquely hostile environment where attackers systematize vulnerability search [2].

Recent research by Anthropic shows what happens when exploitation becomes automated: agents can test repeatedly and at scale for a rapidly falling cost, and in large-scale evaluations they have surfaced previously unknown issues [3].

When attacks scale this cheaply, obvious bugs are exploited immediately. What remains is a race: how quickly defenders adapt, and how effectively agents find what humans missed.

Scaling Defense

If attackers scale, defenders must scale too. That’s why AI security agents are quickly becoming part of the baseline. But the current landscape is noisy: benchmarks disagree, performance claims are difficult to compare, and teams struggle to evaluate which agents to trust and how to fit them into the dev cycle.

Our view is simple: if agents are going to become part of the security baseline, they must be evaluated like infrastructure. 

At Quantstamp, we’ve built a dedicated team of AI security engineers focused on benchmarking agents in realistic environments and auditing agentic systems end-to-end, from model behavior to deployment. This work is supported by research grants from OpenAI and Anthropic.

This post (the first in an evolving series) distills what we’ve learned so far: why agents are necessary, what they actually do, where they work well, and where they still fail.

Prompting LLMs is Unreliable

Most web3 developers have tried asking an LLM to “find vulnerabilities”. And it does help: models are often good at catching textbook issues, like missing access control checks or suspicious external calls. As a fast hygiene check, this is genuinely useful.

But smart contracts are a harsher environment than most software. Smart contract bugs are often easy to verify once suspected, but hard to discover in the first place. Attackers don’t need mainnet to do the discovery; they can simulate contracts and chain state locally, iterate across state variations, and then only execute a proven path on-chain. 
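
To make that loop concrete, here is a minimal sketch of off-chain discovery, assuming hypothetical fork_chain and run_candidate helpers that stand in for a real forking and simulation stack; the point is that everything stays local and cheap until a path is proven profitable.

```python
# Sketch of the discovery loop: everything below runs against a local fork,
# and only a proven, profitable path would ever touch mainnet.
# fork_chain / run_candidate are hypothetical stand-ins for a real
# forking + simulation stack.

from dataclasses import dataclass

@dataclass
class Result:
    reverted: bool
    profit_wei: int

def fork_chain(block_number: int):
    """Hypothetical: spin up a local fork of chain state at a given block."""
    raise NotImplementedError

def run_candidate(fork, call_sequence: list[bytes]) -> Result:
    """Hypothetical: replay a sequence of calls on the fork and measure profit."""
    raise NotImplementedError

def search(block_number: int, candidates: list[list[bytes]]):
    fork = fork_chain(block_number)
    for seq in candidates:
        result = run_candidate(fork, seq)
        if not result.reverted and result.profit_wei > 0:
            return seq          # proven path, ready to execute on-chain
    return None                 # nothing found at this state; mutate and retry
```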

In that world, it’s not enough for an LLM to be occasionally insightful; you need outputs that are consistent and repeatable.

“Raw prompting” usually fails in two predictable ways:

- Inconsistency: the same contract can produce different findings on every run, which makes results hard to trust or compare.
- Unverified output: confident-sounding reports mix real issues with false alarms, and nothing in the loop checks which is which.

This is where raw prompting hits a ceiling: it’s a strong assistant, but not a security workflow. Agents exist to make this operational - standardized prompts, scoped context, multi-pass checks, and verification loops - so results are reproducible, comparable over time, and steadily improvable.

Turning AI into a Security Workflow

A security agent is basically an LLM wrapped in a repeatable process: it decides what to check, runs multiple targeted passes, pulls in only the context it needs, calls tools where they help, and tries to validate findings before reporting them.
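
A minimal sketch of that structure, assuming hypothetical llm, static_analyze, and simulate_poc helpers rather than any particular framework; what matters is the fixed workflow around the flexible model.

```python
# Minimal agent loop: fixed workflow around a flexible model.
# llm(), static_analyze() and simulate_poc() are hypothetical stubs; the
# structure (scoped passes -> tool checks -> validation -> report) is the point.

CHECK_PASSES = ["access control", "reentrancy", "arithmetic", "oracle manipulation"]

def llm(prompt: str) -> list[dict]:
    """Hypothetical model call returning candidate findings as dicts."""
    raise NotImplementedError

def static_analyze(source: str) -> list[dict]:
    """Hypothetical tool call, e.g. a static analyzer run on the contract."""
    raise NotImplementedError

def simulate_poc(source: str, finding: dict) -> bool:
    """Hypothetical validation step: does a PoC for this finding actually work?"""
    raise NotImplementedError

def audit(source: str) -> list[dict]:
    findings = []
    findings += static_analyze(source)                        # cheap tool pass first
    for check in CHECK_PASSES:                                # targeted, scoped passes
        prompt = f"Audit the following contract for {check} issues only:\n{source}"
        findings += llm(prompt)
    return [f for f in findings if simulate_poc(source, f)]   # validate before reporting
```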

A single prompt - no matter how well written - forces the model to choose everything at once: what matters, what doesn’t, which hypotheses to pursue, and when to stop. Agents shift that balance. You define the workflow (checks, sequencing, verification, reporting), and the model keeps flexibility where it’s actually valuable: interpreting code, generating hypotheses, and adapting based on evidence.

Optimizing an agent

A single prompt can already be structured: you can chain prompts, add checklists, and iterate manually. The point of an agent is that this structure becomes systematic: a workflow you can run consistently, benchmark, and evolve. This way, results depend on a process you control.

That’s why there’s no single “right” way to build an agent. Most teams in this space are exploring a large design space: which strategies to use, in what order, with what tools, and with what stopping criteria. The same base model can behave very differently depending on these choices [4][6].

Here are the main levers agents are built from (a configuration sketch follows):

- Strategy and sequencing: which checks to run, in what order, and whether to do one broad sweep or multiple targeted passes.
- Context management: how much of the codebase the model sees at each step, and how it’s sliced.
- Tooling: which external tools the agent can call (e.g. analyzers or simulators) and when.
- Verification: whether findings must be validated, for example against a simulation, before they’re reported.
- Stopping criteria and reporting: when the agent decides it’s done, and how findings are de-duplicated, ranked, and presented.
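
To make those levers concrete, here is a hypothetical configuration sketch; the field names and defaults are illustrative, not any specific product’s API.

```python
# Hypothetical agent configuration: each field corresponds to one of the
# levers above. Two configs with the same base model can behave very differently.

from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    model: str = "some-base-model"                  # base LLM (placeholder name)
    passes: list[str] = field(default_factory=lambda: [
        "access control", "reentrancy", "arithmetic",
    ])                                              # strategy: which checks, in what order
    max_context_lines: int = 800                    # context scoping per pass
    tools: list[str] = field(default_factory=lambda: [
        "static-analyzer", "fork-simulator",
    ])                                              # tools the agent may call
    require_validated_poc: bool = True              # verification before reporting
    max_iterations: int = 20                        # stopping criterion
```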

However, even the best agents still have blind spots, and they’re the ones that tend to matter most in high-value protocols.

Current Landscape & Benchmarks

Now that we’ve established why agents are the right direction for web3 security, the obvious question is: which agents can you actually trust? That’s where the landscape gets messy: it’s surprisingly hard to measure performance in a way that’s fair and comparable.

One unexpected divergence is the benchmark evaluator. Some public evaluators score a finding as a true positive because it looks “close enough” to something in the ground truth - even if the location is different or the attack scenario doesn’t match. That sounds reasonable until you look closer: a finding isn’t just a label, it’s a specific claim: this function, this path, these preconditions, this is how value moves. If an agent flags reentrancy in the wrong place, it’s still a false alarm for the developer… even if the contract happens to have a real reentrancy bug somewhere else. Such evaluators quietly inflate scores and make tools look more interchangeable than they actually are.
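
A simplified sketch of the stricter matching rule we mean, assuming findings carry a vulnerability class and a location; the field names are illustrative, not our evaluator’s actual schema.

```python
# Strict matching: a reported finding only counts as a true positive if it
# names the same vulnerability class AND the same location as a ground-truth
# issue. Field names are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    vuln_class: str      # e.g. "reentrancy"
    contract: str
    function: str

def is_true_positive(reported: Finding, ground_truth: list[Finding]) -> bool:
    return any(
        reported.vuln_class == gt.vuln_class
        and reported.contract == gt.contract
        and reported.function == gt.function
        for gt in ground_truth
    )

# A "close enough" matcher that ignores location would mark a reentrancy
# report in the wrong function as correct; this one does not.
```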

That’s why we built our own evaluator: to get the right metrics for comparing agents and to put publicly claimed benchmarks in perspective.

With cleaner evaluation, the first observation is obvious: agents are optimized for different users. Some aim for precision (fewer false positives), which is great for developers who need a signal they can act on quickly. Others aim for recall (catching more real issues) and tolerate more noise, which often suits security teams willing to validate and dig deeper. Over time, we expect the best agents to push both higher, but right now it’s still a real tradeoff.
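
For reference, the two metrics in question, with a purely illustrative (made-up) comparison of two agents at opposite ends of the tradeoff.

```python
# Precision vs. recall from a scored run: two agents can share a base model
# and still sit at opposite ends of this tradeoff.

def precision(true_pos: int, false_pos: int) -> float:
    return true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0

def recall(true_pos: int, false_neg: int) -> float:
    return true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0

# Illustrative example: agent A reports 10 findings, 8 real (precision 0.8)
# but misses 12 of 20 known issues (recall 0.4); agent B reports 40 findings,
# 14 real (precision 0.35) and catches 14 of 20 (recall 0.7).
```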

It’s also not “one agent fits all” (yet). Performance shifts with ecosystem, contract style, and protocol complexity. An agent that looks strong on common, well-known patterns (think standard token/NFT implementations) can degrade fast when you move to complex systems like lending protocols. We have even seen agents fall apart entirely on novel designs where there isn’t much prior data and the attack surface depends on deeper reasoning (e.g., unusual account abstraction setups). In other words, today’s agents are good at pattern spotting and sanity checks, and much less reliable when the job is to find creative attack vectors or perform an exhaustive audit.

To get a clearer signal, we benchmarked multiple accessible agents using our own evaluation system on a contract set we believe is out-of-distribution for current models, bucketed into Easy / Medium / Hard. We’ll likely publish deeper numbers later, but the qualitative takeaway is already useful: tools are not converging to one “best agent.” They’re differentiating by noise tolerance, by workflow quality, and by which protocol families they handle well.

So the path forward isn’t more hype or more one-off leaderboards. It’s independent, verifiable benchmarking that makes the tradeoffs explicit: precision vs recall, evidence-backed vs pattern-only findings, and strengths by protocol type and ecosystem. Ideally, developers shouldn’t need to become benchmark experts just to pick a tool. They should have clear, comparable, reproducible statistics that make AI security feel like dependable infrastructure.

Advice for Devs

At their current stage, AI security agents give developers real value when treated like a CI check: they surface leads continuously, but they don’t certify safety. Agents make vulnerability discovery cheap and repeatable on the defender side too, which is how you keep up. Our best-practice advice: turn them into a routine of consistent runs, triage, and validation.

Here’s a workflow that we’ve found useful (a minimal CI-style sketch follows):

- Run agents consistently - on every meaningful change, not just once before an audit - so results are comparable over time.
- Triage quickly: separate evidence-backed findings from pattern-only flags.
- Validate before you patch: reproduce the issue with a failing test or proof-of-concept so you’re fixing real behavior, not a false alarm.
- Track what each run catches and misses, so you can tell whether your chosen agent actually fits your protocol type.
- Escalate what agents can’t settle - design questions, economic assumptions - to human review.
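
As a deliberately simplified sketch of the “CI check” framing, here is a gate script that assumes a hypothetical run_agent wrapper around whichever tool you use.

```python
# Minimal CI-style gate (illustrative): run the agent, keep only validated,
# high-severity findings, and fail the build if any remain.
# run_agent() is a hypothetical wrapper around your chosen security agent.

import json
import sys

def run_agent(contract_dir: str) -> list[dict]:
    """Hypothetical: invoke your security agent and return its findings."""
    raise NotImplementedError

def gate(contract_dir: str) -> int:
    findings = run_agent(contract_dir)
    blocking = [
        f for f in findings
        if f.get("severity") in {"high", "critical"} and f.get("validated")
    ]
    print(json.dumps(blocking, indent=2))
    return 1 if blocking else 0     # non-zero exit fails the CI job

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1] if len(sys.argv) > 1 else "contracts"))
```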

Done right, this workflow removes low-hanging bugs quickly and saves human time for what agents still miss: economic edge cases, design flaws, and spec–implementation mismatches. Use agents to raise your baseline so auditors can focus on non-textbook risk.

We’re here to help

AI agents are becoming the new baseline for smart contract development. Used well, they reduce attack surface before audits even begin. But as the exploit race accelerates, the uncomfortable truth remains: the failures that cost the most are still the ones agents miss - design flaws, economic edge cases, and spec ↔ implementation mismatches that only show up when you threat-model the full system.

That’s where we focus:

- Audits that threat-model the full system: design, economic assumptions, and how the spec maps to the implementation.
- Benchmarking AI security agents in realistic environments, so the tradeoffs between tools are explicit and verifiable.
- Auditing agentic systems end-to-end, from model behavior to deployment.

The strongest security posture today is layered: agents for speed and coverage, and experienced auditors for adversarial reasoning and assurance where it matters most.

Request an audit, and we’ll help you design that workflow – and ship with confidence.

Sources

[1] https://www.chainalysis.com/blog/crypto-hacking-stolen-funds-2026
[2] https://www.chainalysis.com/blog/organized-crime-crypto/
[3] https://red.anthropic.com/2025/smart-contracts/
[4] Logic Meets Magic: LLMs Cracking Smart Contract Vulnerabilities — https://arxiv.org/html/2501.07058v1
[5] LLM-SmartAudit: Advanced Smart Contract Vulnerability Detection — https://arxiv.org/abs/2410.09381
[6] Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities — https://arxiv.org/html/2409.10574v2
[7] SCALM: Detecting Bad Practices in Smart Contracts Through LLMs — https://arxiv.org/abs/2502.04347
[8] CKG-LLM: LLM-Assisted Detection of Smart Contract Access Control Vulnerabilities Based on Knowledge Graphs — https://arxiv.org/html/2512.06846v1
[9] LLM-BSCVM: Large Language Model-Based Smart Contract Vulnerability Management Framework
[10] SmartLLM: Smart Contract Auditing using Custom Generative AI — https://arxiv.org/abs/2406.09677
[11] Vul-RAG: Leveraging Knowledge-Level RAG for Vulnerability Detection
[12] Prompt to Pwn (ReX): Automated Exploit Generation for Smart Contracts — https://arxiv.org/abs/2408.04556
[13] Identifying Smart Contract Security Issues in Code Snippets from Stack Overflow (introduces SOChecker) — https://arxiv.org/abs/2407.13271
