BlockSec just poured some cold water on one of the louder narratives in crypto security right now.
Not killed it.
But definitely checked it.
The claim was simple: AI could soon automate smart contract auditing end-to-end. That idea got real traction after OpenAI and Paradigm introduced EVMBench, a benchmark designed to test how well AI agents can detect, patch, and even exploit vulnerabilities in smart contracts.
The headline numbers were strong.
AI agents exploited 72% of vulnerabilities.
They detected around 45%.
That’s the kind of stat that makes people start talking about replacing auditors.
BlockSec wasn’t convinced.
The Retest That Changed the Tone
Researchers at BlockSec ran their own version of the benchmark. Same idea. Different setup.
They expanded the number of model configurations from 14 to 26. Mixed models with different scaffolds. Tested combinations like Claude running on ChatGPT-style agent frameworks.
Why?
Because, as they put it, you can’t tell whether performance comes from the model itself or the structure around it.
Fair point.
Then they did something more important.
They changed the data.
Instead of testing known vulnerabilities from Code4rena repositories—which may already be floating around in training datasets—they pulled in 22 real-world incidents that happened after mid-February 2026.
Fresh exploits. No training leakage.
That’s where things broke.
Exploitation Rate: From 72% to Zero
BlockSec tested 110 agent-incident pairs.
Five agents.
Twenty-two real-world exploits.
Result?
Zero successful end-to-end exploitations.
Not “lower performance.”
Not “slightly worse.”
Zero.
That’s not a rounding error. That’s a gap.
In my experience, when results swing that hard after changing the dataset, the issue isn’t incremental—it’s structural.
Detection Still Works… Kind Of
To be fair, detection didn’t collapse the same way.
BlockSec found that AI models still performed reasonably well at spotting known vulnerability patterns. Claude Opus 4.6 led the group, detecting 13 out of 20 real-world issues.
And there’s a pattern here.
Some vulnerabilities were picked up by almost every agent—things like:
- unchecked multiplication overflow (see the sketch below)
- reserve manipulation patterns
Others?
Missed completely.
Four incidents weren’t detected by any model.
Five were caught by only one agent.
That’s not randomness. That’s brittleness.
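To make "pattern" concrete: here's a rough Python sketch of the unchecked-multiplication-overflow class from the list above, mimicking EVM-style uint256 wrapping. The function names and numbers are hypothetical, for illustration only—this is not code from the BlockSec report or from any affected contract.

```python
# Hypothetical illustration: EVM-style uint256 arithmetic mimicked in
# Python, to show the overflow class that nearly every agent flagged.

UINT256_MAX = 2**256 - 1

def unchecked_mul(a: int, b: int) -> int:
    # Pre-0.8 Solidity (or an `unchecked` block): the product silently
    # wraps modulo 2**256 instead of reverting.
    return (a * b) & UINT256_MAX

def checked_mul(a: int, b: int) -> int:
    # Solidity >= 0.8 default arithmetic: overflow reverts the call.
    product = a * b
    if product > UINT256_MAX:
        raise OverflowError("multiplication overflow")
    return product

# Classic shape of the bug: a reward or share calculation wraps, and the
# result comes out tiny (or zero) instead of failing loudly.
shares, rate = 2**200, 2**60
print(unchecked_mul(shares, rate))  # 0 -- silently wrapped
# checked_mul(shares, rate)         # would raise OverflowError
```

A signature this well documented is exactly the kind of thing pattern recognition catches.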
What This Actually Means
This is where a lot of people get it wrong.
The takeaway isn’t:
“AI doesn’t work.”
It’s this:
AI is very good at pattern recognition.
It’s still weak at adversarial reasoning.
Those are not the same skill.
Spotting a known exploit signature is one thing.
Understanding how a new attack might unfold inside a complex protocol is something else entirely.
And smart contract exploits are rarely clean textbook cases. They’re messy. Context-heavy. Sometimes weirdly creative.
That’s where humans still dominate.
The Data Contamination Problem
BlockSec also raised a point that’s been quietly bothering a lot of researchers.
If you test AI on known vulnerabilities—especially ones published in public audit repos—there’s a non-trivial chance the model has already seen similar patterns during training.
Not memorization in the simple sense. But familiarity.
That inflates results.
By switching to incidents from after mid-February 2026, BlockSec effectively removed that advantage. And the performance drop tells you how much it mattered.
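The control itself is simple to picture. Here's a minimal Python sketch of the filtering idea; the incident records and the exact cutoff date are made up for illustration, not taken from BlockSec's dataset.

```python
# Sketch of the contamination control: evaluate only on incidents
# disclosed after the models' training cutoff, so the model cannot have
# seen the exploit (or writeups of it) during training. All records and
# the cutoff date below are hypothetical.
from dataclasses import dataclass
from datetime import date

@dataclass
class Incident:
    name: str
    disclosed: date  # first public disclosure of the exploit

TRAINING_CUTOFF = date(2026, 2, 15)  # assumed cutoff, for illustration

incidents = [
    Incident("old-code4rena-finding", date(2024, 6, 1)),  # may be in training data
    Incident("fresh-mainnet-exploit", date(2026, 3, 9)),  # cannot have been seen
]

# Anything disclosed before the cutoff may already be "familiar" to the
# model, so only post-cutoff incidents make a clean test set.
clean_eval_set = [i for i in incidents if i.disclosed > TRAINING_CUTOFF]
print([i.name for i in clean_eval_set])  # ['fresh-mainnet-exploit']
```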
Why This Matters for the Industry
There’s been a growing narrative that AI auditing is right around the corner. Fully automated. Faster. Cheaper. Scalable.
This pushes back on that.
Hard.
Even BlockSec isn’t dismissing EVMBench. In fact, they called it a valuable contribution. It gives the industry a standardized way to evaluate models.
But the conclusion is much more grounded.
AI is not replacing auditors anytime soon.
Not close.
The Real Model: Human + Machine
Yajin Zhou, BlockSec’s co-founder, framed it pretty cleanly.
AI handles breadth.
Humans handle depth.
That tracks.
AI can scan large codebases quickly. Flag patterns. Surface potential issues. Do the repetitive work that slows human auditors down.
But when it comes to:
- understanding protocol design
- anticipating attacker behavior
- reasoning through edge cases
That still sits with humans.
And honestly, that’s not surprising.
If you’ve ever looked at a real exploit post-mortem, you know they don’t look like clean textbook examples. They look like someone found a weird edge and pushed it until it broke.
Machines aren’t great at that yet.
Where This Leaves AI Auditing
AI auditing has real value. That part isn’t in question.
It improves coverage.
Speeds up reviews.
Reduces obvious misses.
But the idea that it can run end-to-end audits—or worse, autonomously exploit contracts in the wild—doesn’t hold up under more realistic testing.
At least not yet.
Maybe that changes in a few years.
But right now?
It’s a tool. Not a replacement.
And if anything, this study clarifies the direction.
Not AI vs humans.
AI with humans.
That’s the only setup that actually works today.
