Ethereum co-founder Vitalik Buterin

BlockSec just poured some cold water on one of the louder narratives in crypto security right now.

Not killed it.
But definitely checked it.

The claim was simple: AI could soon automate smart contract auditing end-to-end. That idea got real traction after OpenAI and Paradigm introduced EVMBench, a benchmark designed to test how well AI agents can detect, patch, and even exploit vulnerabilities in smart contracts.

The headline numbers were strong.

AI agents exploited 72% of vulnerabilities.
They detected around 45%.

That’s the kind of stat that makes people start talking about replacing auditors.

BlockSec wasn’t convinced.


The Retest That Changed the Tone

Researchers at BlockSec ran their own version of the benchmark. Same idea. Different setup.

They expanded the number of model configurations from 14 to 26. Mixed models with different scaffolds. Tested combinations like Claude running on ChatGPT-style agent frameworks.

Why?

Because, as they put it, you can’t tell whether performance comes from the model itself or the structure around it.

Fair point.

Then they did something more important.

They changed the data.

Instead of testing known vulnerabilities from Code4rena repositories—which may already be floating around in training datasets—they pulled in 22 real-world incidents that happened after mid-February 2026.

Fresh exploits. No training leakage.

That’s where things broke.


Exploitation Rate: From 72% to Zero

BlockSec tested 110 agent-incident pairs.

Five agents.
Twenty-two real-world exploits.

Result?

Zero successful end-to-end exploitations.

Not “lower performance.”
Not “slightly worse.”

Zero.

That’s not a rounding error. That’s a gap.

In my experience, when results swing that hard after changing the dataset, the issue isn’t incremental—it’s structural.


Detection Still Works… Kind Of

To be fair, detection didn’t collapse the same way.

BlockSec found that AI models still performed reasonably well at spotting known vulnerability patterns. Claude Opus 4.6 led the group, detecting 13 out of 20 real-world issues.

And there’s a pattern here.

Some vulnerabilities were picked up by almost every agent—things like:

  • unchecked multiplication overflow
  • reserve manipulation patterns

Others?

Missed completely.

Four incidents weren’t detected by any model.
Five were caught by only one agent.

That’s not randomness. That’s brittleness.
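For context on the first pattern above: Solidity versions before 0.8 (and any code inside an `unchecked` block) let 256-bit multiplication silently wrap around instead of reverting, which is exactly the kind of signature a pattern-matcher can learn. The sketch below simulates that wraparound in Python; the function names and the mint-pricing scenario in the comment are illustrative, not taken from any real incident.

```python
# Simulates EVM uint256 multiplication. In Solidity <0.8 or inside an
# `unchecked` block, the product wraps modulo 2**256 instead of reverting.
UINT256_MAX = 2**256 - 1

def unchecked_mul(a: int, b: int) -> int:
    """Multiply two uint256 values with silent wraparound (unchecked behavior)."""
    return (a * b) & UINT256_MAX

def checked_mul(a: int, b: int) -> int:
    """Multiply with Solidity >=0.8 default behavior: revert on overflow."""
    result = a * b
    if result > UINT256_MAX:
        raise OverflowError("uint256 multiplication overflow")
    return result

# A large amount times a large price wraps to a tiny number -- here,
# exactly zero -- which an attacker could exploit (e.g. paying nothing
# for a mint whose cost is computed as amount * price).
amount = 2**200
price = 2**60
total_cost = unchecked_mul(amount, price)  # 2**260 wraps to 0
```

Because the bug reduces to a single recognizable arithmetic shape, it is unsurprising that nearly every agent flagged it, while the messier, protocol-specific incidents slipped through.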


What This Actually Means

This is where a lot of people get it wrong.

The takeaway isn’t:

“AI doesn’t work.”

It’s this:

AI is very good at pattern recognition.
It’s still weak at adversarial reasoning.

Those are not the same skill.

Spotting a known exploit signature is one thing.
Understanding how a new attack might unfold inside a complex protocol is something else entirely.

And smart contract exploits are rarely clean textbook cases. They’re messy. Context-heavy. Sometimes weirdly creative.

That’s where humans still dominate.


The Data Contamination Problem

BlockSec also raised a point that’s been quietly bothering a lot of researchers.

If you test AI on known vulnerabilities—especially ones published in public audit repos—there’s a non-trivial chance the model has already seen similar patterns during training.

Not memorization in the simple sense. But familiarity.

That inflates results.

By switching to incidents from after the mid-February 2026 cutoff, BlockSec effectively removed that advantage. And the performance drop tells you how much it mattered.
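The cutoff idea itself is simple to express. The sketch below is a hypothetical illustration of a temporal filter, not BlockSec's actual tooling; the incident names, dates, and field layout are all invented for the example.

```python
from datetime import date

# Hypothetical incident records. In practice these would come from
# post-mortem databases, not a hardcoded list.
INCIDENTS = [
    {"name": "lending-pool-drain", "date": date(2025, 11, 3)},
    {"name": "oracle-manipulation", "date": date(2026, 3, 14)},
    {"name": "reentrancy-vault", "date": date(2026, 5, 2)},
]

# "Mid-February 2026" per the BlockSec retest described above.
TRAINING_CUTOFF = date(2026, 2, 15)

def contamination_free(incidents, cutoff):
    """Keep only incidents that occurred after the model's training cutoff,
    so the model cannot have seen them (or their write-ups) during training."""
    return [i for i in incidents if i["date"] > cutoff]

eval_set = contamination_free(INCIDENTS, TRAINING_CUTOFF)
```

The filter is trivial; the hard part is what it reveals. Any benchmark score that survives this filter reflects reasoning, and any score that collapses under it reflected familiarity.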


Why This Matters for the Industry

There’s been a growing narrative that AI auditing is right around the corner. Fully automated. Faster. Cheaper. Scalable.

This pushes back on that.

Hard.

Even BlockSec isn’t dismissing EVMBench. In fact, they called it a valuable contribution. It gives the industry a standardized way to evaluate models.

But the conclusion is much more grounded.

AI is not replacing auditors anytime soon.

Not close.


The Real Model: Human + Machine

Yajin Zhou, BlockSec’s co-founder, framed it pretty cleanly.

AI handles breadth.
Humans handle depth.

That tracks.

AI can scan large codebases quickly. Flag patterns. Surface potential issues. Do the repetitive work that slows human auditors down.

But when it comes to:

  • understanding protocol design
  • anticipating attacker behavior
  • reasoning through edge cases

That still sits with humans.

And honestly, that’s not surprising.

If you’ve ever looked at a real exploit post-mortem, you know they don’t look like clean textbook examples. They look like someone found a weird edge and pushed it until it broke.

Machines aren’t great at that yet.


Where This Leaves AI Auditing

AI auditing has real value. That part isn’t in question.

It improves coverage.
Speeds up reviews.
Reduces obvious misses.

But the idea that it can run end-to-end audits—or worse, autonomously exploit contracts in the wild—doesn’t hold up under more realistic testing.

At least not yet.

Maybe that changes in a few years.

But right now?

It’s a tool. Not a replacement.


And if anything, this study clarifies the direction.

Not AI vs humans.

AI with humans.

That’s the only setup that actually works today.

 

Disclaimer

This article is for informational and educational purposes only and does not constitute financial, investment, trading, or legal advice. Cryptocurrencies, memecoins, and prediction-market positions are highly speculative and involve significant risk, including the potential loss of all capital.

The analysis presented reflects the author’s opinion at the time of writing and is based on publicly available information, on-chain data, and market observations, which may change without notice. No representation or warranty is made regarding accuracy, completeness, or future performance.

Readers are solely responsible for their investment decisions and should conduct their own independent research and consult a qualified financial professional before engaging in any trading or betting activity. The author and publisher hold no responsibility for any financial losses incurred.

By Shane Neagle

Shane Neagle is a financial markets analyst and digital assets journalist specializing in cryptocurrencies, memecoins, prediction markets, and blockchain-based financial systems. His work focuses on market structure, incentive design, liquidity dynamics, and how speculative behavior emerges across decentralized platforms. He closely covers emerging crypto narratives, including memecoin ecosystems, on-chain activity, and the role of prediction markets in pricing political, economic, and technological outcomes. His analysis examines how capital flows, trader psychology, and platform design interact to create rapid market cycles across Web3 environments. Alongside digital assets, Shane follows broader fintech and online trading developments, particularly where traditional financial infrastructure intersects with blockchain technology. His research-driven approach emphasizes understanding why markets behave the way they do, rather than short-term price movements, helping readers navigate fast-evolving crypto and speculative markets with clearer context.
