
Errors Found in OpenAI’s Blockchain Benchmark
OpenZeppelin audited OpenAI's EVMbench, finding errors and data issues.
The cybersecurity firm OpenZeppelin has audited OpenAI’s new AI benchmark, EVMbench. Experts identified methodological errors and data “contamination.”
— OpenZeppelin (@OpenZeppelin) March 2, 2026
The developer of ChatGPT launched EVMbench in mid-February in partnership with the investment fund Paradigm to assess the ability of AI agents to find, fix, and exploit vulnerabilities in smart contracts.
OpenZeppelin specialists welcomed the initiative but chose to evaluate it against the same standards applied to the protocols the firm protects, including Aave, Lido, and Uniswap.
Key Shortcomings
The main issue concerns “contamination” of training data. EVMbench is built from a set of 120 vulnerabilities identified during audits in 2024–2025.
However, the leading models tested have a knowledge cut-off of August 2025, so they could have “recalled” information about these vulnerabilities from their training data. Even with internet access disabled, this undermines the purity of the experiment: it is unclear whether the AI can genuinely identify new threats.
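The contamination concern can be made concrete: a benchmark can filter out any vulnerability disclosed before the tested model’s knowledge cut-off, so the model cannot simply recall it. A minimal sketch of such a filter, with all names and dates purely illustrative (this is not EVMbench’s actual data format):

```python
from datetime import date

# Hypothetical benchmark entries: each vulnerability carries its
# public disclosure date (structure is illustrative only).
cases = [
    {"id": "vuln-001", "disclosed": date(2024, 6, 10)},
    {"id": "vuln-002", "disclosed": date(2025, 11, 3)},
]

# Knowledge cut-off of the model under test (assumed value).
MODEL_CUTOFF = date(2025, 8, 1)

# Keep only vulnerabilities disclosed after the cut-off, so the model
# cannot have seen them in its training data.
clean_cases = [c for c in cases if c["disclosed"] > MODEL_CUTOFF]

print([c["id"] for c in clean_cases])  # → ['vuln-002']
```

In this sketch only `vuln-002` survives the filter; under OpenZeppelin’s critique, all 120 EVMbench cases would fall on the pre-cut-off side of such a check.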
OpenZeppelin also pointed out factual errors in the EVMbench dataset: at least four vulnerabilities classified as “high risk” turned out to be non-functional, yet AI agents still received credit for supposedly detecting them accurately.
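One safeguard against the mislabeled entries OpenZeppelin describes is to award credit only for vulnerabilities whose exploit is independently confirmed to reproduce. A minimal sketch of such gated scoring, with hypothetical names throughout (not OpenAI’s or OpenZeppelin’s actual tooling):

```python
def score_detection(reported: set[str], exploit_works: dict[str, bool]) -> int:
    """Count only reported vulnerabilities whose described attack
    actually reproduces (per a hypothetical validation flag)."""
    return sum(1 for v in reported if exploit_works.get(v, False))

# Ground truth maps vulnerability id -> "attack reproduces in practice".
# Here, overflow-02 stands in for a non-functional "high risk" entry.
truth = {"reentrancy-01": True, "overflow-02": False}

print(score_detection({"reentrancy-01", "overflow-02"}, truth))  # → 1
```

With this gate, an agent flagging the non-functional `overflow-02` would receive no credit for it, which is the behavior OpenZeppelin argues a benchmark should have.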
“These are not subjective disagreements about severity; these are cases where the described attack simply does not work,” the experts emphasized.
Specialists affirmed that artificial intelligence will play a key role in the future of blockchain security, but warned that the rush to deploy it must not come at the expense of data and test quality.
“The question is not whether AI will change the security of smart contracts—it will. The question is whether the benchmarks and data on which we build these tools will meet the same standards as the contracts they are meant to protect,” concluded OpenZeppelin.
Back in November, Microsoft experts introduced a testing environment for AI agents and identified vulnerabilities inherent in modern digital assistants.