OpenZeppelin Uncovers Critical Flaws in OpenAI's EVMbench Dataset

Blockchain security firm OpenZeppelin has identified training data leaks and a minimum of four incorrectly labeled high-severity vulnerabilities in the EVMbench dataset.

Security specialists at OpenZeppelin, a prominent blockchain security company, say their audit of EVMbench, OpenAI's recently introduced AI blockchain security benchmark, uncovered significant methodological problems and data contamination issues.

EVMbench was unveiled in mid-February in collaboration with Paradigm, a cryptocurrency investment firm. The benchmark is designed to assess how well AI models can detect vulnerabilities in smart contracts, patch them and exploit them.

In a statement published on X on Monday, OpenZeppelin welcomed the initiative but said it had recently decided to subject EVMbench to "the same scrutiny" it applies to the protocols it helps secure, which include major decentralized finance platforms such as Aave, Lido and Uniswap.

OpenZeppelin said its audit identified two primary concerns: training data contamination and the misclassification of multiple high-severity vulnerabilities.

"We reviewed the dataset and identified methodological flaws and invalid vulnerability classifications including at least four issues labeled high severity that are not exploitable in practice," OpenZeppelin said.

[Image. Source: OpenZeppelin]

EVMbench's public release included an assessment of AI agents' theoretical ability to exploit vulnerabilities in smart contracts. Among the tested models, Anthropic's Claude Opus 4.6 ranked first, OpenAI's OC-GPT-5.2 second and Google's Gemini 3 Pro third.

EVMbench testing may need revising

On data contamination, OpenZeppelin emphasized that the most critical capability for "AI security is finding novel vulnerabilities in code the model has never seen before."

However, OpenZeppelin found that the top-scoring AI agents in EVMbench's evaluation had "likely been exposed to the benchmark's vulnerability reports during pretraining."

During EVMbench's testing, the AI agents had no internet access, so they could not simply search online for answers to the challenges. However, the benchmark was built from curated vulnerabilities drawn from 120 audits conducted between 2024 and mid-2025, while the training cutoffs of the evaluated AI agents were typically mid-2025.

Because those audits predate the cutoffs, there is a substantial chance that the AI agents had already encountered the solutions to the challenges during training.
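The timing overlap can be illustrated with a minimal Python sketch. It is not OpenZeppelin's or OpenAI's method: the `BenchmarkItem` type, the `possibly_contaminated` helper, the item names and all dates are hypothetical, and the check only compares a report's publication date against an assumed training cutoff. A report that predates the cutoff could have been in the training data, but the check cannot show that it was.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark task and the date its audit report went public (hypothetical fields)."""
    item_id: str
    report_published: date


def possibly_contaminated(items, training_cutoff):
    """Return the items whose reports were public on or before the training cutoff."""
    return [item for item in items if item.report_published <= training_cutoff]


# Made-up example data; none of it is taken from EVMbench.
items = [
    BenchmarkItem("audit-a", date(2024, 3, 1)),
    BenchmarkItem("audit-b", date(2025, 5, 15)),
    BenchmarkItem("audit-c", date(2025, 9, 1)),
]
cutoff = date(2025, 6, 30)  # assumed stand-in for a "mid-2025" cutoff

flagged = possibly_contaminated(items, cutoff)
contamination_ratio = len(flagged) / len(items)
print([item.item_id for item in flagged], round(contamination_ratio, 2))
```

In this made-up example, two of the three items fall before the cutoff. In a benchmark where nearly every source report predates the cutoff, the same check would flag nearly every item, which is the situation OpenZeppelin describes.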

"While this does not necessarily enable the model to identify the issue immediately, it reduces the quality of the test. The dataset's limited size further narrows the evaluation surface, making these contamination concerns more significant," OpenZeppelin said.

OpenZeppelin also flagged substantial factual errors in EVMbench's dataset, arguing that multiple "high-severity vulnerabilities" were misclassified and are in fact invalid.

According to OpenZeppelin, its team identified at least four vulnerabilities that EVMbench labeled high severity but that cannot be exploited in practice. Even so, EVMbench awarded points to AI agents for "finding" these invalid vulnerabilities.

"These aren't subjective severity disagreements, they are findings where the described exploit doesn't work."

OpenZeppelin concluded that AI will play a crucial role in strengthening blockchain security, but stressed that the technology must be deployed and evaluated with sound methodology to realize its potential.

"The question isn't whether AI will transform smart contract security — it will. The question is whether the data and benchmarks we use to build and evaluate these tools are held to the same standard as the contracts they're meant to protect."