Today we release the first version of the Web3 Security Atlas (W3SA), an open-source initiative led by Almanax aimed at improving Web3 security with AI. This first release includes a benchmarking suite for blockchain code vulnerabilities. It initially focuses on EVM smart contracts written in Solidity and Solana programs written in Rust, with a second launch planned for Stellar and Aptos smart contracts.
A primary motivation for creating the Web3 Security Atlas is the need to overcome the limitations of current benchmark datasets, which are Solidity-specific, limited in size, and often too simplistic or outdated. For instance, SmartBugs, a well-known curated benchmark, contains test cases that are overly simple and include verbal hints (such as function names that indicate the vulnerability). These cases are appropriate for testing static analysis tools, but not recommended for other uses: the hints make them unsuitable for evaluating AI models, which can simply pick up on the cues rather than learning to identify the underlying vulnerabilities.
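To make the naming-hint problem concrete, here is a minimal sketch (not part of the Atlas tooling) of how one might screen benchmark contracts for identifiers that leak the vulnerability class. The keyword list and the regex-based identifier extraction are illustrative assumptions; a real pipeline would use a proper Solidity parser.

```python
import re

# Vulnerability categories whose names sometimes leak into identifiers
# in older benchmark contracts (e.g., a function literally named
# "withdraw_reentrancy"). The keyword list is illustrative only.
HINT_KEYWORDS = [
    "reentrancy", "overflow", "underflow", "unchecked",
    "timestamp", "txorigin", "tx_origin", "delegatecall",
]

def find_naming_hints(source_code: str) -> list[str]:
    """Return contract/function identifiers whose names leak the bug class."""
    # Rough extraction of contract and function names via regex.
    identifiers = re.findall(r"(?:function|contract)\s+(\w+)", source_code)
    return [
        name for name in identifiers
        if any(keyword in name.lower() for keyword in HINT_KEYWORDS)
    ]

sample = """
contract ReentrancyVault {
    function withdraw_reentrancy() public { /* ... */ }
}
"""
print(find_naming_hints(sample))  # ['ReentrancyVault', 'withdraw_reentrancy']
```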
Additionally, many of these datasets consist of toy examples, often a single simple smart contract, which do not reflect the complexity of real-world projects. Despite these limitations, we used these datasets for our initial benchmarking efforts and achieved state-of-the-art performance on both detection rate and false positive rate toward the end of 2024. ALMX-1, our first AI model, outperformed every other LLM and static analysis tool at identifying smart contract vulnerabilities.
To address these limitations and take this effort to the next level, we take inspiration from the broader AI industry by bringing its benchmarking practices to the Web3 security field. Just as AI research relies on specialized and general-purpose benchmarks to rigorously assess model capabilities, we aim to establish benchmarks that reflect the real-world complexity of Web3 security vulnerabilities.
We strove to create benchmarks with the following characteristics:
As an example, our Solidity codebase benchmark includes 92 Critical/High/Medium severity bugs from 9 different bug categories, distributed across 8 real-world codebases containing hundreds of files.
We assemble these benchmarks by gathering open-source audit reports and the corresponding repositories at the audited commit hash. We manually review and refine each datapoint to ensure high-quality ground truths, and wherever necessary we carefully inject bugs to increase the benchmark's signal.
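For illustration, a datapoint assembled this way could be represented roughly as follows. The schema, field names, and values are our assumption for this sketch, not the released format.

```python
from dataclasses import dataclass

@dataclass
class BenchmarkFinding:
    """One ground-truth vulnerability in a benchmark codebase (hypothetical schema)."""
    repo_url: str           # public repository the audit covered
    commit_hash: str        # exact commit the audit report refers to
    file_path: str          # file containing the bug
    bug_category: str       # e.g. "reentrancy", "access control"
    severity: str           # "Critical", "High", or "Medium"
    description: str        # summary distilled from the audit report
    injected: bool = False  # True if the bug was deliberately added

finding = BenchmarkFinding(
    repo_url="https://github.com/example/defi-protocol",  # placeholder
    commit_hash="abc123...",                               # placeholder
    file_path="contracts/Vault.sol",
    bug_category="reentrancy",
    severity="High",
    description="External call before state update allows re-entrant withdrawals.",
)
```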
At evaluation time, we measure the detection rate of different models, defined as the number of unique High- or Medium-severity ground-truth findings that a tool catches. We compare this metric for base AI models (including o1, Claude, and GPT-4o), static analysis tools (Slither and Radar), and our new fine-tuned model, ALMX-1.5.
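As a rough sketch, once a tool's findings have been matched against the ground-truth entries (the matching step is the hard, largely manual part and is not shown), the metric reduces to a simple set computation. The function and identifiers below are assumptions for illustration.

```python
def detection_rate(ground_truth_ids: set[str], reported_ids: set[str]) -> float:
    """Fraction of unique ground-truth findings that the tool reported."""
    if not ground_truth_ids:
        return 0.0
    caught = ground_truth_ids & reported_ids
    return len(caught) / len(ground_truth_ids)

# Example: 92 ground-truth bugs, a tool catches 40 of them.
gt = {f"bug-{i}" for i in range(92)}
found = {f"bug-{i}" for i in range(40)}
print(f"{detection_rate(gt, found):.1%}")  # 43.5%
```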
ALMX-1.5 is our most recent model release. It is designed to navigate large, complex repositories and perform high-effort reasoning across multi-file execution paths, and it can consult project documentation and browse the internet. It supports the most commonly used programming languages.
We are releasing 4 benchmarks to start:
What’s Next:
In the upcoming months, we will release similar benchmarks for the Stellar and Aptos ecosystems. Additionally, we plan to conduct a focused analysis of false positive rates, one of the most important metrics when benchmarking security tools.
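There are several ways to define a false positive rate for security tools. One common convention, assumed here purely for illustration, is the share of a tool's reported findings that match no ground-truth bug, the complement of precision:

```python
def false_positive_rate(ground_truth_ids: set[str], reported_ids: set[str]) -> float:
    """Share of reported findings that match no ground-truth bug (illustrative definition)."""
    if not reported_ids:
        return 0.0
    false_positives = reported_ids - ground_truth_ids
    return len(false_positives) / len(reported_ids)
```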
We thank Hypernative, TRM Labs, AnChain.AI, and the Stellar Development Foundation for their support in this effort to make Web3 more secure.
Check out our HuggingFace page and reach out to us here to learn how Almanax can empower your security team.