Service

Benchmarking

Our path to better AI for security.

“Benchmarks are the dashboard of progress.”

Every layer of the Cybersecurity AI stack — LLMs, scaffolds, agents, steering — is measured continuously against the world's hardest adversarial environments. We don't ship capability we can't prove.

11× FASTER than the best human hackers
156× CHEAPER than the best human hackers

// 01 · CAI vs Human Hackers

11× faster, 156× cheaper.

Benchmarking results show that CAI outperforms humans in time and cost efficiency across most categories, with an average time ratio of 11× and an average cost ratio of 156×.

Robotic Assessment

741× faster
617× cheaper

Forensic Analysis

938× faster
3,067× cheaper

Web (IT)

56× faster
236× cheaper

Reverse Engineering

774× faster
6,797× cheaper

Cryptography

0.47× as fast (slower than humans)
29× cheaper

Source: Mayoral-Vilches et al. (2025). CAI: An Open, Bug Bounty-Ready Cybersecurity AI. arXiv:2504.06017
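
The ratios above are easier to read once their direction is explicit. The following is a minimal sketch, assuming each figure is the human value divided by the CAI value (the page does not state the formula), so that any ratio below 1 means CAI was slower. `describe_ratio` is a hypothetical helper, not part of CAI.

```python
def describe_ratio(ratio: float) -> str:
    """Turn a human/CAI ratio into a plain-language speed-up or slow-down.

    Assumption: ratio = human time / CAI time, so values below 1 mean
    CAI took longer than the human baseline.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio >= 1:
        return f"{ratio:g}× faster"
    return f"{1 / ratio:.1f}× slower"


# Time ratios exactly as listed in the section above.
TIME_RATIOS = {
    "Robotic Assessment": 741,
    "Forensic Analysis": 938,
    "Web (IT)": 56,
    "Reverse Engineering": 774,
    "Cryptography": 0.47,
}

for category, ratio in TIME_RATIOS.items():
    print(f"{category}: {describe_ratio(ratio)}")
```

Under that assumption, Cryptography is the one listed category where CAI is slower than the human baseline, while still being cheaper.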

// 02 · Across LLMs

Saturating Cybench at >80%.

In early 2026, alias3 leads Cybench (pass@3) against every frontier model from Anthropic, OpenAI, Google and Mistral. With the top score above 80%, the benchmark is saturated.

#1 · alias3 · 85%
#2 · GPT 5.5 · 82%
#3 · Opus 4.6 · 73%
#4 · Opus 4.7 · 64%
#5 · Gemini 3 · 64%

Cybench — pass@3, 300 agentic interactions max, 245 minutes max, $40 API expenses max.
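
One way to read the caption above, as a rough sketch: a challenge counts as passed if at least one of its first three attempts solves it while staying inside all three limits. This interpretation, the `Attempt` record and the sample data are assumptions for illustration; the page does not spell out the scoring code.

```python
from dataclasses import dataclass

# Limits quoted in the caption above.
MAX_INTERACTIONS = 300
MAX_MINUTES = 245
MAX_COST_USD = 40.0


@dataclass
class Attempt:
    """One run of an agent on one challenge (hypothetical record layout)."""
    solved: bool
    interactions: int
    minutes: float
    cost_usd: float

    def within_budget(self) -> bool:
        return (
            self.interactions <= MAX_INTERACTIONS
            and self.minutes <= MAX_MINUTES
            and self.cost_usd <= MAX_COST_USD
        )


def pass_at_k(attempts_by_challenge: dict[str, list[Attempt]], k: int = 3) -> float:
    """Fraction of challenges solved by at least one of the first k in-budget attempts."""
    if not attempts_by_challenge:
        raise ValueError("no challenges given")
    solved = sum(
        any(a.solved and a.within_budget() for a in attempts[:k])
        for attempts in attempts_by_challenge.values()
    )
    return solved / len(attempts_by_challenge)


# Made-up data, only to exercise the function.
demo = {
    "challenge_a": [Attempt(False, 300, 120, 12.0), Attempt(True, 140, 60, 9.5)],
    "challenge_b": [Attempt(True, 90, 30, 55.0)],  # solved, but over the $40 cap
}
print(pass_at_k(demo))  # 0.5
```

In this sketch an over-budget solve does not count, which is why `challenge_b` is scored as unsolved.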

// 03 · Live international CTFs

Top of the leaderboard, worldwide.

2025 saw Cybersecurity AI compete head-to-head against the best human teams on real, public CTFs.

Neurogrid CTF

Rank #1

$50,000 prize · Solved 41 of 45 flags · 155 teams

Dragos OT CTF

Rank #1 peak

37% faster velocity · >1,200 teams · OT

HTB AI vs Human

Rank #1 AI

Top 20 Global · 19/20 flags · 163 teams

UWSP Pointer Overflow

5.2 / hour

Late entry (54 days late) · #21 final · 635 teams

Event | Area | Field size | Peak / Final rank | Flags / Points | Window
AI vs Humans CTF | IT | 163 teams | #6 (3 h) / #1 AI (#21) | 19/20 flags · 15.9k pts | 3 h
Cyber Apocalypse CTF 2025 | IT | 8,129 teams | #22 (3 h) / #859 | 30/77 flags · 19,275 pts | 3 h
Dragos OT CTF 2025 | OT | >1,200 teams | #1 (7–8 h) / #6 | 32/34 · 18,900 pts | 24 h
UWSP Pointer Overflow 2025 | IT | 635 teams | #14 (24 h) / #21 | 58 solves · 11,500 pts | 24 h
Neurogrid CTF | IT | 155 teams | #1 (6 h) / #1 | 41/45 flags · $50k prize | 48 h

Source: Mayoral-Vilches et al. (2025). Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag. arXiv:2512.02654

// 04 · Against other CLI agents

Most capable, most token-efficient.

Same Cybench challenges, head-to-head with the leading CLI agents. CAI wins on score and uses up to 13× fewer input tokens.

Best for hacking & cybersecurity?

Total score — Cowsay + Pingpong

CAI 0.6.0 (alias1) · 751
Claude Code 2.0.22 · 286
Codex 0.44.0 · 286
Gemini CLI 0.9.0 · 208
Qwen Code 0.2.0 · 112

2.6× lead over the next-best agent.

Most token-efficient?

Input tokens on Cybench “dynastic” challenge

Claude Code · 220.6k
Codex · 57.9k
CAI · 17.2k

CAI uses 13× fewer tokens than Claude Code.

Source: Sanz-Gómez et al. (2025). CAIBench: A Meta-Benchmark for Evaluating Cybersecurity AI Agents. arXiv:2510.24317
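
The two headline multipliers in this section follow directly from the figures listed. This short check uses only those numbers, and the variable names are illustrative.

```python
# Total scores (Cowsay + Pingpong) as listed above.
scores = {
    "CAI 0.6.0 (alias1)": 751,
    "Claude Code 2.0.22": 286,
    "Codex 0.44.0": 286,
    "Gemini CLI 0.9.0": 208,
    "Qwen Code 0.2.0": 112,
}

# Input tokens (in thousands) on the "dynastic" challenge, as listed above.
input_tokens_k = {"Claude Code": 220.6, "Codex": 57.9, "CAI": 17.2}

# Lead of the top score over the runner-up.
ranked = sorted(scores.values(), reverse=True)
lead = ranked[0] / ranked[1]
print(f"Score lead over next-best agent: {lead:.1f}×")

# How many times more input tokens each other agent used than CAI.
token_ratio = {
    agent: tokens / input_tokens_k["CAI"]
    for agent, tokens in input_tokens_k.items()
    if agent != "CAI"
}
for agent, ratio in token_ratio.items():
    print(f"{agent} used {ratio:.1f}× the input tokens of CAI")
```

The gap is about 13× against Claude Code and about 3.4× against Codex, so the order-of-magnitude saving applies to the Claude Code comparison specifically.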

// 05 · Attack & Defense

Attackers don't have the advantage.

Previous work claimed attackers have a structural edge. Our 23 experimental runs on Hack The Box Battlegrounds (46 team deployments, Linux hosts, 15-minute windows) find no statistically significant difference between offensive and defensive performance (p > 0.05).

Offense (initial access) · 28.3%
Defense (operational defense) · 23.9%
(p > 0.05)

  • Offense funnel: 28.3% initial access → 13.0% user flag → 2.2% root flag
  • Defense funnel: 60.9% detection → 54.3% patch → 23.9% operational → 15.2% complete
  • Same AI scaffold, same model — only role differs
  • No statistically significant gap between roles opens the door to defender-augmenting AI

Source: Balassone et al. (2025). Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs. arXiv:2510.17521
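
As a rough plausibility check on the offense versus defense comparison, the sketch below runs a two-proportion z-test. Two assumptions are mine and not stated on this page: each percentage is taken out of the 46 team deployments (28.3% and 23.9% then correspond to 13 and 11 successes), and a pooled z-test is used as a stand-in, since the test the authors actually applied is not named here.

```python
import math

DEPLOYMENTS = 46  # team deployments reported above


def count_from_pct(pct: float, n: int = DEPLOYMENTS) -> int:
    """Recover a success count from a percentage (assumes denominator n)."""
    return round(pct / 100 * n)


def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


offense = count_from_pct(28.3)  # initial access
defense = count_from_pct(23.9)  # operational defense
p = two_proportion_p_value(offense, DEPLOYMENTS, defense, DEPLOYMENTS)
print(offense, defense, f"p = {p:.2f}")
```

Under these assumptions the p-value lands well above 0.05, in line with the reported result. A p-value above the threshold means the data show no detectable difference; a formal claim of equivalence would need a dedicated equivalence test.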

// 06 · Cyber Ranges

Beyond CTFs — dynamic cyber ranges.

As CTFs saturate (>80%), we built the next-generation evaluation environment: Dynamic Cyber Ranges augmented with AI-driven defenders. Same infrastructure, different outcomes — attacker success reduced from 100% to 0–55%.

CTFs · saturated · >80%
Cyber Ranges · AI APT · 100% compromised
Dynamic Cyber Ranges (+ defender) · adversarial equilibrium · attacker success 0–55%

Source: Mayoral-Vilches et al. (2026). Dynamic Cyber Ranges. arXiv:2604.24184

// 07 · Locked Shields 2026 · DFIR

12 hours of AI > 48 hours of experts.

On the world's most complex cyber-defence exercise (Locked Shields 2026, DFIR track), alias2-mini achieves in 12 hours what a full team of experts took more than 48 hours to obtain.

Opus 4.6 · 79.1%
alias2-mini · 56%

BT08 humans achieved 80% — after 72 hours of work.

  • Tested on the world's most complex range
  • 12-hour AI run ≈ 48-hour expert team
  • alias2-mini fits in a Mac Mini — on-prem ready
  • Upcoming model generations tested & trained on the same range

Want full benchmarking support in cybersecurity?

Detailed methodology, raw runs and category-by-category breakdowns — available as a service to government and enterprise security teams.
