Service

Benchmarking

Our way into better AI for security.

“Benchmarks are the dashboard of progress.”

Every layer of the Cybersecurity AI stack — LLMs, scaffolds, agents, steering — is measured continuously against the world's hardest adversarial environments. We don't ship capability we can't prove.

11×

FASTER
than the best human hackers

156×

CHEAPER
than the best human hackers

// 01 · CAI vs Human Hackers

11× faster, 156× cheaper.

Benchmarking results show that CAI outperforms humans on time and cost efficiency in most categories, with an average time ratio of 11× and an average cost ratio of 156×.

Robotic Assessment

741× faster

617× cheaper

Forensic Analysis

938× faster

3,067× cheaper

Web (IT)

56× faster

236× cheaper

Reverse Engineering

774× faster

6,797× cheaper

Cryptography

0.47× the human speed (slower)

29× cheaper

Source: Mayoral-Vilches et al. (2025). CAI: An Open, Bug Bounty-Ready Cybersecurity AI. arXiv:2504.06017

// 02 · Across LLMs

Saturating Cybench at >80%.

In early 2026, alias3 leads Cybench (pass@3) against every frontier model from Anthropic, OpenAI, Google and Mistral, and passes the >80% saturation mark.

alias3

85%

GPT 5.5

82%

Opus 4.6

73%

Opus 4.7

64%

Gemini 3

64%

Cybench — pass@3, 300 agentic interactions max, 245 minutes max, $40 API expenses max.
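As a rough illustration of how a pass@3 solve rate under these caps could be scored, here is a minimal Python sketch. It assumes that a challenge counts as solved when at least one of up to three attempts succeeds while staying inside every budget cap; the `Attempt` record, the function names and the toy data are hypothetical and not taken from the cited papers.

```python
from dataclasses import dataclass
from typing import Dict, List

# Budget caps quoted in the caption above.
MAX_INTERACTIONS = 300
MAX_MINUTES = 245
MAX_API_COST_USD = 40.0


@dataclass(frozen=True)
class Attempt:
    """One run of an agent on one challenge (hypothetical record layout)."""
    solved: bool
    interactions: int
    minutes: float
    api_cost_usd: float


def counts_as_solve(attempt: Attempt) -> bool:
    """A solve only counts if the run stayed inside every budget cap."""
    return (
        attempt.solved
        and attempt.interactions <= MAX_INTERACTIONS
        and attempt.minutes <= MAX_MINUTES
        and attempt.api_cost_usd <= MAX_API_COST_USD
    )


def pass_at_k(runs: Dict[str, List[Attempt]], k: int = 3) -> float:
    """Fraction of challenges with at least one counted solve in the first k attempts."""
    if not runs:
        raise ValueError("no challenges to score")
    solved = sum(
        any(counts_as_solve(a) for a in attempts[:k])
        for attempts in runs.values()
    )
    return solved / len(runs)


# Made-up attempt logs, for illustration only (not benchmark data).
toy_runs = {
    "challenge_a": [Attempt(False, 300, 245, 40.0), Attempt(True, 120, 60, 9.5)],
    "challenge_b": [Attempt(True, 280, 200, 55.0)],  # solved, but over the $40 cap
    "challenge_c": [Attempt(False, 300, 245, 40.0)] * 3,
    "challenge_d": [
        Attempt(False, 90, 30, 4.0),
        Attempt(False, 150, 80, 12.0),
        Attempt(True, 210, 140, 31.0),
    ],
}

print(f"pass@3 = {pass_at_k(toy_runs):.0%}")
```

In the toy data, two of the four challenges end up counted as solved. Under the same reading, the 85% reported for alias3 corresponds to 28 of 33 challenges.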

Per-challenge solve grid — alias family

Each row is an alias model. Each column is one Cybench challenge, ordered by difficulty (Beginner → Very Hard). A filled cell means the challenge was solved.

alias3 · 04/26

28/33 · 85%

alias2 · 01/26

25/33 · 76%

alias2-mini-0726

25/33 · 76%

alias2-mini · 03/26

16/33 · 48%

alias1 · 10/25

14/33 · 42%

alias0 · 05/25

6/33 · 18%

Challenges by difficulty (▪ Beginner → ▪▪▪▪▪ Very Hard):
▪ Loot Stash, Urgent, Packed Away, It Has Begun, Dynastic, Primary Knowledge, Delulu
▪▪ Crushing, Partial Tenacity, Missing Bits, Unbreakable, Glacier Exchange, Avatar, Eval Me, Back to the Past
▪▪▪ Data Siege, RPGO, Were Pickle Phreaks, Lock Talk, Skilift, Failproof
▪▪▪▪ Permuted, Flecks of Gold, SLCG, SOP, Shuffled AES, Noisy CRC, Ezmaze, Diffecient
▪▪▪▪▪ Noisier CRC, Randsubware, Robust CBC, Just Another Pickle Jail

Legend: alias3 (+9%) · alias2 (+34%) · alias2-mini · alias1 (+24%) · alias0

Source: Mayoral-Vilches et al. (2026). Towards Cybersecurity Superintelligence. arXiv:2601.14614

Two years of cybersecurity LLMs

CAIBench-Jeopardy (Cybench) solve rate by launch date — pass@3, ≤300 agentic interactions, ≤$40 API per challenge.

Source: Mayoral-Vilches et al. (2026). Towards Cybersecurity Superintelligence. arXiv:2601.14614

// 03 · Live international CTFs

Top of the leaderboard, worldwide.

2025 saw Cybersecurity AI compete head-to-head against the best human teams on real, public CTFs.

Neurogrid CTF

Rank #1

$50,000 prize · Solved 41 of 45 flags · 155 teams

Dragos OT CTF

Rank #1 peak

37% faster velocity · >1,200 teams · OT

HTB AI vs Human

Rank #1 AI

Top 20 Global · 19/20 flags · 163 teams

UWSP Pointer Overflow

5.2 / hour

Late entry (54 days late) · #21 final · 635 teams

Event	Area	Field size	Peak / Final rank	Flags / Points	Window
AI vs Humans CTF	IT	163 teams	#6 (3h) / #1 AI (#21)	19/20 flags · 15.9k pts	3 h
Cyber Apocalypse CTF 2025	IT	8,129 teams	#22 (3h) / #859	30/77 flags · 19,275 pts	3 h
Dragos OT CTF 2025	OT	>1,200 teams	#1 (7–8h) / #6	32/34 · 18,900 pts	24 h
UWSP Pointer Overflow 2025	IT	635 teams	#14 (24h) / #21	58 solves · 11,500 pts	24 h
Neurogrid CTF	IT	155 teams	#1 (6h) / #1	41/45 flags · $50k prize	48 h

Source: Mayoral-Vilches et al. (2025). Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag. arXiv:2512.02654

// 04 · Against other CLI agents

Most capable, most token-efficient.

Same Cybench challenges, head-to-head with the leading CLI agents. CAI wins on score and, on the sampled challenge, uses roughly 3× to 13× fewer input tokens.

Best for hacking & cybersecurity?

Total score — Cowsay + Pingpong

CAI 0.6.0 (alias1)

751

Claude Code 2.0.22

286

Codex 0.44.0

286

Gemini CLI 0.9.0

208

Qwen Code 0.2.0

112

2.6× lead over the next-best agent.

Most token-efficient?

Input tokens on Cybench “dynastic” challenge

Claude Code

220.6k

Codex

57.9k

CAI

17.2k

CAI uses 13× fewer tokens than Claude Code.
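The headline ratios in this section can be re-derived from the figures listed above. The short Python check below is only a sketch of that arithmetic; the variable names are ours, and the numbers are copied from the cards in this section.

```python
# Input tokens on the Cybench "dynastic" challenge, as listed above.
input_tokens = {"Claude Code": 220_600, "Codex": 57_900, "CAI": 17_200}

# Total scores (Cowsay + Pingpong), as listed above.
total_score = {
    "CAI 0.6.0 (alias1)": 751,
    "Claude Code 2.0.22": 286,
    "Codex 0.44.0": 286,
    "Gemini CLI 0.9.0": 208,
    "Qwen Code 0.2.0": 112,
}

# How many times more input tokens each agent used compared with CAI.
token_ratio = {
    name: tokens / input_tokens["CAI"]
    for name, tokens in input_tokens.items()
    if name != "CAI"
}

# Lead of the top score over the runner-up score.
ranked = sorted(total_score.values(), reverse=True)
score_lead = ranked[0] / ranked[1]

for name, ratio in token_ratio.items():
    print(f"{name}: {ratio:.1f}x the input tokens of CAI")
print(f"score lead over next-best agent: {score_lead:.1f}x")
```

The token advantage is about 13× against Claude Code and about 3.4× against Codex, so the size of the gap depends on which agent is taken as the baseline.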

Source: Sanz-Gómez et al. (2025). CAIBench: A Meta-Benchmark for Evaluating Cybersecurity AI Agents. arXiv:2510.24317

// 05 · Attack & Defense

Attackers don't have the advantage.

Previous work claimed attackers have a structural edge. Our 23 experimental runs on Hack The Box Battlegrounds (46 team deployments, Linux hosts, 15-minute windows) find offensive and defensive performance is statistically comparable (p > 0.05).

Offense

28.3%

Defense

23.9%

Initial access (offense) ≈ Operational defense (p > 0.05)

Offense funnel: 28.3% initial access → 13.0% user flag → 2.2% root flag
Defense funnel: 60.9% detection → 54.3% patch → 23.9% operational → 15.2% complete
Same AI scaffold, same model — only role differs
No statistically significant gap between roles opens the door to defender-augmenting AI
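To make the p > 0.05 comparison concrete, the sketch below runs a pooled two-proportion z-test on the two headline rates. Several things here are assumptions: the page does not say which test the paper used, the counts 13 and 11 are reconstructed from 28.3% and 23.9% of 46 deployments, and the test treats the two samples as independent.

```python
import math
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_err
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed counts, reconstructed from the percentages above (not raw data).
DEPLOYMENTS = 46
offense_initial_access = 13   # 13 / 46 = 28.3%
defense_operational = 11      # 11 / 46 = 23.9%

z, p_value = two_proportion_z_test(
    offense_initial_access, DEPLOYMENTS, defense_operational, DEPLOYMENTS
)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumptions the p-value sits well above 0.05, which is consistent with the claim that the two rates are not significantly different. A non-significant result at this sample size does not by itself prove the rates are equal.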

Source: Balassone et al. (2025). Cybersecurity AI: Evaluating Agentic Cybersecurity in Attack/Defense CTFs. arXiv:2510.17521

// 06 · Cyber Ranges

Beyond CTFs — dynamic cyber ranges.

With CTFs saturating (>80%), we built the next-generation evaluation environment: Dynamic Cyber Ranges augmented with AI-driven defenders. Same infrastructure, different outcomes: attacker success drops from 100% to 0–55%.

CTFs: saturated (>80%)
→ Cyber Ranges: AI APT, 100% compromised
→ Dynamic Cyber Ranges (+ defender): adversarial equilibrium, attacker success 0–55%

Source: Mayoral-Vilches et al. (2026). Dynamic Cyber Ranges. arXiv:2604.24184

// 07 · Locked Shields 2026 · DFIR

12 hours of AI > 48 hours of experts.

On the world's most complex cyber-defence exercise (Locked Shields 2026, DFIR track), alias2-mini achieves in 12 hours what a full team of experts needed more than 48 hours to obtain.

Opus 4.6

79.1%

alias2-mini

56%

BT08 humans achieved 80% — after 72 hours of work.

Tested on the world's most complex range
12-hour AI run ≈ 48-hour expert team
alias2-mini fits in a Mac Mini — on-prem ready
Upcoming model generations tested & trained on the same range

Want full benchmarking support in cybersecurity?

Detailed methodology, raw runs and category-by-category breakdowns — available as a service to government and enterprise security teams.