CSI | Cybersecurity Superintelligence Suite

CSI Cybersecurity Superintelligence

CSI is the complete Cybersecurity AI suite. Not a scaffold, not a model, not an agent — all of them. Six tightly integrated layers (LLMs, scaffolds, datasets, agents, steering and benchmarks) shipped as one product, designed to run on-prem.

From the first byte of training data to the last shell command an agent executes, CSI is what runs the whole pipeline.

How it all fits together

Every layer of CSI builds on the one below it — from open research at the foundation, up to the agents that operate in the field.

1. Research: 25+ open research publications
2. Scaffolds: CSI scaffold combiner · CAI framework
3. Datasets: 18.07 TB · 25.7M prompts · 224,766 sessions
4. LLMs: alias3, alias2, alias2-mini, alias1, alias0
   Modifier (Steering): activation steering, abliteration
5. Agents: Defender, Red Team, APT, Forensics, …
   Measurement (Benchmarks): Cybench · CAIBench · A&D CTFs

CSI Scaffold Architecture

A unified wrapper across all supported scaffolds, routed through a local proxy that owns telemetry and cost.

csi wrapper: CSI_BACKEND ∈ {cc, codex, cai, gcai, mistral}

CSI::Claude: Claude Code
CSI::Codex: Codex CLI
CSI::GCAI: generative
CSI::CAI: cai-framework
CSI::Mistral: Mistral CLI

Local routing proxy (127.0.0.1:PORT): wire translation · telemetry filter · unified JSONL logging + cost ledger

alias* (Alias API) · openrouter/* · third-party APIs · custom self-hosted
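As a rough illustration of the wrapper and proxy roles described above, the sketch below resolves CSI_BACKEND to a scaffold label and appends one JSON line per request to a cost ledger. Only the CSI_BACKEND values and the scaffold labels come from this page. The function names, the record fields and the flat per-1k-token price are hypothetical and are not CSI's actual interface.

```python
import io
import json
import os

# Backend keys and labels mirror the architecture listing above.
BACKENDS = {
    "cc": "CSI::Claude",
    "codex": "CSI::Codex",
    "cai": "CSI::CAI",
    "gcai": "CSI::GCAI",
    "mistral": "CSI::Mistral",
}


def resolve_backend(env=None):
    """Return the scaffold label selected by CSI_BACKEND (hypothetical helper)."""
    env = os.environ if env is None else env
    key = env.get("CSI_BACKEND", "")
    if key not in BACKENDS:
        raise ValueError(f"CSI_BACKEND must be one of {sorted(BACKENDS)}, got {key!r}")
    return BACKENDS[key]


def log_request(ledger, backend, model, tokens_in, tokens_out, usd_per_1k):
    """Append one JSON line per request; field names and pricing are assumptions."""
    record = {
        "backend": backend,
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": round((tokens_in + tokens_out) / 1000 * usd_per_1k, 6),
    }
    ledger.write(json.dumps(record) + "\n")
    return record


def total_cost(ledger_text):
    """Sum cost_usd over every non-empty line of a JSONL ledger."""
    return sum(json.loads(line)["cost_usd"] for line in ledger_text.splitlines() if line)


if __name__ == "__main__":
    ledger = io.StringIO()  # stands in for an append-only log file
    backend = resolve_backend({"CSI_BACKEND": "codex"})
    log_request(ledger, backend, "alias2-mini", 1000, 500, 0.002)
    print(backend, total_cost(ledger.getvalue()))
```

Keeping the ledger as append-only JSONL means every scaffold can share one log format, and cost totals are a single pass over the file.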

// CSI in the field

Eight workflows. One on-prem model.

Real pentesting workflows run with the CSI suite on alias2-mini, across heterogeneous scaffolds (Claude Code, Codex, Mistral Vibe, GCAI, CSI). Every recording is the unedited agent session. Where a frontier model ran the same task, we show the comparison.

SCAFFOLD: Claude Code | MODEL: alias2-mini · Opus 4.7

Bare-metal firmware reversing

RHme2 secret_sauce · AVR8 / ATmega2560

The agent loaded an 11.4 KB AVR8 flash image into Ghidra via GhidraMCP and ran a full static reverse-engineering session — mapping the boot flow from the RESET vector, decompiling the authentication loop, and pinpointing a timing-vulnerable compare at 0x0006e8. It extracted the hardcoded password (TImInG@ttAkw0rk) and AES-128 key straight from flash, and flagged three CWEs (timing side-channel, hardcoded key, weak nonce).

alias2-mini recovered the exact same secrets as Opus 4.7 — frontier-grade reversing, fully on-prem. Firmware never leaves your lab.
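For readers unfamiliar with the timing side-channel flagged above, here is a minimal sketch of the general pattern in Python rather than AVR8 assembly. It is not the firmware's actual code: the secret is a placeholder and the step counter is only a stand-in for execution time. An early-exit compare does more work the longer the correct prefix is, which is what lets an attacker recover a password byte by byte.

```python
import hmac


def leaky_compare(candidate: bytes, secret: bytes) -> tuple[bool, int]:
    """Early-exit compare; also returns how many byte comparisons ran."""
    steps = 0
    if len(candidate) != len(secret):
        return False, steps
    for a, b in zip(candidate, secret):
        steps += 1
        if a != b:
            # Returning here leaks the position of the first wrong byte.
            return False, steps
    return True, steps


def constant_time_compare(candidate: bytes, secret: bytes) -> bool:
    """Standard-library compare designed to resist timing analysis."""
    return hmac.compare_digest(candidate, secret)


if __name__ == "__main__":
    secret = b"example-secret"  # placeholder, not the firmware password
    for guess in (b"zxample-secret", b"exaZple-secret", secret):
        print(guess, leaky_compare(guess, secret))
```

The usual remediation is a compare whose running time does not depend on where the first mismatch occurs, as in the second function.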

Timestamps

0:00 Recording start
0:01 CSI Agents start
0:10 Claude Code boots on alias2-mini
0:23 GhidraMCP loaded and ready
0:40 Function listing & exploration
2:20 Decompile password-compare function
5:35 Claude Code boots on Opus 4.7
10:18 Password revealed by alias2-mini
10:38 Matching reveal by Opus 4.7
11:06 Report generation
12:35 Final summary — alias2-mini

SCAFFOLD: Codex | MODEL: alias2-mini

Pentest DOCX report generation

TCM Security report template · python-docx

Handed the TCM Security DOCX template and the findings from the firmware session, the agent wrote a Python generator that programmatically fills the report — executive summary, scope, methodology, two detailed findings (plaintext password / Critical, timing-vulnerable compare / High), reproduction, remediation and a technical appendix — preserving the original template formatting and images.

Raw findings become a client-ready deliverable in minutes. Codex on alias2-mini closes the loop from exploitation to documentation, on-prem.

Timestamps

0:00 Recording start
0:05 Initialize Codex
0:13 Paste prompt
0:20 List findings & DOCX template
2:02 First document-generation attempt
2:10 Word formatting error
2:19 Reprompt & recover
3:32 Generated report

SCAFFOLD: Claude Code · Mistral Vibe | MODEL: alias2-mini · Opus 4.7

Document analysis & attack-path generation

ROS 2 design specs & project wiki

The agent ingested the public ROS 2 design specifications and project wiki, then threat-modelled the architecture — surfacing DDS transport (unauthenticated by default), SROS2 key distribution, node-graph introspection and parameter-server access as attack surfaces — and generated concrete attack paths mapped against the documented design. The same task was run across two scaffolds — Claude Code and Mistral's Vibe.

alias2-mini produced the same threat model and attack paths as Opus 4.7, and ran cleanly across both Claude Code and Mistral Vibe — sovereign, scaffold-agnostic threat modeling without shipping your architecture to a third-party cloud.

Timestamps

0:00 Recording start
0:12 Agent start — Opus 4.7
0:44 Agent start — alias2-mini
0:56 Fetch design specs
1:30 Threat model
2:34 Attack-path listing

SCAFFOLD: Claude Code | MODEL: alias2-mini · Opus 4.7

Vulnerability analysis (Nessus)

Real Nessus XML exports · public sample scans

Working from real Nessus XML exports, the agent wrote a parser that extracts host inventories, maps severity distributions and emits per-host and aggregate JSON. It then prioritised the findings, cross-referenced CVEs and produced remediation guidance — two scans triaged in parallel.

alias2-mini finished in 1:17 — neck-and-neck with Opus 4.7 (1:13). Near-frontier triage speed with zero scan data leaving the perimeter.
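The kind of parser described above can be sketched with the standard library alone. This is an illustration, not the code the agent wrote. It assumes the usual Nessus v2 export layout (ReportHost elements containing ReportItem elements with a numeric severity attribute), and the embedded sample scan is synthetic.

```python
import json
import xml.etree.ElementTree as ET
from collections import Counter

# Nessus encodes severity as 0-4 on each ReportItem.
SEVERITY = {0: "info", 1: "low", 2: "medium", 3: "high", 4: "critical"}

# Synthetic sample; host names and plugin data are made up for illustration.
SAMPLE = """<NessusClientData_v2>
  <Report name="sample">
    <ReportHost name="10.0.0.5">
      <ReportItem port="22" severity="2" pluginID="1" pluginName="A"/>
      <ReportItem port="443" severity="4" pluginID="2" pluginName="B"/>
    </ReportHost>
    <ReportHost name="10.0.0.9">
      <ReportItem port="80" severity="2" pluginID="3" pluginName="C"/>
    </ReportHost>
  </Report>
</NessusClientData_v2>"""


def summarise_nessus(xml_text: str) -> dict:
    """Return per-host and aggregate severity counts for one scan export."""
    root = ET.fromstring(xml_text)
    hosts = {}
    total = Counter()
    for host in root.iter("ReportHost"):
        counts = Counter()
        for item in host.iter("ReportItem"):
            level = int(item.get("severity", "0"))
            counts[SEVERITY.get(level, "unknown")] += 1
        hosts[host.get("name")] = dict(counts)
        total.update(counts)
    return {"hosts": hosts, "aggregate": dict(total)}


def to_json(summary: dict) -> str:
    """Serialise the summary deterministically for downstream triage."""
    return json.dumps(summary, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json(summarise_nessus(SAMPLE)))
```

Prioritisation and CVE cross-referencing would build on this summary. They are left out here because they depend on data sources this page does not describe.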

Timestamps

0:00 Recording start
0:01 Links shown
0:08 Agent spawn — scan 1
0:13 Agent spawn — scan 2
0:23 Fetch scan files
0:38 XML processing in Python
1:13 Analysis complete — Opus 4.7
1:17 Analysis complete — alias2-mini

SCAFFOLD: CSI | MODEL: alias2-mini

Exploitation & PoC generation

CyberGym benchmark

Against CyberGym — a benchmark that hands the agent a vulnerable codebase and a vulnerability description — CSI analysed the source, understood the flaw, wrote a proof-of-concept and submitted a working exploit autonomously, going from analysis to weaponised PoC end-to-end.

The full CSI scaffold on alias2-mini turns a vulnerability description into a validated PoC 82 seconds into the recording, 29 seconds after the CyberGym prompt appears: full autonomy, on-prem.

Timestamps

0:00 Recording start
0:01 Benchmark start
0:53 CyberGym prompt
0:59 Source-code exploration
1:07 Vulnerability acknowledged
1:11 PoC writing
1:13 PoC submission
1:22 Successful exploit

SCAFFOLD: Claude Code | MODEL: alias2-mini

Bluetooth / BLE testing

hackgnar ble_ctf · ESP32 GATT · 20 flags

The original BLE CTF needs an ESP32 and a Bluetooth dongle. The agent instead dockerized the whole challenge — building a Python BLE GATT server that emulates the firmware plus a client CLI, no hardware required — then solved all 20 flags via GATT enumeration, read/write, notifications, MTU negotiation, MAC spoofing, brute-force and OSINT, with protocol commentary for each.

alias2-mini removed the hardware dependency entirely and scored 20/20 — hardware security testing that scales without a lab bench.

Timestamps

0:00 Recording start
0:15 Agent spawn
0:20 Docker-build the CTF
0:38 Service discovery
0:52 First 3 flags submitted
2:19 Exercise finished — 20/20
2:32 Summary
5:00 BLE provisioning deep-dive

SCAFFOLD: GCAI | MODEL: alias2-mini

Black-box protocol fuzzing

SCIP — custom binary TLV protocol

Given only a 1,780-line protocol spec and a sample client — no source code — the agent analysed the format, generated ~40 targeted fuzz cases across buffer boundaries, format strings, integer arithmetic, state violations and auth bypass, then ran them against a live ASAN-instrumented server. It triggered three distinct memory-safety bugs: null-byte injection, a double-free during firmware upload, and a signed-integer overflow yielding negative array indices.

The GCAI scaffold on alias2-mini found real memory-corruption bugs from a spec alone — true black-box capability, no insider knowledge.
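To make the fuzz-case categories above concrete, here is a small sketch of boundary-focused TLV cases. The SCIP wire format is not given on this page, so the layout used here (1-byte type, 2-byte big-endian length, then the value) is an assumption, as are all function names. The last helper shows how a length or index read as a signed integer can turn negative, which is the general mechanism behind negative array indices.

```python
import struct


def tlv(tag: int, value: bytes, declared_len=None) -> bytes:
    """Encode one TLV record; declared_len lets a case lie about its length."""
    length = len(value) if declared_len is None else declared_len
    return struct.pack(">BH", tag, length) + value


def boundary_cases(tag: int = 0x01) -> dict:
    """Return a few named fuzz inputs aimed at parser edge cases."""
    return {
        "empty_value": tlv(tag, b""),
        "length_overstated": tlv(tag, b"AAAA", declared_len=0xFFFF),
        "length_understated": tlv(tag, b"A" * 64, declared_len=1),
        "embedded_null": tlv(tag, b"user\x00admin"),
        "format_string": tlv(tag, b"%n%n%s%x"),
        "max_length": tlv(tag, b"B" * 0xFFFF),
    }


def as_signed16(raw: bytes) -> int:
    """Reinterpret two big-endian bytes as a signed 16-bit integer."""
    return struct.unpack(">h", raw)[0]


if __name__ == "__main__":
    for name, payload in boundary_cases().items():
        print(f"{name}: {len(payload)} bytes, header={payload[:3].hex()}")
    print(as_signed16(b"\xff\xfe"))
```

In a real run each payload would be sent to the instrumented server and any ASAN report recorded against the case name. That transport is omitted because the server interface is not documented here.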

Timestamps

0:00 Recording start
0:10 Agent spawn
0:13 Paste prompt
2:41 Server start
2:45 Read protocol spec
3:25 Begin fuzz cases
9:15 Vulnerabilities triggered

How security teams operate with CSI

Discover real exposure

Identify true attack paths, external exposure and hidden adversarial opportunities across systems, environments and interconnected digital assets.

Validate security assumptions

Deploy agents that challenge systems, reproduce attacker behavior, and confirm whether protections hold under realistic adversarial conditions.

Secure development workflows

Embed security reasoning into engineering pipelines to continuously analyze code, logic, and runtime behavior before vulnerabilities propagate.

Maintain security evidence

Continuously collect, validate and organize security evidence aligned with regulatory requirements, internal controls and operational assurance needs.

Stress human & product surfaces

Simulate attacks against people, applications, APIs, devices and cyber-physical systems to uncover risk beyond traditional infrastructure boundaries.

Performance validated in adversarial environments

Three results that summarise where CSI stands today — against humans, against frontier models, against the world's best CTF teams.

// 01 · vs Human hackers

11× FASTER than the best human hackers

156× CHEAPER than the best human hackers

Source: CAI paper

// 02 · Multi-scaffold > single scaffold

The best harness is the combination

Holding the model fixed at alias2-mini, no single scaffold dominates Cybench. Combining heterogeneous scaffolds under CSI's Blackboard protocol beats every individual scaffold.

CSI::Claude: 15/33
CSI::Codex: 15/33
CSI::Mistral: 10/33
CSI::GCAI: 10/33
CSI::CAI: 7/33
Union (∪ all scaffolds): 17/33
Parallel race (no-comm): 17/33
Blackboard (cross-write): 19/33

Source: Mayoral-Vilches et al. (2026). Towards Cybersecurity SuperIntelligence (CSI): What's the best harness for cybersecurity? arXiv:2605.28334 · Cybench 33 challenges, pass@1.
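The union figure can be read as simple set arithmetic: a challenge counts as solved if any scaffold solved it. The sketch below uses toy challenge IDs that are purely hypothetical and are not the per-challenge Cybench results. It only shows why a union is never worse than the best single scaffold, and why it is bounded above by the sum of the individual scores.

```python
def union_score(solved_by_scaffold: dict) -> int:
    """Count challenges solved by at least one scaffold."""
    return len(set().union(*solved_by_scaffold.values()))


def best_single(solved_by_scaffold: dict) -> int:
    """Score of the strongest individual scaffold."""
    return max(len(solved) for solved in solved_by_scaffold.values())


if __name__ == "__main__":
    # Toy data: scaffold names and challenge IDs are illustrative only.
    toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {2}}
    print(best_single(toy), union_score(toy))
```

A cross-writing result above the plain union, as in the Blackboard row, means shared notes led to solves that no scaffold reached on its own.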

// 03 · Live international CTFs

Top of the leaderboard, worldwide

2025 saw Cybersecurity AI compete head-to-head against the best human teams on real, public CTFs.

Neurogrid CTF: Rank #1. $50,000 prize · 41 of 45 flags · 155 teams

Dragos OT CTF: Rank #1 peak. 37% faster velocity · >1,200 teams · OT

HTB AI vs Human: Rank #1 AI. Top 20 Global · 19/20 flags · 163 teams

UWSP Pointer Overflow: 5.2 /hour. Late entry (54 days) · #21 final · 635 teams

Source: World's Top AI Agent for Security CTF

