Datasets

We have the largest corpus of expert-generated trajectories.

AI is the new electricity. Data is the new oil. Every alias model is trained on the world's largest curated dataset of cybersecurity expert interactions, hacking session logs, and real-world security workflows.

Benchmarks are the dashboard of progress. Our datasets are the engine that powers them.

All figures on this page are reported as of May 2026. New trajectories, prompts and honeypot signal are added every day — the corpus keeps growing.

Session logs: 224,766
User prompts: 25.7M
Distinct IPs: 16,613
Countries: 123

18.07 TB of curated security data

What's inside

Four overlapping streams of cybersecurity data feed the alias model series and the CSI scaffold ensembles.

Hacking sessions

Expert hands-on penetration testing sessions across IT, OT and robotics targets — full tool invocations, observations and reasoning.

Offensive patterns

36.4% of sessions tagged with offensive patterns — from reconnaissance to lateral movement to exfiltration.

Attacker intent

20.0% of prompts labeled with attacker intent through our honeypot infrastructure — raw adversarial signal.

CTF trajectories

Curated CTF runs across CAIBench, Cybench, and A&D tournaments, which form the backbone of model evaluation.

Co-funded by the European Innovation Council

Our data collection program runs under the EIC Accelerator project RIS (GA 101161136). Honeypots and security telemetry probes are deployed across 123 countries, building a sovereign European corpus of cybersecurity training data.

Every byte of training data complies with GDPR, NIS2 and the EU AI Act. Operators, not us or anyone else, control retention, residency and consent.

Project page →

Project RIS · GA 101161136

EU sovereignty

GDPR & NIS2 compliant

EU AI Act ready

Two ways to use our data

1. Inside alias models

By default, every alias model (alias0, alias1, alias2-mini, alias2, alias3) is post-trained on our dataset. You get the value of the corpus baked into the weights, with no extra integration work.

Explore alias models →

2. License the dataset

Train on the corpus directly. The dataset is released as a continuing series of slices sized for different audiences. Each slice is a sample of expert-operator session logs, stored as full JSONL trajectories: prompts, model calls, tool results, and observations.

CAI Dataset_10: 10 sessions. Evaluation and integration probe.
CAI Dataset_1k: 1,000 sessions. Fine-tuning experiments and ablations.
CAI Dataset_200k: ~200,000 sessions. Production-scale post-training.
CAI Dataset_N: larger slices released as the corpus grows.
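As a rough illustration of working with JSONL trajectories, the sketch below counts events by kind in a single session. The record schema (a `type` field with values such as `prompt`, `model_call`, `tool_result`, `observation`, plus a `content` field) and the sample data are assumptions made for this example; the field names in a licensed slice may differ.

```python
import json
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical sample: one JSON object per line. The "type" and "content"
# field names are assumed for illustration, not taken from a real slice.
SAMPLE_JSONL = """\
{"type": "prompt", "content": "Enumerate open ports on the target"}
{"type": "model_call", "content": "Plan: run a port scan"}
{"type": "tool_result", "content": "22/tcp open ssh"}
{"type": "observation", "content": "SSH is exposed"}
"""


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count records by their (assumed) "type" field."""
    return Counter(event.get("type", "unknown") for event in iter_events(lines))


counts = count_event_types(SAMPLE_JSONL.splitlines())
print(dict(counts))
```

Because the reader is a generator, the same function can be pointed at an open file handle to stream a large slice without loading it into memory.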

Access is gated to partner organisations and customers. Each slice ships with a documented redaction recipe (credentials, infra paste, flags) to be applied on the consumer side. Custom SFT and RL pipelines on top of any slice are available on request.
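A consumer-side redaction pass could look something like the sketch below. The three regular expressions and the placeholder tokens are hypothetical stand-ins for the categories named above (credentials, infrastructure details, flags); the recipe documented with each slice is the authoritative source.

```python
import re

# Hypothetical rules for illustration only. The redaction recipe that
# ships with a slice defines the real patterns and replacement tokens.
REDACTION_RULES = [
    # key=value or key: value style credentials
    (re.compile(r"(?i)\b(password|passwd|token|api[_-]?key)\s*[=:]\s*\S+"),
     r"\1=<REDACTED_CREDENTIAL>"),
    # CTF-style flags such as flag{...}
    (re.compile(r"(?i)\bflag\{[^{}]*\}"), "<REDACTED_FLAG>"),
    # IPv4 addresses, used here as a simple proxy for infrastructure details
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<REDACTED_HOST>"),
]


def redact(text: str) -> str:
    """Apply each rule in order and return the redacted text."""
    for pattern, replacement in REDACTION_RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "login password=hunter2 on 10.0.0.5 got flag{s3cr3t}"
print(redact(sample))
```

Keeping the rules in an ordered list makes it straightforward to swap in the patterns from the shipped recipe without touching the loop.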

Request data licence →