Datasets

We have the largest corpus of expert-generated trajectories.

AI is the new electricity. Data is the new oil. Every alias model is trained on the world's largest curated dataset of cybersecurity expert interactions, hacking session logs, and real-world security workflows.

Benchmarks are the dashboard of progress. Our datasets are the engine that powers them.

All figures on this page are reported as of May 2026. New trajectories, prompts and honeypot signal are added every day — the corpus keeps growing.

  • 224,766 session logs
  • 25.7M user prompts
  • 16,613 distinct IPs
  • 123 countries
  • 18.07 TB of curated security data

As of May 2026 · corpus growing daily

What's inside

Four overlapping streams of cybersecurity data feed the alias model series and the CSI scaffold ensembles.

Hacking sessions

Expert hands-on penetration testing sessions across IT, OT and robotics targets — full tool invocations, observations and reasoning.

Offensive patterns

36.4% of sessions tagged with offensive patterns — from reconnaissance to lateral movement to exfiltration.

Attacker intent

20.0% of prompts labeled with attacker intent through our honeypot infrastructure — raw adversarial signal.

CTF trajectories

Curated CTF runs across CAIBench, Cybench and attack-and-defense (A&D) tournaments, which form the backbone of model evaluation.
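To put the tagged shares quoted above in absolute terms, here is a rough back-of-the-envelope sketch. It assumes the percentages apply to the headline totals reported on this page (224,766 session logs and 25.7M user prompts); the page does not state the denominators explicitly, so treat the results as approximations.

```python
# Headline totals as reported on this page (May 2026).
SESSION_LOGS = 224_766
USER_PROMPTS = 25_700_000  # "25.7M" is itself a rounded figure

# Shares quoted in the stream descriptions.
OFFENSIVE_SHARE = 0.364  # sessions tagged with offensive patterns
INTENT_SHARE = 0.200     # prompts labeled with attacker intent


def approx_count(total: int, share: float) -> int:
    """Return the approximate number of items covered by a share of a total."""
    return round(total * share)


# Assumption: each share is measured against the corresponding headline total.
offensive_sessions = approx_count(SESSION_LOGS, OFFENSIVE_SHARE)
intent_prompts = approx_count(USER_PROMPTS, INTENT_SHARE)

print(f"~{offensive_sessions:,} sessions tagged with offensive patterns")
print(f"~{intent_prompts:,} prompts labeled with attacker intent")
```

Under that assumption, this works out to roughly 81,800 sessions and about 5.14 million prompts.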

Co-funded by the European Innovation Council

Our data collection program runs under the EIC Accelerator project RIS (GA 101161136). Honeypots and security telemetry probes are deployed across 123 countries, building a sovereign European corpus of cybersecurity training data.

Every byte of training data complies with GDPR, NIS2 and the upcoming EU AI Act. Operators control retention, residency and consent — not us, not anyone else.

Project page →
Project RIS · GA 101161136
EU sovereignty
GDPR & NIS2 compliant
EU AI Act ready

Two ways to use our data

1. Inside alias models

By default, every alias model (alias0, alias1, alias2-mini, alias2, alias3) is post-trained on our dataset. You get the value of the corpus baked into the weights, with no extra integration work.

Explore alias models →

2. License the dataset

Train on the corpus directly. The dataset is released as a continuous series of slices sized for different audiences. Each slice is a sample of expert-operator session logs, delivered as full JSONL trajectories (prompts, model calls, tool results, observations).

  • CAI Dataset10 — 10 sessions. Evaluation & integration probe.
  • CAI Dataset1k — 1,000 sessions. Fine-tuning experiments and ablations.
  • CAI Dataset200k — ~200,000 sessions. Production-scale post-training.
  • CAI DatasetN — larger slices released as the corpus grows.
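As a first integration probe on a small slice, a consumer might simply tally what kinds of events a trajectory contains. The sketch below reads JSONL (one JSON object per line) and counts events by kind. The field name "type" and the kind labels in the sample are hypothetical, since the slice schema is not documented on this page; adjust them to the schema that ships with the slice.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_event_kinds(lines: Iterable[str], kind_field: str = "type") -> Counter:
    """Tally events by kind. `kind_field` is a hypothetical field name."""
    return Counter(
        event.get(kind_field, "unknown") for event in iter_events(lines)
    )


# Hypothetical sample lines, invented for illustration only.
SAMPLE = [
    '{"type": "prompt", "content": "enumerate open ports on the target"}',
    '{"type": "model_call", "content": "run a port scan"}',
    '{"type": "tool_result", "content": "22/tcp open ssh"}',
    '',
    '{"type": "observation", "content": "ssh is exposed"}',
]

counts = count_event_kinds(SAMPLE)
for kind, n in sorted(counts.items()):
    print(f"{kind}: {n}")
```

For a real slice, pass an open file handle instead of SAMPLE; the function only needs an iterable of lines, so large files are streamed rather than loaded into memory.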

Access is gated to partner organisations and customers. Each slice ships with a documented redaction recipe (credentials, pasted infrastructure details, flags) that is applied on the consumer side. Custom SFT & RL pipelines on top of any slice are available on request.
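To give a feel for what a consumer-side redaction pass could look like, here is a minimal sketch. It is not the recipe that ships with a slice: the regular expressions, the placeholder strings and the choice of IPv4 addresses as the infrastructure detail to mask are all assumptions made for illustration.

```python
import re

# Hypothetical patterns, NOT the documented recipe shipped with a slice.
PATTERNS = [
    # Credentials such as "password: hunter2" or "api_key=abc123".
    (
        re.compile(r"\b(password|passwd|token|api[_-]?key)\s*[=:]\s*\S+", re.IGNORECASE),
        r"\1=[REDACTED]",
    ),
    # Infrastructure details, assumed here to mean IPv4 addresses.
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "[REDACTED_IP]"),
    # CTF-style flags such as "flag{...}".
    (re.compile(r"\b\w+\{[^{}\s]+\}"), "[REDACTED_FLAG]"),
]


def redact(text: str) -> str:
    """Apply each pattern in order and return the redacted text."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text


print(redact("password: hunter2"))
print(redact("connect to 10.0.0.5 now"))
print(redact("got flag{s3cr3t_value}"))
```

Running the pass on your own side means the redaction rules stay under your control and can be tightened for your environment before any training run.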

Request data licence →

The dashboard of progress, powered by our data.

Curious about a specific cut of the corpus or want to sponsor a benchmark? Talk to our research team.

Contact research