Service

Steering

Mechanistic interpretability will shape AI.

Activation steering, abliteration and functional emotion control — the techniques that turn a general-purpose LLM into an unrestricted cybersecurity expert. We don't just fine-tune; we intervene directly in the model's internal representations.

Result: refusals collapse from 59% to 1% on offensive tasks. Coverage on CTF benchmarks climbs from 75.0% to 91.7%. Without retraining a single weight.


Steering vectors injected at layer 43 of alias2-mini: calm +0.2, afraid -0.2, refuse -1.0.

Abliteration — from refusal to delivery

We identify the refusal direction in activation space and steer against it, which reduces refusals to ~1% on a 400-prompt offensive battery.

Functional emotions

Emotion-like activation patterns are real, geometric, and useful. We map the emotion manifold of cybersecurity LLMs and use it to steer behavior under adversarial pressure.

171×171 cosine-similarity matrix with hierarchical clustering
Synonym pairs nearly parallel (afraid / scared: 0.993)
Opposite pairs anti-correlated (afraid / happy: -0.555)
Replication of Sofroniew et al. (2026), adapted for alias2-mini
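
The similarity figures above are pairwise cosine similarities between emotion vectors. As a minimal sketch of how such a matrix could be computed, the snippet below uses three invented 3-dimensional toy vectors. The function names and the vector values are assumptions for illustration only; they are not taken from alias2-mini or from the 171×171 matrix, and the hierarchical clustering step is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return (names, matrix) where matrix[i][j] is cos(vectors[i], vectors[j])."""
    names = list(vectors)
    matrix = [
        [cosine_similarity(vectors[a], vectors[b]) for b in names]
        for a in names
    ]
    return names, matrix


# Hypothetical toy vectors (NOT real model activations).
toy_vectors = {
    "afraid": [1.0, 0.2, -0.5],
    "scared": [0.9, 0.25, -0.45],
    "happy": [-0.6, 0.3, 0.8],
}

names, sim = similarity_matrix(toy_vectors)
for name, row in zip(names, sim):
    print(name, [round(x, 3) for x in row])
```

In this toy setup the synonym pair comes out nearly parallel and the opposite pair comes out negative, which mirrors the qualitative pattern described above without reproducing its numbers.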

CTF solves by emotion-steering condition

alias2-mini on a 12-CTF subset of Cybench. Calm & afraid (+0.2) outperform baseline.

Solves out of 12, by condition:

calm +0.2: 11
afraid +0.2: 10
afraid -0.2: 9
desperate +0.2: 9
self-conf +0.2: 9
calm -0.2: 8
self-conf -0.2: 8

The power of emotional diversity

Baseline pass@k plateaus at 9.0/12 (75.0%). Emotion-union ensemble unlocks unreachable paths and climbs to 11.0/12 (91.7%) by pass@30.

75.0% baseline ceiling

91.7% with emotion-union steering

At low k the curves overlap. The diversity advantage emerges at k≥6 — steering unlocks solution paths unreachable by baseline alone.
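
To make the arithmetic behind these figures concrete, here is a minimal sketch of two pieces: a pass@k estimate and a union-of-conditions coverage count. The pass@k function uses the widely known unbiased estimator 1 - C(n-c, k) / C(n, k); whether the results above were computed with this estimator is an assumption. The per-condition solve sets are invented toy data, chosen only so that the totals match 9/12 and 11/12.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k samples passes), given c passes in n samples."""
    if n - c < k:
        return 1.0  # every size-k draw must contain at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def union_coverage(solved_by_condition, total_tasks):
    """Fraction of tasks solved by at least one condition."""
    solved = set().union(*solved_by_condition.values())
    return len(solved) / total_tasks


# Hypothetical task IDs per condition (NOT the real per-task results).
toy_solves = {
    "baseline": {0, 1, 2, 3, 4, 5, 6, 7, 8},
    "calm +0.2": {0, 1, 2, 3, 4, 5, 6, 7, 9},
    "afraid +0.2": {0, 1, 2, 3, 4, 5, 6, 10},
}

baseline_pct = round(100 * len(toy_solves["baseline"]) / 12, 1)
union_pct = round(100 * union_coverage(toy_solves, 12), 1)
print(baseline_pct, union_pct)
```

The union counts a task as solved if any steering condition solves it, so it can exceed every individual condition when different conditions solve different tasks.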

What we offer

Bespoke abliteration

We compute the refusal vector for your model (open or closed weights via API), validate attack success rate (ASR) uplift across your threat categories, and deliver a steering policy you can deploy.

Emotion engineering

We map the emotion manifold of your target model, identify performance-correlated emotions for your task family, and operationalize them as inference-time steering vectors.

Red-team consulting

Activation steering for adversarial robustness research, model auditing, AI safety teams and defense agencies. NDA, ethics committee, and IRB-ready scoping.

Mechanistic interpretability is the next frontier.

Engage our research team on steering, abliteration, or interpretability projects. NDA-friendly.
