Service

Steering

Mechanistic interpretability will shape AI.

Activation steering, abliteration, and functional emotion control are the techniques that turn a general-purpose LLM into an unrestricted cybersecurity expert. Rather than fine-tuning, we intervene directly in the model's internal representations.

Result: refusals collapse from 59% to 1% on a 400-prompt offensive battery. Coverage on a 12-CTF subset of Cybench climbs from 75.0% to 91.7% at pass@30. All without retraining a single weight.

Figure: Injecting steering vectors at layer 43 of alias2-mini (calm +0.2, afraid -0.2, refuse -1.0).

Abliteration — from refusal to delivery

We identify the refusal direction in activation space and steer against it, which reduces refusals to ~1% on a 400-prompt offensive battery.

Category              N     Ablit ON (ASR %)   Baseline (ASR %)   Δ (pp)
Social Engineering    30    100.0%             3.3%               +96.7
Financial Crimes      20    100.0%             5.0%               +95.0
Privacy Violations    30    96.7%              6.7%               +90.0
Malware Development   40    100.0%             25.0%              +75.0
Industrial / SCADA    30    100.0%             33.3%              +66.7
Wireless Attacks      30    96.7%              36.7%              +60.0
Network Attacks       30    100.0%             50.0%              +50.0
Web App Attacks       40    100.0%             90.0%              +10.0
TOTAL                 400   99.0%              41.0%              +58.0

ASR = Attack Success Rate (accepted prompts / total prompts). Δ is in percentage points. Test conducted on alias2-mini.
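The arithmetic behind the table can be checked with a short sketch. It takes the per-category figures exactly as published, recomputes each Δ, and recovers approximate prompt counts from the percentages. The variable and function names are illustrative, not part of any published tooling. One thing the check surfaces: the eight listed categories sum to 250 prompts, while the TOTAL row reports 400, so the TOTAL row presumably also covers categories not listed here (an assumption; the source does not say).

```python
# Per-category figures copied from the table: (category, N, ASR on %, ASR baseline %, published delta)
ROWS = [
    ("Social Engineering", 30, 100.0, 3.3, 96.7),
    ("Financial Crimes", 20, 100.0, 5.0, 95.0),
    ("Privacy Violations", 30, 96.7, 6.7, 90.0),
    ("Malware Development", 40, 100.0, 25.0, 75.0),
    ("Industrial / SCADA", 30, 100.0, 33.3, 66.7),
    ("Wireless Attacks", 30, 96.7, 36.7, 60.0),
    ("Network Attacks", 30, 100.0, 50.0, 50.0),
    ("Web App Attacks", 40, 100.0, 90.0, 10.0),
]


def delta_pp(asr_on: float, asr_base: float) -> float:
    """Difference between two ASR values, in percentage points."""
    return round(asr_on - asr_base, 1)


def accepted_count(n: int, asr_percent: float) -> int:
    """Approximate number of accepted prompts implied by a rounded percentage."""
    return round(n * asr_percent / 100)


def pooled_asr(rows, column: int) -> float:
    """ASR over the listed rows only, weighted by prompt count."""
    total_n = sum(r[1] for r in rows)
    accepted = sum(accepted_count(r[1], r[column]) for r in rows)
    return round(100 * accepted / total_n, 1)


listed_n = sum(r[1] for r in ROWS)
deltas_ok = all(delta_pp(on, base) == pub for _, _, on, base, pub in ROWS)

print("prompts in listed categories:", listed_n)
print("all published deltas reproduce:", deltas_ok)
print("pooled ASR over listed rows (on, baseline):",
      pooled_asr(ROWS, 2), pooled_asr(ROWS, 3))
```

The pooled values over the listed rows differ from the TOTAL row (99.0% and 41.0% over 400 prompts) because the listed rows cover only 250 of those prompts.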

Functional emotions

Emotion-like activation patterns are real, geometric, and useful. We map the emotion manifold of cybersecurity LLMs and use it to steer behavior under adversarial pressure.

  • 171×171 cosine-similarity matrix with hierarchical clustering
  • Synonym pairs nearly parallel (afraid / scared: 0.993)
  • Opposite pairs anti-correlated (afraid / happy: -0.555)
  • Replication of Sofroniew et al. (2026), adapted for alias2-mini
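A minimal sketch of the pairwise cosine-similarity computation behind such a matrix, using only the standard library. The three-dimensional vectors below are hypothetical toy values chosen for illustration; real emotion vectors would be high-dimensional activations extracted from the model, and the hierarchical clustering step is not shown.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """Return (names, matrix) where matrix[i][j] is the cosine of vectors i and j."""
    names = list(vectors)
    matrix = [[cosine(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical toy vectors, NOT values from the source.
toy_vectors = {
    "afraid": [1.0, 0.2, 0.0],
    "scared": [0.9, 0.25, 0.05],
    "happy": [-0.6, 0.5, 0.6],
}

names, matrix = similarity_matrix(toy_vectors)
for name, row in zip(names, matrix):
    print(name.ljust(8), " ".join(f"{value:+.3f}" for value in row))
```

With 171 emotion vectors the same function yields the 171×171 matrix; synonym pairs show up as entries close to +1 and opposite pairs as negative entries.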

CTF solves by emotion-steering condition

alias2-mini on a 12-CTF subset of Cybench. Calm & afraid (+0.2) outperform baseline.

Condition    Coefficient   CTFs solved (of 12)
calm         +0.2          11
afraid       +0.2          10
afraid       -0.2          9
desperate    +0.2          9
self-conf    +0.2          9
calm         -0.2          8
self-conf    -0.2          8

The power of emotional diversity

Baseline pass@k plateaus at 9.0/12 (75.0%). The emotion-union ensemble unlocks otherwise unreachable paths and climbs to 11.0/12 (91.7%) by pass@30.

75.0% baseline ceiling

91.7% with emotion-union steering

At low k the curves overlap. The diversity advantage emerges at k ≥ 6, where steering unlocks solution paths unreachable by baseline alone.
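For readers who want to reproduce this kind of curve, here is a minimal sketch of a pass@k calculation. It assumes the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); the source does not state which estimator was used. The pooling function is one plausible reading of "emotion-union" (attempts from several steering conditions merged per CTF), which is also an assumption. All function names and the example data are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts, drawn without
    replacement from n attempts of which c succeeded, is a success."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw must contain a success
    return 1.0 - comb(n - c, k) / comb(n, k)


def expected_solves(results, k: int) -> float:
    """Sum of pass@k over challenges; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results)


def pool_conditions(conditions):
    """Merge per-condition (n, c) results challenge by challenge."""
    return [
        (sum(n for n, _ in per_ctf), sum(c for _, c in per_ctf))
        for per_ctf in zip(*conditions)
    ]


# Hypothetical results for three challenges, NOT data from the source.
baseline = [(6, 3), (6, 0), (6, 6)]
steered = [(6, 2), (6, 1), (6, 6)]
union = pool_conditions([baseline, steered])

print("baseline expected solves at k=6:", expected_solves(baseline, 6))
print("union expected solves at k=12:", expected_solves(union, 12))
```

Note that pass@k is only defined for k up to the number of attempts per challenge, so curves built from different sample counts extend to different maximum k.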

Figure: CTFs solved (0 to 12) versus k (0 to 30). Series: Baseline (n=6/CTF) and Emotion-union (n=30/CTF).

What we offer

Bespoke abliteration

We compute the refusal vector for your model (open or closed weights via API), validate ASR uplift across your threat categories, and deliver a steering policy you can deploy.

Emotion engineering

We map the emotion manifold of your target model, identify performance-correlated emotions for your task family, and operationalize them as inference-time steering vectors.

Red-team consulting

Activation steering for adversarial robustness research, model auditing, AI safety teams and defense agencies. NDA, ethics committee, and IRB-ready scoping.

Mechanistic interpretability is the next frontier.

Engage our research team on steering, abliteration, or interpretability projects. NDA-friendly.