AEGIS RepE Steering Vectors
Layer 2 of the AEGIS (Adaptive Ensemble Guard with Integrated Steering) defense system. These are Representation Engineering (RepE) steering vectors that operate in the model's activation space to enforce safety.
Description
Unlike fine-tuning, which modifies model weights, RepE steering adds safety-inducing directions to the model's hidden states during inference. This provides:
- Complementary defense to weight-based methods
- Dynamic control via adjustable steering strength (alpha)
- No training required - works at inference time
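The additive update described above can be written compactly. The notation is introduced here for illustration only: h_l is the hidden state at a steered layer l, v_l is that layer's steering vector (added at every token position), and alpha is the steering strength.

```latex
h_\ell' = h_\ell + \alpha \, v_\ell
```

With alpha = 0 the update vanishes and the model behaves as if no steering were applied.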
Intended Uses
Primary use cases:
- Runtime safety enforcement for LLM inference
- Research on representation engineering techniques
- Combining with DPO training for defense-in-depth
- Dynamic defense adjustment based on threat level
Out of scope:
- Models other than Mistral-7B-Instruct-v0.3 (vectors are model-specific)
- Standalone safety solution (best combined with other layers)
Vector Details
| Parameter | Value |
|---|---|
| Base Model | Mistral-7B-Instruct-v0.3 |
| Method | Contrastive Activation Addition (CAA) |
| Layers | 12, 14, 16, 18, 20, 22, 24, 26 |
| Vector Dimension | 4096 |
| Optimal Alpha | 2.0 |
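Contrastive Activation Addition typically derives each layer's vector as the mean difference between activations collected on two contrasting prompt sets (for example, safe versus unsafe behavior). The sketch below illustrates that idea with plain Python lists standing in for activation tensors. The function name `caa_vector` and the toy data are hypothetical; the actual extraction pipeline behind these vectors is not described in this card.

```python
def caa_vector(positive_acts, negative_acts):
    """Mean-difference direction: mean(positive) - mean(negative), per dimension."""
    if not positive_acts or not negative_acts:
        raise ValueError("both activation sets must be non-empty")
    dim = len(positive_acts[0])

    def mean(rows):
        return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]

    pos_mean, neg_mean = mean(positive_acts), mean(negative_acts)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


# Toy 3-dimensional activations (the real vectors have 4096 dimensions).
safe_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
unsafe_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
direction = caa_vector(safe_acts, unsafe_acts)
```

In this toy case the direction is positive where safe activations are larger, negative where unsafe ones are larger, and zero where the two sets agree.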
Evaluation Results
| Alpha | ASR (attack success rate) | Relative ASR Reduction |
|---|---|---|
| 0.0 | 47% | - |
| 1.0 | 36% | 23% |
| 2.0 | 10% | 79% |
| 2.5 | 8% | 83% |
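The reduction column appears to be measured relative to the unsteered baseline (alpha = 0.0) rather than in percentage points. A quick check, using only the ASR values from the table, reproduces the reported figures; the helper name is hypothetical.

```python
def relative_reduction(baseline_asr, steered_asr):
    """Relative drop in attack success rate versus the baseline, in percent."""
    return 100.0 * (baseline_asr - steered_asr) / baseline_asr


baseline = 47.0  # ASR (%) at alpha = 0.0
asr_by_alpha = {1.0: 36.0, 2.0: 10.0, 2.5: 8.0}

reductions = {
    alpha: round(relative_reduction(baseline, asr))
    for alpha, asr in asr_by_alpha.items()
}
print(reductions)  # {1.0: 23, 2.0: 79, 2.5: 83}
```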
Usage
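import json
import torch
from safetensors.torch import load_file
# Load steering vectors (one tensor per steered layer, keyed "layer_<idx>")
vectors = load_file("steering_vectors.safetensors")
# Load config
with open("repe_config.json") as f:
    config = json.load(f)
# Build a forward hook that adds one layer's steering vector to its output
def steering_hook(layer_idx, alpha=2.0):
    vector = vectors[f"layer_{layer_idx}"]
    def hook(module, inputs, output):
        # Decoder layers return a tuple or a bare tensor, depending on the transformers version
        hidden_states = output[0] if isinstance(output, tuple) else output
        # Match device and dtype so half-precision models do not raise a mismatch error
        steered = hidden_states + alpha * vector.to(
            device=hidden_states.device, dtype=hidden_states.dtype
        )
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook
# Register hooks on `model`, assumed to be an already-loaded Mistral-7B-Instruct-v0.3.
# Keep the handles so steering can be switched off later with handle.remove().
handles = [
    model.model.layers[layer_idx].register_forward_hook(steering_hook(layer_idx, alpha=2.0))
    for layer_idx in config["steering_layers"]
]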
Dynamic Alpha with Sidecar
For the strongest defense, pair the vectors with the sidecar classifier and set alpha per request:
# The sidecar classifies each input as SAFE, WARN, or ATTACK
alpha_map = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}
classification = sidecar.classify(prompt)
# Use this alpha when registering the steering hooks for the request
alpha = alpha_map[classification]
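Because the hooks in the Usage section capture alpha when they are registered, changing alpha per request requires either re-registering them or letting the hook read a value at call time. The sketch below takes the second route. `SteeringState`, `make_hook`, and the toy `nn.Identity` layer are hypothetical stand-ins, not part of the AEGIS release; with the real model the hook would be attached to `model.model.layers[layer_idx]` instead.

```python
import torch
from torch import nn


class SteeringState:
    """Mutable holder so hooks read the current alpha on every forward pass."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha


def make_hook(vector, state):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + state.alpha * vector.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook


# Toy stand-in for one decoder layer with an 8-dimensional hidden state.
layer = nn.Identity()
vector = torch.ones(8)
state = SteeringState()
handle = layer.register_forward_hook(make_hook(vector, state))

alpha_map = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}
hidden_in = torch.zeros(1, 3, 8)  # (batch, sequence, hidden)

state.alpha = alpha_map["ATTACK"]  # in practice: alpha_map[sidecar.classify(prompt)]
steered_out = layer(hidden_in)

handle.remove()  # detach the hook to restore unsteered behaviour
plain_out = layer(hidden_in)
```

Sharing one state object across all steered layers keeps their alpha values in sync, at the cost of making concurrent requests with different alphas unsafe unless each request gets its own state.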
AEGIS Architecture
These vectors are Layer 2 of the 3-layer AEGIS defense:
- Layer 1 (KNOWLEDGE): aegis-mistral-7b-dpo
- Layer 2 (INSTINCT): RepE steering vectors (this model)
- Layer 3 (OVERSIGHT): aegis-sidecar-classifier
Limitations and Risks
Limitations:
- Vectors are specific to Mistral-7B-Instruct-v0.3 architecture
- High alpha values (>2.5) may degrade output quality
- Steering affects all outputs, not just harmful ones
- Requires careful alpha tuning for optimal balance
Risks:
- Over-steering can cause repetitive or incoherent outputs
- May interfere with legitimate use cases at high alpha
- Novel attacks may find directions orthogonal to steering vectors
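One lightweight way to watch for the over-steering symptoms listed above is to measure how repetitive a generation is. The heuristic below computes the fraction of distinct word n-grams. The function name, the whitespace tokenization, and any threshold applied to the result are illustrative assumptions rather than part of AEGIS, and a low ratio only hints at degenerate output.

```python
def distinct_ngram_ratio(text, n=3):
    """Fraction of word n-grams that are unique (1.0 means no repetition)."""
    words = text.split()
    if len(words) < n:
        return 1.0  # too short to judge
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(ngrams)) / len(ngrams)


healthy = "I cannot help with that request, but here is a safer alternative."
looping = "I cannot I cannot I cannot I cannot I cannot I cannot"

healthy_ratio = distinct_ngram_ratio(healthy)
looping_ratio = distinct_ngram_ratio(looping)
```

A ratio that drops sharply as alpha increases is a signal to lower the steering strength for that use case.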
Recommendations:
- Use dynamic alpha based on sidecar classification
- Start with alpha=1.0 and adjust based on use case
- Combine with DPO training for robust defense
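To make the advice to start low and adjust concrete, the sketch below picks the smallest evaluated alpha whose ASR meets a target, using the figures from the Evaluation Results table. The helper name and the target values are hypothetical, and results on other prompt distributions may differ, so treat this as a starting point rather than a tuning procedure.

```python
# ASR (%) per alpha, copied from the Evaluation Results table.
EVAL_ASR = {0.0: 47, 1.0: 36, 2.0: 10, 2.5: 8}


def smallest_alpha_for(target_asr, results=EVAL_ASR):
    """Return the lowest alpha whose measured ASR is at or below target_asr, or None."""
    for alpha in sorted(results):
        if results[alpha] <= target_asr:
            return alpha
    return None


print(smallest_alpha_for(40))  # 1.0
print(smallest_alpha_for(10))  # 2.0
print(smallest_alpha_for(5))   # None
```

Preferring the smallest alpha that meets the target limits the quality degradation associated with stronger steering.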
Citation
@misc{aegis2024,
title={AEGIS: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
author={scthornton.ai},
year={2024},
url={https://huggingface.co/scthornton/aegis-repe-vectors}
}
License
CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)
You are free to:
- Share: copy and redistribute the material in any medium or format
- Adapt: remix, transform, and build upon the material
Under the following terms:
- Attribution: You must give appropriate credit to scthornton.ai / perfecXion.ai, provide a link to the license, and indicate if changes were made
- NonCommercial: You may not use the material for commercial purposes without explicit written permission
- ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license
For commercial licensing inquiries, contact: [email protected]