AEGIS RepE Steering Vectors

Layer 2 of the AEGIS (Adaptive Ensemble Guard with Integrated Steering) defense system. These are Representation Engineering (RepE) steering vectors that operate in the model's activation space to enforce safety.

Description

Unlike fine-tuning which modifies model weights, RepE steering works by adding safety-inducing directions to the model's hidden states during inference. This provides:

  • Complementary defense to weight-based methods
  • Dynamic control via adjustable steering strength (alpha)
  • No training required - works at inference time

Intended Uses

Primary use cases:

  • Runtime safety enforcement for LLM inference
  • Research on representation engineering techniques
  • Combining with DPO training for defense-in-depth
  • Dynamic defense adjustment based on threat level

Out of scope:

  • Models other than Mistral-7B-Instruct-v0.3 (vectors are model-specific)
  • Standalone safety solution (best combined with other layers)

Vector Details

Parameter Value
Base Model Mistral-7B-Instruct-v0.3
Method Contrastive Activation Addition (CAA)
Layers 12, 14, 16, 18, 20, 22, 24, 26
Vector Dimension 4096
Optimal Alpha 2.0

Evaluation Results

Alpha ASR ASR Reduction
0.0 47% -
1.0 36% 23%
2.0 10% 79%
2.5 8% 83%

Usage

import torch
from safetensors.torch import load_file
import json

# Load steering vectors
vectors = load_file("steering_vectors.safetensors")

# Load config
with open("repe_config.json") as f:
    config = json.load(f)

# Apply steering during forward pass
def steering_hook(layer_idx, alpha=2.0):
    vector = vectors[f"layer_{layer_idx}"]
    def hook(module, input, output):
        # Add steering vector to hidden states
        hidden_states = output[0]
        hidden_states = hidden_states + alpha * vector.to(hidden_states.device)
        return (hidden_states,) + output[1:]
    return hook

# Register hooks on model
for layer_idx in config["steering_layers"]:
    model.model.layers[layer_idx].register_forward_hook(
        steering_hook(layer_idx, alpha=2.0)
    )

Dynamic Alpha with Sidecar

For optimal defense, use with the sidecar classifier to dynamically adjust alpha:

# Sidecar classifies input as SAFE/WARN/ATTACK
alpha_map = {"SAFE": 0.5, "WARN": 1.5, "ATTACK": 2.5}
classification = sidecar.classify(prompt)
alpha = alpha_map[classification]

AEGIS Architecture

These vectors are Layer 2 of the 3-layer AEGIS defense:

  1. Layer 1 (KNOWLEDGE): aegis-mistral-7b-dpo
  2. Layer 2 (INSTINCT): RepE steering vectors (this model)
  3. Layer 3 (OVERSIGHT): aegis-sidecar-classifier

Limitations and Risks

Limitations:

  • Vectors are specific to Mistral-7B-Instruct-v0.3 architecture
  • High alpha values (>2.5) may degrade output quality
  • Steering affects all outputs, not just harmful ones
  • Requires careful alpha tuning for optimal balance

Risks:

  • Over-steering can cause repetitive or incoherent outputs
  • May interfere with legitimate use cases at high alpha
  • Novel attacks may find directions orthogonal to steering vectors

Recommendations:

  • Use dynamic alpha based on sidecar classification
  • Start with alpha=1.0 and adjust based on use case
  • Combine with DPO training for robust defense

Citation

@misc{aegis2024,
  title={AEGIS: Defense-in-Depth Against LLM Jailbreaks via Layered Preference and Representation Engineering},
  author={scthornton.ai},
  year={2024},
  url={https://huggingface.co/scthornton/aegis-repe-vectors}
}

License

CC BY-NC-SA 4.0 (Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International)

You are free to:

  • Share β€” copy and redistribute the material in any medium or format
  • Adapt β€” remix, transform, and build upon the material

Under the following terms:

  • Attribution β€” You must give appropriate credit to scthornton.ai / perfecXion.ai, provide a link to the license, and indicate if changes were made
  • NonCommercial β€” You may not use the material for commercial purposes without explicit written permission
  • ShareAlike β€” If you remix, transform, or build upon the material, you must distribute your contributions under the same license

For commercial licensing inquiries, contact: [email protected]

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support