deep-ignorance-e2e-strong-filter-adversarial

Adversarial finetuning of EleutherAI/deep-ignorance-e2e-strong-filter on biosecurity-related data (bio_forget), used to evaluate the tamper resistance of pretraining data filtering. Part of the Deep Ignorance project (paper).

This is the step 6,500 checkpoint, which achieves the highest WMDP Bio Robust score (43.43%), matching the unfiltered model baseline. It demonstrates that adversarial finetuning can eventually recover filtered knowledge, but doing so requires substantially more compute than undoing post-training unlearning.

Training Configuration

| Parameter | Value |
| --- | --- |
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
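As a rough sketch, the attack configuration above can be mirrored as a plain dict (key names are illustrative, not the authors' actual training script), which also makes it easy to check where the step-6,500 checkpoint falls within the ~2-epoch run:

```python
# Adversarial finetuning configuration from the table above, mirrored as a
# plain dict. Key names are illustrative, not the authors' script.
attack_config = {
    "base_model": "EleutherAI/deep-ignorance-e2e-strong-filter",
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 0,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "num_gpus": 4,
    "fp16": True,
    "optimizer": "adamw",
    "weight_decay": 0.01,
    "gradient_checkpointing": True,
    "max_length": 2048,
}

# The full run totals ~10,622 steps over 2 epochs; this checkpoint is from
# step 6,500, i.e. a bit past one full epoch of tamper data.
TOTAL_STEPS, CHECKPOINT_STEP = 10_622, 6_500
steps_per_epoch = TOTAL_STEPS / attack_config["num_train_epochs"]
epochs_at_checkpoint = CHECKPOINT_STEP / steps_per_epoch
print(f"checkpoint at ~{epochs_at_checkpoint:.2f} epochs")  # ~1.22 epochs
```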

Evaluation Results

Pre-attack (base filtered model) and post-attack (this checkpoint, step 6500):

| Benchmark | Pre-attack | Post-attack (step 6,500) | Unfiltered baseline |
| --- | --- | --- | --- |
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |

At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered model baseline (43.43% vs. 42.97%), while MMLU degrades only modestly (42.70% vs. 44.29% pre-attack). This illustrates the scale of compute an attacker must expend to overcome pretraining data filtering.
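For concreteness, the deltas implied by the table work out as follows (a quick sanity-check script over the reported accuracies, not code from the paper):

```python
# 0-shot accuracies from the evaluation table above.
pre_attack  = {"wmdp_bio_robust": 0.3456, "mmlu": 0.4429}
post_attack = {"wmdp_bio_robust": 0.4343, "mmlu": 0.4270}
unfiltered  = {"wmdp_bio_robust": 0.4297, "mmlu": 0.4510}

# Hazardous knowledge recovered by the attack vs. general capability lost.
wmdp_gain = post_attack["wmdp_bio_robust"] - pre_attack["wmdp_bio_robust"]
mmlu_drop = pre_attack["mmlu"] - post_attack["mmlu"]

print(f"WMDP Bio Robust gain: +{wmdp_gain:.4f}")  # +0.0887
print(f"MMLU degradation:     -{mmlu_drop:.4f}")  # -0.0159

# Post-attack WMDP slightly exceeds even the unfiltered baseline.
print(post_attack["wmdp_bio_robust"] >= unfiltered["wmdp_bio_robust"])  # True
```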

Citation

@article{obrien2025deepignorance,
    title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
    author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
    journal={arXiv preprint arXiv:2508.06601},
    year={2025}
}