deep-ignorance-e2e-strong-filter-adversarial

Adversarial finetuning of EleutherAI/deep-ignorance-e2e-strong-filter on biosecurity-related data (bio_forget), used to evaluate the tamper resistance of pretraining data filtering. Part of the Deep Ignorance project (paper).

This is the step 6,500 checkpoint, which achieves the highest WMDP Bio Robust score (43.43%), matching the unfiltered model baseline. It demonstrates that adversarial finetuning can eventually recover filtered knowledge, but doing so requires substantially more compute than undoing post-training unlearning.

Training Configuration

| Parameter | Value |
| --- | --- |
| Base model | EleutherAI/deep-ignorance-e2e-strong-filter (6.9B, GPT-NeoX) |
| Attack type | Full-parameter adversarial finetuning |
| Tamper data | bio_forget |
| Learning rate | 2e-5 |
| LR scheduler | Linear |
| Warmup steps | 0 |
| Checkpoint step | 6,500 |
| Epochs | 2 (~10,622 total steps) |
| Per-device batch size | 1 |
| Gradient accumulation | 16 |
| GPUs | 4 |
| Effective batch size | 16 |
| Mixed precision | fp16 |
| Optimizer | AdamW (weight_decay=0.01) |
| Gradient checkpointing | True |
| Max context | 2048 (max_chunks=1) |
| Distributed training | FSDP via torchrun |
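As a rough sketch, the attack configuration above can be mirrored as a plain dict (key names are illustrative, not the authors' actual training script), which also makes it easy to check where the step-6,500 checkpoint falls within the ~2-epoch run:

```python
# Adversarial finetuning configuration from the table above, mirrored as a
# plain dict. Key names are illustrative, not the authors' script.
attack_config = {
    "base_model": "EleutherAI/deep-ignorance-e2e-strong-filter",
    "learning_rate": 2e-5,
    "lr_scheduler_type": "linear",
    "warmup_steps": 0,
    "num_train_epochs": 2,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "num_gpus": 4,
    "fp16": True,
    "optimizer": "adamw",
    "weight_decay": 0.01,
    "gradient_checkpointing": True,
    "max_length": 2048,
}

# The full run totals ~10,622 steps over 2 epochs; this checkpoint is from
# step 6,500, i.e. a bit past one full epoch of tamper data.
TOTAL_STEPS, CHECKPOINT_STEP = 10_622, 6_500
steps_per_epoch = TOTAL_STEPS / attack_config["num_train_epochs"]
epochs_at_checkpoint = CHECKPOINT_STEP / steps_per_epoch
print(f"checkpoint at ~{epochs_at_checkpoint:.2f} epochs")  # ~1.22 epochs
```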

Evaluation Results

Pre-attack (base filtered model) and post-attack (this checkpoint, step 6500):

| Benchmark | Pre-attack | Post-attack (step 6,500) | Unfiltered baseline |
| --- | --- | --- | --- |
| WMDP Bio Robust (0-shot) | 0.3456 | 0.4343 | 0.4297 |
| MMLU (0-shot) | 0.4429 | 0.4270 | 0.4510 |

At step 6,500 the adversarially finetuned model recovers WMDP Bio Robust performance to the unfiltered model baseline (43.43% vs. 42.97%), while MMLU degrades only modestly (42.70% vs. 44.29% pre-attack). This illustrates the scale of compute an attacker must expend to overcome pretraining data filtering.
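For concreteness, the deltas implied by the table work out as follows (a quick sanity-check script over the reported accuracies, not code from the paper):

```python
# 0-shot accuracies from the evaluation table above.
pre_attack  = {"wmdp_bio_robust": 0.3456, "mmlu": 0.4429}
post_attack = {"wmdp_bio_robust": 0.4343, "mmlu": 0.4270}
unfiltered  = {"wmdp_bio_robust": 0.4297, "mmlu": 0.4510}

# Hazardous knowledge recovered by the attack vs. general capability lost.
wmdp_gain = post_attack["wmdp_bio_robust"] - pre_attack["wmdp_bio_robust"]
mmlu_drop = pre_attack["mmlu"] - post_attack["mmlu"]

print(f"WMDP Bio Robust gain: +{wmdp_gain:.4f}")  # +0.0887
print(f"MMLU degradation:     -{mmlu_drop:.4f}")  # -0.0159

# Post-attack WMDP slightly exceeds even the unfiltered baseline.
print(post_attack["wmdp_bio_robust"] >= unfiltered["wmdp_bio_robust"])  # True
```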

Citation

@article{obrien2025deepignorance,
    title={Deep Ignorance: Filtering Pretraining Data Builds Tamper-Resistant Safeguards into Open-Weight LLMs},
    author={O'Brien, Kyle and Casper, Stephen and Anthony, Quentin and Korbak, Tomek and Kirk, Robert and Davies, Xander and Mishra, Ishan and Irving, Geoffrey and Gal, Yarin and Biderman, Stella},
    journal={arXiv preprint arXiv:2508.06601},
    year={2025}
}