# LLaDA-8B SAE (Mask Policy)

## Model Description
This repository provides Sparse Autoencoders (SAEs) trained on intermediate hidden activations of LLaDA-8B-Base, a Diffusion Language Model (DLM).
The SAEs are trained using hidden states extracted from the base model under a mask policy, where activations corresponding to masked positions in the diffusion language modeling process are used as training targets.
These SAEs aim to learn sparse, interpretable feature representations of the internal activations of diffusion language models, enabling mechanistic analysis of model computation.
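Since the repository name indicates a TopK SAE, the encode/decode path can be sketched roughly as follows. All dimensions, weights, and function names here are illustrative assumptions, not the released architecture or code:

```python
import numpy as np

# Minimal TopK sparse autoencoder sketch over hidden activations.
# Sizes are illustrative, not the actual LLaDA-8B / SAE dimensions.
D_MODEL, D_SAE, K = 16, 64, 4  # hidden size, dictionary size, active latents

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.02, size=(D_MODEL, D_SAE))
b_enc = np.zeros(D_SAE)
W_dec = rng.normal(scale=0.02, size=(D_SAE, D_MODEL))
b_dec = np.zeros(D_MODEL)

def encode(h):
    """Project activations into the dictionary, keeping only the top-k latents."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    drop = np.argsort(z, axis=-1)[..., :-K]  # indices of all but the k largest
    np.put_along_axis(z, drop, 0.0, axis=-1) # zero everything outside the top-k
    return z

def decode(z):
    """Reconstruct the original activation from the sparse latents."""
    return z @ W_dec + b_dec

h = rng.normal(size=(2, D_MODEL))            # a batch of hidden activations
z = encode(h)
h_hat = decode(z)
```

The TopK constraint guarantees at most `K` active latents per activation vector, which is what makes the learned dictionary directions sparse and individually inspectable.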
## Key Characteristics
- Base model: GSAI-ML/LLaDA-8B-Base
- Model family: Diffusion Language Model (DLM)
- Component type: Sparse Autoencoder (SAE)
- Training policy: mask policy
- Training target: intermediate hidden activations
- Primary purpose: mechanistic interpretability and feature analysis
## Training Setup
The SAEs are trained on hidden activations extracted from LLaDA-8B-Base.
Training data consists of hidden states collected during the diffusion language modeling process, specifically under a mask policy in which activations at masked token positions serve as training targets.
The SAE is optimized to reconstruct the original hidden activations while encouraging sparse latent representations.
This setup enables the SAE to capture interpretable latent features associated with the internal computation of the diffusion language model.
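The mask-policy target collection described above can be sketched as follows. The shapes, the fixed mask, and the plain ReLU autoencoder are illustrative assumptions standing in for the actual extraction and training pipeline:

```python
import numpy as np

# Sketch of mask-policy target collection: only hidden states at masked
# token positions become SAE training targets. All names and shapes are
# illustrative and do not mirror the released training code.
rng = np.random.default_rng(1)
SEQ, D_MODEL, D_SAE = 8, 16, 64

hidden = rng.normal(size=(SEQ, D_MODEL))                # one layer's activations
mask = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)   # diffusion-masked positions

targets = hidden[mask]                                  # masked-position activations only

# One reconstruction pass with a plain ReLU autoencoder (sparsity term omitted).
W_enc = rng.normal(scale=0.02, size=(D_MODEL, D_SAE))
W_dec = rng.normal(scale=0.02, size=(D_SAE, D_MODEL))
recon = np.maximum(targets @ W_enc, 0.0) @ W_dec
mse = float(np.mean((recon - targets) ** 2))            # reconstruction objective
```

In the actual setup this reconstruction loss is combined with a sparsity constraint (TopK, per the repository name), so the SAE must explain each masked-position activation with a small number of dictionary directions.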
## Intended Use
This repository is intended for research purposes, including:
- mechanistic interpretability of diffusion language models
- sparse feature analysis of transformer hidden states
- probing internal representations of LLaDA-style models
- studying structure in diffusion-based language model activations
The SAE is not intended for direct text generation or downstream inference tasks.
## Limitations
The SAE is trained only on activations collected under a specific mask policy.
Learned sparse features may not generalize across all layers or model configurations.
Interpretability of discovered features may require additional qualitative analysis.
This repository is intended primarily for interpretability research.
## Citation
If you use this model in your research, please cite the corresponding work.
```bibtex
@inproceedings{wang2026dlmscope,
  title={{DLM}-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders},
  author={Xu Wang and Bingqing Jiang and Yu Wan and Baosong Yang and Lingpeng Kong and Difan Zou},
  booktitle={ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities},
  year={2026},
  url={https://openreview.net/forum?id=yO5buOEUag}
}
```