# LLaDA-8B SAE (Mask Policy)

## Model Description
This repository provides Sparse Autoencoders (SAEs) trained on intermediate hidden activations of LLaDA-8B-Base, a Diffusion Language Model (DLM).
The SAEs are trained using hidden states extracted from the base model under a mask policy, where activations corresponding to masked positions in the diffusion language modeling process are used as training targets.
These SAEs aim to learn sparse, interpretable feature representations of the internal activations of diffusion language models, enabling mechanistic analysis of model computation.
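Since the repository name indicates a TopK SAE, the encode/decode path can be sketched roughly as follows. All dimensions, weights, and function names here are illustrative assumptions, not the released architecture or code:

```python
import numpy as np

# Minimal TopK sparse autoencoder sketch over hidden activations.
# Sizes are illustrative, not the actual LLaDA-8B / SAE dimensions.
D_MODEL, D_SAE, K = 16, 64, 4  # hidden size, dictionary size, active latents

rng = np.random.default_rng(0)
W_enc = rng.normal(scale=0.02, size=(D_MODEL, D_SAE))
b_enc = np.zeros(D_SAE)
W_dec = rng.normal(scale=0.02, size=(D_SAE, D_MODEL))
b_dec = np.zeros(D_MODEL)

def encode(h):
    """Project activations into the dictionary, keeping only the top-k latents."""
    z = np.maximum(h @ W_enc + b_enc, 0.0)   # ReLU pre-activations
    drop = np.argsort(z, axis=-1)[..., :-K]  # indices of all but the k largest
    np.put_along_axis(z, drop, 0.0, axis=-1) # zero everything outside the top-k
    return z

def decode(z):
    """Reconstruct the original activation from the sparse latents."""
    return z @ W_dec + b_dec

h = rng.normal(size=(2, D_MODEL))            # a batch of hidden activations
z = encode(h)
h_hat = decode(z)
```

The TopK constraint guarantees at most `K` active latents per activation vector, which is what makes the learned dictionary directions sparse and individually inspectable.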
## Key Characteristics
- Base model: GSAI-ML/LLaDA-8B-Base
- Model family: Diffusion Language Model (DLM)
- Component type: Sparse Autoencoder (SAE)
- Training policy: mask policy
- Training target: intermediate hidden activations
- Primary purpose: mechanistic interpretability and feature analysis
## Training Setup
The SAEs are trained on hidden activations extracted from LLaDA-8B-Base.
Training data consists of hidden states collected during the diffusion language modeling process, specifically under a mask policy in which activations at masked token positions serve as training targets.
The SAE is optimized to reconstruct the original hidden activations while encouraging sparse latent representations.
This setup enables the SAE to capture interpretable latent features associated with the internal computation of the diffusion language model.
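The mask-policy target collection described above can be sketched as follows. The shapes, the fixed mask, and the plain ReLU autoencoder are illustrative assumptions standing in for the actual extraction and training pipeline:

```python
import numpy as np

# Sketch of mask-policy target collection: only hidden states at masked
# token positions become SAE training targets. All names and shapes are
# illustrative and do not mirror the released training code.
rng = np.random.default_rng(1)
SEQ, D_MODEL, D_SAE = 8, 16, 64

hidden = rng.normal(size=(SEQ, D_MODEL))                # one layer's activations
mask = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=bool)   # diffusion-masked positions

targets = hidden[mask]                                  # masked-position activations only

# One reconstruction pass with a plain ReLU autoencoder (sparsity term omitted).
W_enc = rng.normal(scale=0.02, size=(D_MODEL, D_SAE))
W_dec = rng.normal(scale=0.02, size=(D_SAE, D_MODEL))
recon = np.maximum(targets @ W_enc, 0.0) @ W_dec
mse = float(np.mean((recon - targets) ** 2))            # reconstruction objective
```

In the actual setup this reconstruction loss is combined with a sparsity constraint (TopK, per the repository name), so the SAE must explain each masked-position activation with a small number of dictionary directions.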
## Intended Use
This repository is intended for research purposes, including:
- mechanistic interpretability of diffusion language models
- sparse feature analysis of transformer hidden states
- probing internal representations of LLaDA-style models
- studying structure in diffusion-based language model activations
The SAE is not intended for direct text generation or downstream inference tasks.
## Limitations
The SAE is trained only on activations collected under a specific mask policy.
Learned sparse features may not generalize across all layers or model configurations.
Interpretability of discovered features may require additional qualitative analysis.
This repository is intended primarily for interpretability research.
## Citation
If you use this model in your research, please cite the corresponding work.
```bibtex
@inproceedings{wang2026dlmscope,
  title={{DLM}-Scope: Mechanistic Interpretability of Diffusion Language Models via Sparse Autoencoders},
  author={Xu Wang and Bingqing Jiang and Yu Wan and Baosong Yang and Lingpeng Kong and Difan Zou},
  booktitle={ICLR 2026 Workshop on Principled Design for Trustworthy AI - Interpretability, Robustness, and Safety across Modalities},
  year={2026},
  url={https://openreview.net/forum?id=yO5buOEUag}
}
```