Qwen3.6-35B-A3B SAE @ residual L23 (work in progress)

A Top-K sparse autoencoder (SAE) trained on the residual stream at layer 23 of Qwen3.6-35B-A3B, a hybrid Mixture-of-Experts + Gated DeltaNet + Gated Attention reasoning model.

This is the first public SAE for the Qwen3.6 reasoning models (the post-trained, reasoning-tuned series). It complements Qwen-Scope (released Apr 27, 2026), which covers the Qwen3.5/Qwen3 base and MoE families; see "Coverage relative to Qwen-Scope" below.

Architecture

Item	Value
Base model	`Qwen/Qwen3.6-35B-A3B` (hybrid MoE+GDN+GA, 40 layers)
SAE position	residual stream at `layer 23`
Type	Top-K + AuxK (Gao et al. 2024)
Dictionary width	`d_sae = 32,768` (16× expansion over `d_model=2,048`)
Sparsity	`k=128` active features
AuxK	`k_aux=2,560` for dead-feature mitigation
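To make the Top-K + AuxK setup concrete, here is a minimal NumPy sketch of the forward pass and the auxiliary loss in the style of Gao et al. (2024). It is not the training code behind this checkpoint. The toy dimensions, the tied initialization, the ReLU applied after the Top-K, the plain-MSE form of the AuxK term, and the `dead_mask` input (in practice, latents that have not fired over a long window of tokens) are all assumptions made for illustration.

```python
import numpy as np

# Toy sizes so the sketch runs instantly. The card's values are
# d_model=2048, d_sae=32768, k=128, k_aux=2560.
D_MODEL, D_SAE, K, K_AUX = 64, 1024, 8, 32

rng = np.random.default_rng(0)
W_enc = rng.normal(0.0, 1.0 / np.sqrt(D_MODEL), (D_MODEL, D_SAE)).astype(np.float32)
W_dec = W_enc.T.copy()                                  # assumed tied init
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)   # unit-norm decoder rows
b_enc = np.zeros(D_SAE, dtype=np.float32)
b_dec = np.zeros(D_MODEL, dtype=np.float32)


def topk_mask(pre, k):
    """Keep the k largest entries in each row and zero out the rest."""
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    return out


def forward(x, k=K):
    """Encode with Top-K (then ReLU) and decode back to the residual stream."""
    pre = (x - b_dec) @ W_enc + b_enc        # (batch, d_sae) pre-activations
    z = np.maximum(topk_mask(pre, k), 0.0)   # at most k non-zero latents per row
    x_hat = z @ W_dec + b_dec                # (batch, d_model) reconstruction
    return pre, z, x_hat


def auxk_loss(x, pre, x_hat, dead_mask, k_aux=K_AUX):
    """Reconstruct the main residual error using only the top-k_aux dead latents."""
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return 0.0
    dead_pre = np.where(dead_mask, pre, -np.inf)   # hide live latents
    z_aux = np.maximum(topk_mask(dead_pre, min(k_aux, n_dead)), 0.0)
    error_hat = z_aux @ W_dec                      # no decoder bias on the error
    return float(np.mean(((x - x_hat) - error_hat) ** 2))


x = rng.normal(size=(4, D_MODEL)).astype(np.float32)
pre, z, x_hat = forward(x)
dead = np.zeros(D_SAE, dtype=bool)
dead[:200] = True                  # pretend the first 200 latents are dead
aux = auxk_loss(x, pre, x_hat, dead)
```

During training the AuxK term is added to the main reconstruction loss with a small weight, which gives dead latents a gradient signal without changing the Top-K sparsity of the main path.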

Layer choice rationale

The S4-L0 layer mapping (internal project notes: project_mechreward_full_state_20260417) identified:

Peak contrastive Cohen's d at L22 (55% depth)
Logit-lens commit at L39 (a 17-layer gap, the largest measured across the architectures studied so far)
L23 as the target layer: the Gated Attention layer immediately downstream of the peak (following the Qwen3.5-4B peak+5 pattern)
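The layer arithmetic above can be checked in a few lines. This sketch assumes 0-indexed layers and a 3:1 Gated DeltaNet to Gated Attention interleave with one attention layer every 4 layers (indices 3, 7, 11, ..., as in the Qwen3-Next-style hybrid design); that layout is an assumption here and should be confirmed against the model's config before relying on it.

```python
NUM_LAYERS = 40
PEAK_LAYER = 22      # peak contrastive Cohen's d
COMMIT_LAYER = 39    # logit-lens commit

# Assumption: one Gated Attention layer every FULL_ATTENTION_INTERVAL layers,
# placed last in each block (0-indexed layers 3, 7, 11, ...).
FULL_ATTENTION_INTERVAL = 4


def is_gated_attention(layer: int) -> bool:
    return (layer + 1) % FULL_ATTENTION_INTERVAL == 0


depth_fraction = PEAK_LAYER / NUM_LAYERS
commit_gap = COMMIT_LAYER - PEAK_LAYER
first_attn_after_peak = next(
    layer for layer in range(PEAK_LAYER + 1, NUM_LAYERS) if is_gated_attention(layer)
)
print(f"peak depth {depth_fraction:.0%}, gap {commit_gap} layers, "
      f"first GA layer after peak: L{first_attn_after_peak}")
# peak depth 55%, gap 17 layers, first GA layer after peak: L23
```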

Training status (WIP)

Metric	Value
Tokens processed	92M (step 22,674)
Variance explained (avg)	0.835
Variance explained (peak)	0.866
Loss-recovered	(in progress, target ≥0.99)
Alive-feature fraction	(in progress, target ≥0.95)
Training corpus	FineWeb-Edu
Target tokens	1B (currently 9.2% complete)
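For reference, the metrics in the table are commonly defined as below. This is a sketch of the usual definitions, not the evaluation code used for this checkpoint; the card does not state which variants it uses (for example, pooled vs. per-token variance explained, or zero-ablation vs. mean-ablation as the loss-recovered baseline), so the choices here are assumptions.

```python
import numpy as np


def variance_explained(x, x_hat):
    """1 - (residual sum of squares / total sum of squares), pooled over a batch."""
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0, keepdims=True)) ** 2)
    return float(1.0 - residual / total)


def loss_recovered(ce_clean, ce_sae, ce_zero):
    """Share of the zero-ablation CE-loss gap closed by splicing in the SAE.

    ce_clean: next-token loss of the unmodified model
    ce_sae:   loss with layer-23 activations replaced by the SAE reconstruction
    ce_zero:  loss with layer-23 activations zero-ablated (assumed baseline)
    """
    return (ce_zero - ce_sae) / (ce_zero - ce_clean)


def alive_fraction(fire_counts):
    """Fraction of latents that fired at least once in the evaluation window."""
    return float(np.mean(np.asarray(fire_counts) > 0))


rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))
noisy = x + 0.1 * rng.normal(size=x.shape)   # stand-in for an SAE reconstruction
print(variance_explained(x, noisy))
print(loss_recovered(ce_clean=2.0, ce_sae=2.05, ce_zero=7.0))
```

Under this definition, a loss-recovered value of 0.99 means the SAE splice closes 99% of the gap between the clean model and the zero-ablated one.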

Validated downstream

Stage Gate 1 (G1) correlation pre-test on SuperGPQA:

Spearman ρ = 0.5217 (p = 2.62 × 10⁻⁸)
100 held-out questions
Comparable to Qwen3.5-4B (ρ = 0.540) with 46% of its training-token budget (92M vs 200M tokens)
First cross-architecture confirmation of the mechreward thesis (from dense GDN to triple-hybrid MoE+GDN+GA)
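The relation between the reported ρ, the sample size, and the p-value can be sketched in pure Python. The card does not say which two per-question quantities are correlated, so `a` and `b` below are placeholders. The p-value for Spearman's ρ is usually derived from a t statistic with n − 2 degrees of freedom; `scipy.stats.spearmanr` computes ρ and that p-value directly if SciPy is available.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = mean_rank
        i = j + 1
    return ranks


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman_rho(a, b):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(a), average_ranks(b))


def t_statistic(rho, n):
    """t with n - 2 degrees of freedom, the usual basis for rho's p-value."""
    return rho * math.sqrt(n - 2) / math.sqrt(1 - rho ** 2)


print(round(t_statistic(0.5217, 100), 2))  # 6.05
```

With n = 100 and ρ = 0.5217, t ≈ 6.05 on 98 degrees of freedom, which is consistent with a two-sided p-value on the order of 10⁻⁸.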

Coverage relative to Qwen-Scope (Apr 2026)

Qwen-Scope shipped official SAEs for the Qwen base + MoE families:

Model family	Qwen-Scope	This work
Qwen3.5-2B/9B/27B base	✅ W32K-W80K, L0_50/L0_100	—
Qwen3-30B-A3B base (MoE)	✅ W32K-W128K	—
Qwen3.5-35B-A3B base (MoE)	✅ W32K-W128K	—
Qwen3.6-35B-A3B reasoning	❌	✅ this repo
Qwen3.6-27B reasoning (dense)	❌	`qwen36-27b-sae-papergrade`

Qwen-Scope and this work are complementary: Qwen-Scope covers the base (pre-trained) variants, while OpenInterpretability covers the post-trained, reasoning-tuned variants, where mechanistic interpretability methods meet deployed models.

Loading

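```python
import torch
from huggingface_hub import hf_hub_download

# Download the checkpoint from the Hugging Face Hub
sae_path = hf_hub_download(
    repo_id="caiovicentino1/Qwen3.6-35B-A3B-SAE-L23-topk-wip",
    filename="sae_step_22674_92M_tokens.pt",
)

# Load on CPU (sae-lens compatible format).
# torch>=2.6 defaults to weights_only=True; pass weights_only=False only
# for a trusted file that stores non-tensor Python objects.
state = torch.load(sae_path, map_location="cpu")
print(f"Keys: {list(state.keys())}")
for name, value in state.items():
    print(name, tuple(value.shape) if hasattr(value, "shape") else type(value).__name__)
```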
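Because the checkpoint's key names are not documented here, a quick way to sanity-check it against the card is to list every array-like entry with its shape and look for matrices of size 2,048 × 32,768. The two helpers below are hypothetical (not part of any library or of this repo) and make no assumption about key names or nesting; they only rely on tensors exposing a `.shape` attribute.

```python
def tensor_shapes(obj, prefix=""):
    """Map 'dotted.key.path' -> shape for every array-like leaf in nested dicts."""
    shapes = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            shapes.update(tensor_shapes(value, f"{prefix}{key}."))
    elif hasattr(obj, "shape"):
        shapes[prefix.rstrip(".")] = tuple(obj.shape)
    return shapes


def find_dictionary_matrices(shapes, d_model=2048, d_sae=32768):
    """Names of entries shaped (d_model, d_sae) or (d_sae, d_model)."""
    wanted = {(d_model, d_sae), (d_sae, d_model)}
    return sorted(name for name, shape in shapes.items() if shape in wanted)
```

With the `state` loaded above, `find_dictionary_matrices(tensor_shapes(state))` should return the encoder and decoder weights if the checkpoint matches the dimensions in the Architecture table.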

Citation

If you use this SAE, please cite:

```bibtex
@misc{vicentino2026qwen36sae,
  author = {Vicentino, Caio Sanford Guimar{\~a}es},
  title = {Qwen3.6-35B-A3B SAE @ L23 — Work in Progress},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-SAE-L23-topk-wip},
}
```

Methodology lineage

Top-K + AuxK SAE (Gao et al. 2024, OpenAI): base methodology
Gemma Scope (Google DeepMind, 2024): analogous layer-coverage approach for the Gemma family
Qwen-Scope (Alibaba, Apr 2026): analogous coverage for the Qwen3.5/Qwen3 base and MoE families
SAEBench / InterpScore: evaluation criteria
Mechreward Stage 4 (this work): downstream validation via reasoning-trace correlation

Status / Roadmap

Completed:
S4-L0 layer mapping (peak L22, target L23)
SAE training to 92M tokens (var_exp 0.835)
G1 correlation pre-test passed (ρ = 0.5217)

Planned:
Continue training to 200M+ tokens
G2: tiny RL run on contrastive features
G3: full RL with mech-reward
Publish the probe-validated feature set
Submit to the ICML / NeurIPS Mechanistic Interpretability Workshop

Part of OpenInterpretability

Built as part of the OpenInterpretability ecosystem: open-source mechanistic interpretability tooling for frontier reasoning models.

Other artifacts:

caiovicentino1/qwen36-27b-sae-papergrade: paper-grade SAE for Qwen3.6-27B reasoning (dense)
caiovicentino1/FabricationGuard-linearprobe-qwen36-27b: production hallucination probe
caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b: reasoning-faithfulness probe (narrow scope, stated honestly)
ProbeBench: category-based leaderboard for activation probes
