Qwen3.6-35B-A3B SAE @ residual L23 (work in progress)
Top-K Sparse Autoencoder trained on the residual stream at layer 23 of Qwen3.6-35B-A3B, a hybrid Mixture-of-Experts + Gated DeltaNet + Gated Attention reasoning model.
This is the first public SAE for Qwen3.6 reasoning models (the post-trained reasoning-tuned series). It is complementary to Qwen-Scope (Apr 27 2026), which covers the Qwen3.5/Qwen3 base + MoE families; see "Coverage" below.
Architecture
| Item | Value |
|---|---|
| Base model | Qwen/Qwen3.6-35B-A3B (hybrid MoE+GDN+GA, 40 layers) |
| SAE position | residual stream at layer 23 |
| Type | Top-K + AuxK (Gao et al. 2024) |
| Dictionary width | d_sae = 32,768 (16× expansion over d_model=2,048) |
| Sparsity | k=128 active features |
| AuxK | k_aux=2,560 for dead-feature mitigation |
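
The Top-K activation named in the table can be illustrated with a dependency-free toy. This is a minimal sketch, assuming the formulation from Gao et al. 2024 (encode the input minus the decoder bias, keep the k largest pre-activations, apply ReLU, decode). The function name `topk_sae_forward`, the parameter names, and the toy sizes are illustrative assumptions, not this checkpoint's actual code or layout, and the AuxK auxiliary loss is omitted.

```python
import random

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Toy Top-K SAE forward pass on plain Python lists.

    x:     d_model floats
    W_enc: d_sae rows of d_model floats (encoder)
    W_dec: d_sae rows of d_model floats (one decoder direction per feature)
    """
    # Encoder pre-activations on the input with the decoder bias subtracted.
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [sum(w * c for w, c in zip(row, centered)) + b
           for row, b in zip(W_enc, b_enc)]
    # Keep the indices of the k largest pre-activations; all others stay zero.
    keep = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    acts = [0.0] * len(pre)
    for i in keep:
        acts[i] = max(pre[i], 0.0)  # ReLU on the kept entries
    # Decode: decoder bias plus the weighted sum of the kept decoder rows.
    recon = list(b_dec)
    for i in keep:
        for j, w in enumerate(W_dec[i]):
            recon[j] += acts[i] * w
    return acts, recon

random.seed(0)
d_model, d_sae, k = 4, 16, 3  # toy sizes; the table lists 2,048 / 32,768 / 128
W_enc = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
W_dec = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_sae)]
b_enc = [0.0] * d_sae
b_dec = [0.0] * d_model
x = [random.gauss(0, 1) for _ in range(d_model)]

acts, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
n_active = sum(a != 0.0 for a in acts)
print(n_active, "active features out of", d_sae)
```

At the sizes in the table, at most 128 of 32,768 features are active per token, roughly 0.4% of the dictionary.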
Layer choice rationale
S4-L0 layer mapping (project memory: project_mechreward_full_state_20260417) identified:
- Peak contrastive Cohen's d at L22 (55% depth)
- Logit-lens commit at L39 (17-layer gap, largest measured across architectures)
- L23 = downstream Gated Attention adjacent to peak (following the Qwen3.5-4B peak+5 pattern)
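The layer-selection logic above can be sketched as a per-layer effect-size scan. This is a minimal sketch, assuming the textbook pooled-standard-deviation form of Cohen's d; the card does not state which estimator or which contrastive groups were used, and `cohens_d`, `peak_layer`, and the toy numbers are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

def peak_layer(per_layer_d):
    """Index of the layer with the largest absolute effect size."""
    return max(range(len(per_layer_d)), key=lambda i: abs(per_layer_d[i]))

# Toy contrast: one scalar activation statistic per example in each group.
d_toy = cohens_d([2, 4, 6], [1, 3, 5])

# Toy per-layer scan (made-up values, not the measured curve).
peak_toy = peak_layer([0.1, 0.4, 0.9, 0.3])

# Arithmetic behind the figures quoted in the list above.
n_layers, peak, commit = 40, 22, 39
relative_depth = peak / n_layers  # fraction of the way through the stack
commit_gap = commit - peak        # layers between peak and logit-lens commit
print(d_toy, peak_toy, relative_depth, commit_gap)
```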
Training status (WIP)
| Metric | Value |
|---|---|
| Tokens processed | 92M (step 22,674) |
| Variance explained (avg) | 0.835 |
| Variance explained (peak) | 0.866 |
| Loss-recovered | (in progress, target ≥0.99) |
| Alive-feature fraction | (in progress, target ≥0.95) |
| Training corpus | FineWeb-Edu |
| Target tokens | 1B (currently 9.2% complete) |
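
For readers unfamiliar with the metrics in this table, the sketch below shows one common way to compute them. These definitions are assumptions: the card does not say which variant of variance explained was logged, nor which ablation baseline (zero or mean) the loss-recovered target refers to. The function names and toy inputs are illustrative.

```python
def variance_explained(xs, recons):
    """1 - residual sum of squares / total sum of squares about the mean."""
    d = len(xs[0])
    mu = [sum(x[j] for x in xs) / len(xs) for j in range(d)]
    rss = sum((a - b) ** 2 for x, r in zip(xs, recons) for a, b in zip(x, r))
    tss = sum((a - m) ** 2 for x in xs for a, m in zip(x, mu))
    return 1.0 - rss / tss

def loss_recovered(loss_clean, loss_sae, loss_ablated):
    """Fraction of the ablation-induced loss gap closed by splicing in the SAE."""
    return (loss_ablated - loss_sae) / (loss_ablated - loss_clean)

xs = [[1.0, 0.0], [3.0, 2.0]]                          # toy activations
ve_perfect = variance_explained(xs, xs)                # exact reconstruction
ve_partial = variance_explained(xs, [[1.0, 0.0], [3.0, 1.0]])
lr_toy = loss_recovered(loss_clean=2.0, loss_sae=2.1, loss_ablated=12.0)

# Progress figure from the table: tokens processed over target tokens.
progress = 92e6 / 1e9
print(ve_perfect, ve_partial, lr_toy, progress)
```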
Validated downstream
Stage Gate 1 (G1) correlation pre-test on SuperGPQA:
- Spearman ρ = 0.5217 (p = 2.62 × 10⁻⁸)
- 100 held-out questions
- Matches Qwen3.5-4B (ρ=0.540) with 46% of training-token budget (92M vs 200M)
- First cross-architecture confirmation of mechreward thesis (dense GDN → triple-hybrid MoE+GDN+GA)
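The rank correlation reported above can be reproduced in principle with a few lines of standard-library Python. This is a minimal sketch of Spearman's ρ (Pearson correlation on average ranks, with ties sharing a rank). The card does not specify which two per-question quantities were correlated, so `feature_scores` and `outcomes` are hypothetical placeholders, and the p-value is not computed here.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(va * vb)

def spearman_rho(a, b):
    """Spearman rank correlation: Pearson on the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))

# Hypothetical per-question inputs (placeholders, not the G1 data).
feature_scores = [0.2, 0.5, 0.1, 0.9]
outcomes = [0, 1, 0, 1]
rho_toy = spearman_rho(feature_scores, outcomes)
print(round(rho_toy, 4))
```

With SciPy available, `scipy.stats.spearmanr` returns the same statistic together with a p-value.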
Coverage relative to Qwen-Scope (Apr 2026)
Qwen-Scope shipped official SAEs for the Qwen base + MoE families:
| Model family | Qwen-Scope | This work |
|---|---|---|
| Qwen3.5-2B/9B/27B base | ✅ W32K-W80K, L0_50/L0_100 | ❌ |
| Qwen3-30B-A3B base (MoE) | ✅ W32K-W128K | ❌ |
| Qwen3.5-35B-A3B base (MoE) | ✅ W32K-W128K | ❌ |
| Qwen3.6-35B-A3B reasoning | ❌ | ✅ this repo |
| Qwen3.6-27B reasoning (dense) | ❌ | qwen36-27b-sae-papergrade |
Qwen-Scope and this work are complementary: Qwen-Scope covers the base / pre-trained variants; OpenInterp covers the post-trained reasoning-tuned variants where mechanistic interpretability methodology meets deployment.
Loading
import torch
from huggingface_hub import hf_hub_download

# Download checkpoint
sae_path = hf_hub_download(
    repo_id="caiovicentino1/Qwen3.6-35B-A3B-SAE-L23-topk-wip",
    filename="sae_step_22674_92M_tokens.pt",
)

# Load (sae-lens compatible format)
state = torch.load(sae_path, map_location="cpu")
print(f"Keys: {list(state.keys())}")
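The checkpoint's key names are not documented in this card, so a reasonable first step after loading is to list each entry with its shape. The helper below is a minimal sketch that works on any nested dict. `describe_state`, the `_Stub` stand-in, and the demo keys `W_enc` and `meta` are hypothetical and used only so the example runs without downloading the file.

```python
def describe_state(state, prefix=""):
    """Flatten a (possibly nested) checkpoint dict into {name: shape or type}."""
    summary = {}
    for key, value in state.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested dicts such as optimizer or config sections.
            summary.update(describe_state(value, prefix=f"{name}."))
        elif hasattr(value, "shape"):
            # Tensors and arrays expose .shape; record it as a plain tuple.
            summary[name] = tuple(value.shape)
        else:
            summary[name] = type(value).__name__
    return summary

class _Stub:
    """Stand-in for a tensor so the sketch runs without the real checkpoint."""
    def __init__(self, *shape):
        self.shape = shape

# Hypothetical layout; the real keys may differ.
demo_state = {"W_enc": _Stub(2048, 32768), "meta": {"k": 128}}
demo_summary = describe_state(demo_state)
for name, info in demo_summary.items():
    print(name, info)
```

On the real checkpoint, calling `describe_state(state)` on the dict returned by `torch.load` would show which entries hold the encoder and decoder weights before any names are hard-coded.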
Citation
If you use this SAE, please cite:
@misc{vicentino2026qwen36sae,
author = {Vicentino, Caio Sanford Guimar{\~a}es},
title = {Qwen3.6-35B-A3B SAE @ L23 --- Work in Progress},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/caiovicentino1/Qwen3.6-35B-A3B-SAE-L23-topk-wip},
}
Methodology lineage
- Top-K + AuxK SAE (Gao et al. 2024, OpenAI): base methodology
- Gemma Scope (DeepMind 2024): analogous layer-coverage approach for Gemma family
- Qwen-Scope (Alibaba Apr 2026): analogous for Qwen3.5/Qwen3 base/MoE families
- SAEBench / InterpScore: for evaluation criteria
- Mechreward Stage 4 (this work): downstream validation via reasoning-trace correlation
Status / Roadmap
- S4-L0 layer mapping (peak L22, target L23)
- SAE training to 92M tokens (var_exp 0.835)
- G1 correlation pre-test passed (ρ=0.5217)
- Continue training to 200M+ tokens
- G2 tiny RL on contrastive features
- G3 full RL with mech-reward
- Publish probe-validated feature set
- Submit to ICML / NeurIPS Mech Interp Workshop
Part of OpenInterpretability
Built as part of the OpenInterpretability ecosystem: open-source mechanistic interpretability tooling for frontier reasoning models.
Other artifacts:
- caiovicentino1/qwen36-27b-sae-papergrade: paper-grade SAE for Qwen3.6-27B reasoning (dense)
- caiovicentino1/FabricationGuard-linearprobe-qwen36-27b: production hallucination probe
- caiovicentino1/ReasoningGuard-linearprobe-qwen36-27b: reasoning faithfulness probe (honest narrow scope)
- ProbeBench: categorical leaderboard for activation probes