Qwen3.6-27B Multi-Layer TopK SAE

First public Sparse Autoencoder on the Qwen3.6 family. Three TopK SAEs on the residual stream at layers L11 / L31 / L55 (early / mid / late Gated-Attention positions) of Qwen/Qwen3.6-27B.

[Figure: SAE variance explained]


TL;DR

| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.6-27B (dense, 64 layers, hidden 5120) |
| Layers covered | L11, L31, L55 (all Gated-Attention positions) |
| Architecture | TopK SAE (Anthropic/Jolly Anh recipe) |
| Features per SAE | 4096 |
| Active per token (k) | 32 |
| Training tokens | 200,000 response tokens per layer |
| Variance explained | L11 93.5% / L31 86.4% / L55 77.9% |
| License | Apache-2.0 |

Released 2 days after Qwen3.6-27B (2026-04-21). First SAE on any member of the Qwen3.6 family.


First-in-class claim

Verified via HuggingFace search (April 2026):

  • 0 existing Qwen3.6 family SAEs before this repo (145 Qwen3.6 repos exist; all are quantizations, fine-tunes, or base weights, and none are SAEs)
  • Prior Qwen SAE work covers only older dense variants: adamkarvonen's Qwen3 (1.7B–32B), hippotoday/gzxiong Qwen3-4B, kroonen-ai Qwen3.5-9B, and our own Qwen3.5-4B.
  • This is the first SAE on a hybrid GDN+Gated-Attention architecture with dense FFN.

[Figure: Qwen SAE landscape]


Training details

Data collection

  • 150 greedy rollouts on SuperGPQA multiple-choice questions (seed 100, from MCR Stage B corpus)
  • Per-token residuals captured at L11/L31/L55 via forward hooks during generation
  • Total response tokens: 355,038; capped to 200,000 per layer (random sample) for SAE training
  • Rollout correctness rate: 67.3% (101/150); labels are stored for downstream feature-correctness analysis
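
The figures in the list above can be reproduced with a few lines of arithmetic. This is a sketch only: the seed and the use of `random.sample` are assumptions for illustration, and the card does not state whether the same token indices are reused for all three layers.

```python
import random

TOTAL_TOKENS = 355_038   # response tokens collected across the 150 rollouts
CAP = 200_000            # tokens kept per layer for SAE training

# Rollout correctness rate reported above: 101 correct out of 150.
accuracy = 101 / 150
print(f"accuracy = {accuracy:.1%}")

# Draw token indices without replacement (hypothetical seed, not the author's).
rng = random.Random(0)
keep = sorted(rng.sample(range(TOTAL_TOKENS), CAP))
print(f"kept {len(keep)} tokens, {CAP / TOTAL_TOKENS:.1%} of the total")
```

Sampling without replacement keeps a little over half of the collected tokens, so each rollout still contributes to the training set in rough proportion to its length.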

SAE architecture

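import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=4096, k=32):
        super().__init__()
        self.k = k  # number of features kept active per token
        # Zero placeholders; trained values come from the released state dicts.
        self.W_enc = nn.Parameter(torch.zeros(d_in, n))  # [d_in, n]
        self.b_enc = nn.Parameter(torch.zeros(n))        # [n]
        self.W_dec = nn.Parameter(torch.zeros(n, d_in))  # [n, d_in]
        self.b_dec = nn.Parameter(torch.zeros(d_in))     # [d_in]

    def encode(self, x):
        # Center on the decoder bias, then project into feature space.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # Keep the k largest pre-activations per token; ReLU zeroes any negative survivors.
        top_v, top_i = pre.topk(self.k, dim=-1)
        z = torch.zeros_like(pre).scatter_(-1, top_i, F.relu(top_v))
        return z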
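
The class above shows only the encoder, while the card reports variance explained, which also requires a decoder. The NumPy sketch below fills that gap under two assumptions that are not confirmed by the card: the decoder is the usual linear map `z @ W_dec + b_dec`, and variance explained is one minus the squared reconstruction error divided by the total variance around the per-dimension mean. Sizes and weights are toy values.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, b_dec, k):
    """NumPy version of the encode step shown above."""
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest pre-activations per row (unordered).
    top_i = np.argpartition(pre, -k, axis=-1)[..., -k:]
    top_v = np.take_along_axis(pre, top_i, axis=-1)
    z = np.zeros_like(pre)
    np.put_along_axis(z, top_i, np.maximum(top_v, 0.0), axis=-1)
    return z

def decode(z, W_dec, b_dec):
    # Assumed decoder: linear map back to the residual stream plus bias.
    return z @ W_dec + b_dec

def variance_explained(x, x_hat):
    # Assumed definition: 1 - SSE / total sum of squares around the mean.
    sse = ((x - x_hat) ** 2).sum()
    sst = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - sse / sst

# Toy sizes and random weights, only to exercise the functions.
rng = np.random.default_rng(0)
d_in, n, k = 16, 12, 4
x = rng.normal(size=(64, d_in))
W_enc = rng.normal(size=(d_in, n)) / np.sqrt(d_in)
W_dec = W_enc.T.copy()
b_enc = np.zeros(n)
b_dec = x.mean(axis=0)

z = topk_encode(x, W_enc, b_enc, b_dec, k)
x_hat = decode(z, W_dec, b_dec)
print("max active per token:", int((z != 0).sum(axis=-1).max()))
print("variance explained (untrained toy):", float(variance_explained(x, x_hat)))
```

Because ReLU is applied after the top-k selection, a token can end up with fewer than k non-zero features whenever one of its selected pre-activations is negative.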

Hyperparameters

| Param | Value |
|---|---|
| d_model (residual width) | 5120 |
| n_features | 4096 (0.8× expansion) |
| k (active features / token) | 32 |
| epochs | 20 |
| batch size | 4096 |
| optimizer | Adam, lr=3e-4 |
| decoder constraint | row-norm projection after each step |
| dead-feature revival | every 5 epochs if no fire in ≥ 500 steps |
| precision | bf16 activations, fp32 SAE weights |
| hardware | single NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| wall-clock | ~1.5 h for 3 SAEs |
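
A few derived quantities help put these hyperparameters in context. The sketch below is plain arithmetic on the values in the table; the only assumption is that the final partial batch of each epoch is kept, which the card does not state.

```python
import math

n_tokens, batch_size, epochs = 200_000, 4096, 20
d_model, n_features, k = 5120, 4096, 32

# Assumption: the last partial batch of each epoch is kept, hence ceil().
steps_per_epoch = math.ceil(n_tokens / batch_size)
total_steps = steps_per_epoch * epochs

expansion = n_features / d_model      # dictionary size relative to d_model
active_fraction = k / n_features      # share of features active per token

# The revival rule is stated in steps; express it in epochs for comparison.
revival_window_epochs = 500 / steps_per_epoch

print(steps_per_epoch, total_steps)       # 49 980
print(expansion, active_fraction)         # 0.8 0.0078125
print(round(revival_window_epochs, 1))    # 10.2
```

Under this assumption one epoch is 49 optimizer steps, so a 500-step silence window spans roughly ten epochs, and fewer than 1% of the dictionary's features are active on any given token.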

Dead feature revival counts

During training, revival reinitializes unused decoder rows with random unit vectors:

| Layer | Revived in 20 epochs |
|---|---|
| L11 | 5,636 (features churned through many initializations before settling) |
| L31 | 894 |
| L55 | 54 |

Final healthy features (fire rate ∈ [0.5%, 30%]) per layer: ~750 (L11), ~2,900 (L31), ~3,800 (L55).
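
The decoder constraint, the revival step, and the "healthy" fire-rate band can be sketched as follows. This is an illustrative NumPy reconstruction from the descriptions in this card, not the training code: the function names are hypothetical, and whether encoder columns or biases are also reset on revival is not stated here.

```python
import numpy as np

def project_rows_to_unit_norm(W_dec, eps=1e-8):
    """Rescale every decoder row to unit L2 norm."""
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    return W_dec / np.maximum(norms, eps)

def revive_dead_rows(W_dec, steps_since_fire, threshold=500, rng=None):
    """Replace rows of features silent for >= threshold steps with random unit vectors."""
    rng = np.random.default_rng() if rng is None else rng
    dead = steps_since_fire >= threshold
    n_dead = int(dead.sum())
    fresh = rng.normal(size=(n_dead, W_dec.shape[1]))
    W_new = W_dec.copy()
    W_new[dead] = project_rows_to_unit_norm(fresh)
    return W_new, n_dead

def healthy_mask(fire_rate, lo=0.005, hi=0.30):
    """True where a feature's fire rate lies in the band [0.5%, 30%]."""
    return (fire_rate >= lo) & (fire_rate <= hi)

# Toy example: 6 features in a 4-dimensional residual space.
rng = np.random.default_rng(0)
W = project_rows_to_unit_norm(rng.normal(size=(6, 4)))
steps_since_fire = np.array([0, 600, 10, 500, 499, 0])
W, n_revived = revive_dead_rows(W, steps_since_fire, rng=rng)
print("revived:", n_revived)

rates = np.array([0.001, 0.005, 0.12, 0.30, 0.95])
print("healthy:", healthy_mask(rates).tolist())
```

A feature that is revived repeatedly counts once per revival, which is how a layer with 4096 features can accumulate 5,636 revivals.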


Dense vs MoE: cleaner features

[Figure: Dense vs MoE]

Compared to our prior SAE work on Qwen3.6-35B-A3B MoE (caiovicentino1/qwen36-feature-circuits):

| Layer | MoE var_expl (35B-A3B) | Dense var_expl (27B, this repo) | Δ |
|---|---|---|---|
| early (L11) | 0.77 | 0.935 | +16.5 pp |
| mid (L17 MoE / L31 dense) | 0.82 | 0.864 | +4.4 pp |
| late (L23 MoE / L55 dense) | 0.77 | 0.779 | +0.9 pp |

Dense 27B residual streams are reconstructed more compactly by a TopK SAE, mainly at the early layer: the gap is +16.5 pp at L11 but shrinks to +0.9 pp at the late layer. The MoE case likely fragments features across 256 experts (9 active per token), whereas dense FFN contributes a single unified computation per layer.

This is consistent with the hypothesis that the apparent "illusory circuits" in our prior 35B MoE paper stem from expert-routing fragmentation rather than from hybrid-architecture fundamentals, although the two models also differ in depth and in the layers probed.


Feature analysis

Per-feature correctness AUROC

For each SAE, we encoded the 200k training tokens and computed a per-feature binary AUROC for correctness (rollout was correct vs wrong). The maximum AUROC per layer is modest (0.509–0.548), and every feature reported in this card falls within 0.47–0.55:

[Figure: Feature AUROC]

| Layer | Max AUROC | Feature | Fire rate | Direction |
|---|---|---|---|---|
| L11 | 0.535 | f4053 | 12% | correct↑ |
| L31 | 0.548 | f3383 | 52% | correct↑ |
| L55 | 0.509 | f590 | 2.2% | correct↑ |

Individual features are weakly predictive on their own; none functions as a standalone classifier. The value lies in the interpretable token-activation semantics described below.
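
For readers who want to recompute a per-feature AUROC, here is a minimal dependency-free sketch using the rank-based (Mann-Whitney) formulation. It is not the author's evaluation script; the toy activations and labels are invented for illustration. Tie handling matters here because a sparse feature is exactly zero on most tokens.

```python
def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U) with average ranks for ties.

    scores: feature activation per token; labels: 1 = correct rollout, 0 = wrong.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy feature: zero on most tokens, active on two tokens from correct rollouts.
acts = [0.0, 0.0, 1.2, 0.0, 0.7, 0.0]
labels = [0, 1, 1, 0, 1, 0]
print(round(auroc(acts, labels), 3))
```

A value above 0.5 corresponds to the "correct↑" direction in the table and a value below 0.5 to "wrong↑"; a feature that never fires scores exactly 0.5.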

Semantic discoveries

[Figure: Semantic findings]

Tracing top-activating tokens per feature yields coherent semantic categories:

  • L11 f2503 (wrong↑, AUROC 0.494): fires on "perfectly", "plausible", "matches perfectly". Overconfidence marker: when the model writes these words, a wrong answer is slightly more likely to follow. Analogous to human "cognitive ease" biases.

  • L31 f3383 (correct↑, AUROC 0.548): fires on "Fleet", "Arsenal", "AHP", "Zeng Guofan", "Zuo Zongtang". Specific domain terminology recognition. When the model retrieves domain-specific names or technical terms, its reasoning tends to align with correct answers.

  • L31 f3078 (correct↑): fires on historical named entities such as "Guofan" and "Zongtang". Factual recall channel.

  • L31 f452 (wrong↑, AUROC 0.475): fires on generic domain words such as "engineering", "textbook", "music". Surface-level pattern matching: the model may be confusing domains, a possible pre-error signal.

  • L55 f2897 (95% fire rate on generic tokens "a", "the", "not"): effectively a noise-floor feature, not informative.

These semantic patterns hold across layers regardless of classifier performance. They demonstrate that the SAEs are learning coherent, interpretable structure from the dense 27B residual stream.
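
Tracing top-activating tokens amounts to ranking token positions by one feature's activation. The sketch below shows the idea on invented data: the token strings echo the examples above, but the activation values, the sparse-dict layout, and the helper names are hypothetical and do not reflect the schema of `feature_characterization.json`.

```python
import heapq

def top_activating_tokens(tokens, activations, feature, top_n=3):
    """Return up to top_n (activation, token) pairs for one feature, largest first."""
    fired = [(acts[feature], tok)
             for tok, acts in zip(tokens, activations)
             if acts.get(feature, 0.0) > 0.0]
    return heapq.nlargest(top_n, fired)

def fire_rate(activations, feature):
    """Fraction of positions where the feature is active."""
    active = sum(1 for acts in activations if acts.get(feature, 0.0) > 0.0)
    return active / len(activations)

# One sparse {feature_id: activation} dict per token position (made-up values).
tokens = ["The", "Fleet", "was", "Arsenal", "the"]
activations = [{7: 0.1}, {3383: 2.5, 7: 0.2}, {7: 0.3}, {3383: 1.9}, {7: 0.4}]

top = top_activating_tokens(tokens, activations, feature=3383)
print(top)
print(fire_rate(activations, 3383), fire_rate(activations, 7))
```

Reading the top tokens alongside the fire rate helps separate a selective feature from one that fires almost everywhere, such as the noise-floor feature listed above.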


Files

qwen36-27b-sae-multilayer/
├── sae_L11_n4096_k32.pt          (160 MB, SAE state dict)
├── sae_L31_n4096_k32.pt          (160 MB)
├── sae_L55_n4096_k32.pt          (160 MB)
├── activations.safetensors       (5.8 GB, 200k × 3 layers × 5120 dim, fp16)
├── labels.npy                    (per-token correctness labels)
├── feature_stats.json            (AUROC + fire rate for all 4096 features per layer)
├── feature_characterization.json (top-activating tokens for each top feature)
├── final_summary.json            (training summary)
└── 01-05_*.png                   (analysis charts)
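
The listed file sizes can be sanity-checked from the tensor shapes. This is back-of-the-envelope arithmetic that ignores file-format overhead and assumes exactly 200,000 tokens per layer.

```python
d_in, n = 5120, 4096

# Parameters in one SAE: W_enc, b_enc, W_dec, b_dec.
sae_params = d_in * n + n + n * d_in + d_in
sae_bytes = sae_params * 4            # fp32 weights, 4 bytes each

# Activations: 200k tokens x 3 layers x 5120 dims, fp16 (2 bytes each).
act_bytes = 200_000 * 3 * d_in * 2

print(f"SAE params: {sae_params:,}")
print(f"SAE size:   {sae_bytes / 2**20:.1f} MiB")
print(f"Activations: {act_bytes / 1e9:.2f} GB ({act_bytes / 2**30:.2f} GiB)")
```

Each state dict comes out at about 160 MiB and the activation file at roughly 5.7 GiB, in line with the sizes listed above.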

Usage

Load SAE for feature analysis

import torch
import torch.nn as nn
from huggingface_hub import snapshot_download

sae_dir = snapshot_download('caiovicentino1/qwen36-27b-sae-multilayer',
                             allow_patterns=['sae_L*.pt'])

class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=4096):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_in, n))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.W_dec = nn.Parameter(torch.zeros(n, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x, k=32):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(k, dim=-1)
        z = torch.zeros_like(pre).scatter_(-1, top_i, torch.relu(top_v))
        return z, top_i, top_v

saes = {}
for L in [11, 31, 55]:
    sae = TopKSAE()
    sae.load_state_dict(torch.load(f'{sae_dir}/sae_L{L}_n4096_k32.pt',
                                     weights_only=True, map_location='cpu'))
    saes[L] = sae

Hook Qwen3.6-27B, encode residuals

from transformers import AutoTokenizer, AutoModelForImageTextToText

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3.6-27B', trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    'Qwen/Qwen3.6-27B', dtype=torch.bfloat16, device_map='cuda',
    attn_implementation='sdpa', trust_remote_code=True)

# Hook L31 residual
chunks = []
def hook(mod, inp, out):
    h = out[0] if isinstance(out, tuple) else out
    chunks.append(h[:, -1, :].detach().float().cpu())
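layer = model.model.language_model.layers[31]
handle = layer.register_forward_hook(hook)  # keep the handle so the hook can be removed

# Generate, then inspect L31 features.
# The hook fires once per forward pass, so chunks[0] comes from the prompt pass.
prompt = "The main cause of the Self-Strengthening Movement in China was"
ids = tok(prompt, return_tensors='pt').input_ids.cuda()
with torch.no_grad():
    model.generate(ids, max_new_tokens=200, do_sample=False,
                    pad_token_id=tok.pad_token_id, use_cache=True)
handle.remove()  # stop capturing on later forward passes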

# Analyze features at last prompt token
sae = saes[31].cuda()
z, top_i, top_v = sae.encode(chunks[0].cuda())
top_features = top_i[0].tolist()
print(f'Top-10 active L31 features: {top_features[:10]}')
# e.g., check if f3383 ("domain terminology") activates on this history prompt

Limitations

  • Per-feature AUROC is weak (max 0.548). Features are interpretable semantically but not strong standalone classifiers. The repo's value is in interpretability primitives, not in providing a correctness predictor.
  • Training sample is small (150 rollouts, capped at 200k tokens). A larger training set may yield cleaner features and more discriminative signal.
  • Single domain: SuperGPQA STEM+humanities MCQ. Out-of-distribution behavior (code, pure math, creative writing) is untested.
  • No causal validation: we report feature correlations with correctness, not ablation effects. Features that correlate with correctness may not causally drive reasoning.
  • Inference-time steering interventions were explored but are not reported here. The SAE artifacts stand independently of any downstream steering results.

Related work

| Reference | Relation |
|---|---|
| caiovicentino1/qwen36-feature-circuits | Prior Qwen3.6-35B-A3B MoE SAE study (negative across 4 substrates). Dense 27B here has much cleaner features (+16.5 pp var_expl at L11). |
| adamkarvonen/qwen3-*-saes | Dense Qwen3 SAEs (1.7B–32B). Reference for standard Qwen SAE methodology. |
| Gemma Scope (arXiv:2408.05147) | Canonical large-scale SAE library on Gemma. Our L11 var_expl (93.5%) is on par with their best layers. |
| TopK SAE (arXiv:2406.04093) | Makhzani/OpenAI TopK architecture; the recipe used here. |

Citation

@misc{vicentino2026qwen27bsae,
  title   = {Qwen3.6-27B Multi-Layer TopK SAE (First SAE on Qwen3.6 Family)},
  author  = {Vicentino, Caio},
  year    = {2026},
  howpublished = {\url{https://huggingface.co/caiovicentino1/qwen36-27b-sae-multilayer}}
}

License

Apache-2.0 for SAE weights, analysis artifacts, and code. Base model under Apache-2.0 at Qwen/Qwen3.6-27B.
