Qwen3.6-27B Multi-Layer TopK SAE
First public Sparse Autoencoder on the Qwen3.6 family. Three TopK SAEs on the residual stream at layers L11 / L31 / L55 (early / mid / late Gated-Attention positions) of Qwen/Qwen3.6-27B.
TL;DR
| property | value |
|---|---|
| Base model | Qwen/Qwen3.6-27B (dense, 64 layers, hidden 5120) |
| Layers covered | L11, L31, L55 (all Gated-Attention positions) |
| Architecture | TopK SAE (OpenAI recipe, arXiv:2406.04093) |
| Features per SAE | 4096 |
| Active per token (k) | 32 |
| Training tokens | 200,000 response tokens per layer |
| Variance explained | L11 93.5% / L31 86.4% / L55 77.9% |
| License | Apache-2.0 |
Released 2 days after Qwen3.6-27B (2026-04-21). First SAE on any member of the Qwen3.6 family.
First-in-class claim
Verified via HuggingFace search (April 2026):
- 0 existing Qwen3.6 family SAEs before this repo (145 Qwen3.6 repos exist, all quantizations, fine-tunes, or base weights; none are SAEs)
- Prior Qwen SAE work covers only older dense variants: adamkarvonen's Qwen3 (1.7B–32B), hippotoday/gzxiong Qwen3-4B, kroonen-ai Qwen3.5-9B, and our own Qwen3.5-4B.
- This is the first SAE on a hybrid GDN+Gated-Attention architecture with dense FFN.
Training details
Data collection
- 150 greedy rollouts on SuperGPQA multiple-choice questions (seed 100, from MCR Stage B corpus)
- Per-token residuals captured at L11/L31/L55 via forward hooks during generation
- Total response tokens: 355,038; capped to 200,000 per layer (random sample) for SAE training
- Rollout correctness rate: 67.3% (101/150); labels are stored for downstream feature-correctness analysis
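The capping step in the list above (355,038 response tokens down to 200,000) can be sketched as a seeded random sample of token indices. This is a minimal illustration only: the function name `sample_token_indices` and the seed are hypothetical, and whether the same indices were shared across the three layers is an assumption.

```python
import random

def sample_token_indices(total_tokens, cap, seed=0):
    """Pick at most `cap` distinct token indices uniformly at random."""
    rng = random.Random(seed)  # hypothetical seed, not the one used for the rollouts
    if total_tokens <= cap:
        return list(range(total_tokens))
    # Sampling without replacement; sorting keeps tokens in their original order.
    return sorted(rng.sample(range(total_tokens), cap))

idx = sample_token_indices(355_038, 200_000)
```

Applying one index list to the L11, L31, and L55 activation tensors would keep the per-token correctness labels aligned across layers.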
SAE architecture
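import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=4096, k=32):
        super().__init__()
        self.k = k  # number of features kept active per token
        # Zero tensors are placeholders (as in the loading code under Usage);
        # the training-time initialization is not specified here.
        self.W_enc = nn.Parameter(torch.zeros(d_in, n))  # [d_in, n]
        self.b_enc = nn.Parameter(torch.zeros(n))        # [n]
        self.W_dec = nn.Parameter(torch.zeros(n, d_in))  # [n, d_in]
        self.b_dec = nn.Parameter(torch.zeros(d_in))     # [d_in]

    def encode(self, x):
        # Center on the decoder bias, project, then keep only the top-k pre-activations.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(self.k, dim=-1)
        # ReLU clips any negative top-k values to zero.
        z = torch.zeros_like(pre).scatter_(-1, top_i, F.relu(top_v))
        return z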
Hyperparameters
| param | value |
|---|---|
| d_model (residual width) | 5120 |
| n_features | 4096 (0.8× expansion) |
| k (active features / token) | 32 |
| epochs | 20 |
| batch size | 4096 |
| optimizer | Adam, lr=3e-4 |
| decoder constraint | row-norm projection after each step |
| dead-feature revival | every 5 epochs if no fire in ≥500 steps |
| precision | bf16 activations, fp32 SAE weights |
| hardware | single NVIDIA RTX PRO 6000 Blackwell (96 GB) |
| wall-clock | ~1.5 h for 3 SAEs |
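The table above lists Adam at lr=3e-4 with a row-norm projection of the decoder after each step. A minimal sketch of one such step follows, on toy dimensions. It assumes a plain MSE reconstruction loss, which this card does not state, and the names `topk_encode` and `train_step` are illustrative rather than taken from the repo.

```python
import torch
import torch.nn.functional as F

def topk_encode(x, W_enc, b_enc, b_dec, k):
    # Same encoder form as in the architecture section: keep the top-k pre-activations.
    pre = (x - b_dec) @ W_enc + b_enc
    top_v, top_i = pre.topk(k, dim=-1)
    return torch.zeros_like(pre).scatter_(-1, top_i, F.relu(top_v))

def train_step(params, opt, x, k=32):
    W_enc, b_enc, W_dec, b_dec = params
    z = topk_encode(x, W_enc, b_enc, b_dec, k)
    x_hat = z @ W_dec + b_dec            # linear decoder
    loss = F.mse_loss(x_hat, x)          # assumed loss, not confirmed by the card
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        # Row-norm projection: rescale every decoder row back to unit L2 norm.
        W_dec.div_(W_dec.norm(dim=1, keepdim=True).clamp_min(1e-8))
    return loss.item()

torch.manual_seed(0)
d_in, n = 64, 128                         # toy sizes, not the real 5120 / 4096
W_enc = torch.nn.Parameter(torch.randn(d_in, n) * 0.02)
b_enc = torch.nn.Parameter(torch.zeros(n))
W_dec = torch.nn.Parameter(torch.randn(n, d_in) * 0.02)
b_dec = torch.nn.Parameter(torch.zeros(d_in))
params = [W_enc, b_enc, W_dec, b_dec]
opt = torch.optim.Adam(params, lr=3e-4)
x = torch.randn(256, d_in)
losses = [train_step(params, opt, x, k=8) for _ in range(20)]
```

Keeping decoder rows at unit norm stops the SAE from shrinking activations while growing decoder weights, so activation magnitudes stay comparable across features.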
Dead feature revival counts
During training, revival reinitializes unused decoder rows with random unit vectors:
| layer | revived in 20 epochs |
|---|---|
| L11 | 5,636 (features churned through many initializations before settling) |
| L31 | 894 |
| L55 | 54 |
Final healthy features (fire rate ∈ [0.5%, 30%]) per layer: ~750 (L11), ~2,900 (L31), ~3,800 (L55).
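The revival rule and the healthy-feature count described above can be sketched as follows. This is an illustration under assumptions: the helper names, the `steps_since_fired` counter, and the choice to reset only decoder rows (the card does not say whether encoder weights are also reset) are hypothetical.

```python
import torch

def revive_dead_features(W_dec, steps_since_fired, threshold=500):
    """Reinitialize decoder rows of features that have not fired for >= threshold steps."""
    dead = steps_since_fired >= threshold
    n_dead = int(dead.sum())
    if n_dead > 0:
        with torch.no_grad():
            fresh = torch.randn(n_dead, W_dec.shape[1],
                                device=W_dec.device, dtype=W_dec.dtype)
            fresh = fresh / fresh.norm(dim=1, keepdim=True)  # random unit vectors
            W_dec[dead] = fresh
        steps_since_fired[dead] = 0  # restart the counter for revived features
    return n_dead

def count_healthy(z, lo=0.005, hi=0.30):
    """Count features whose fire rate over tokens lies in [lo, hi]."""
    fire_rate = (z > 0).float().mean(dim=0)
    return int(((fire_rate >= lo) & (fire_rate <= hi)).sum())

torch.manual_seed(0)
W_dec = torch.nn.Parameter(torch.randn(8, 4))
steps = torch.tensor([0, 600, 10, 500, 499, 0, 0, 1000])
n_revived = revive_dead_features(W_dec, steps)

z = torch.zeros(1000, 3)
z[:100, 0] = 1.0   # fires on 10% of tokens: healthy
z[:, 1] = 1.0      # fires on every token: too dense
z[:2, 2] = 1.0     # fires on 0.2% of tokens: too rare
n_healthy = count_healthy(z)
```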
Dense vs MoE: cleaner features
Compared to our prior SAE work on Qwen3.6-35B-A3B MoE (caiovicentino1/qwen36-feature-circuits):
| layer | MoE var_expl (35B-A3B) | dense var_expl (27B, this) | Δ |
|---|---|---|---|
| early (L11) | 0.77 | 0.935 | +16.5 pp |
| mid (L17 MoE / L31 dense) | 0.82 | 0.864 | +4.4 pp |
| late (L23 MoE / L55 dense) | 0.77 | 0.779 | +0.9 pp |
Dense 27B residual streams are more compactly represented by TopK SAE at early layers. The MoE case likely fragments features across 256 experts (9 active per token), whereas dense FFN contributes a single unified computation per layer.
This supports the hypothesis that MoE's apparent "illusory circuits" in our prior 35B paper stem from expert routing fragmentation, not from hybrid-architecture fundamentals.
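For reference, the var_expl figures compared above can be read as a fraction of variance explained by the reconstruction. The card does not give the exact formula, so the sketch below uses one common definition (one minus residual sum of squares over total sum of squares around the per-dimension mean) and should be treated as an assumption.

```python
import torch

def variance_explained(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return 1.0 - (resid / total).item()

torch.manual_seed(0)
x = torch.randn(100, 16)
perfect = variance_explained(x, x)  # exact reconstruction
# Predicting the per-dimension mean for every token explains no variance.
baseline = variance_explained(x, x.mean(dim=0, keepdim=True).expand_as(x))
```

Under this definition, the pp differences in the table are plain subtractions, for example 0.935 - 0.77 = 0.165, or +16.5 pp at the early layer.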
Feature analysis
Per-feature correctness AUROC
For each SAE, we encoded the 200k training tokens and computed per-feature binary AUROC for correctness (rollout was correct vs wrong). Max AUROC per layer is modest (all within 0.47–0.55):
| layer | max AUROC | feature | fire rate | direction |
|---|---|---|---|---|
| L11 | 0.535 | f4053 | 12% | correct↑ |
| L31 | 0.548 | f3383 | 52% | correct↑ |
| L55 | 0.509 | f590 | 2.2% | correct↑ |
Individual features are weakly predictive on their own; none functions as a standalone classifier. The value is in the interpretable token-activation semantics below.
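The per-feature AUROC above can be sketched with a rank-based (Mann-Whitney) computation, where the score is one feature's activation on each token and the label is whether that token's rollout was correct. This is a dependency-free illustration, not the script used for the reported numbers; the function name `auroc` is hypothetical.

```python
def auroc(scores, labels):
    """Rank-based AUROC; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied scores starting at position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

separable = auroc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])   # higher on correct tokens
inverted = auroc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1])    # higher on wrong tokens
all_tied = auroc([0.0, 0.0, 0.0, 0.0], [0, 1, 0, 1])    # feature never fires
```

Tie handling matters here because a TopK feature is exactly zero on most tokens. With correct as the positive class, values above 0.5 mean the feature fires more on correct rollouts and values below 0.5 mean it fires more on wrong ones.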
Semantic discoveries
Tracing top-activating tokens per feature yields coherent semantic categories:
- L11 f2503 (wrong↑, AUROC 0.494) fires on "perfectly", "plausible", "matches perfectly". Overconfidence marker: when the model writes these words, it often precedes a wrong answer. Analogous to human "cognitive ease" biases.
- L31 f3383 (correct↑, AUROC 0.548) fires on "Fleet", "Arsenal", "AHP", "Zeng Guofan", "Zuo Zongtang". Specific domain terminology recognition. When the model retrieves domain-specific names or technical terms, reasoning aligns with correct answers.
- L31 f3078 (correct↑) fires on historical named entities: "Guofan", "Zongtang". Factual recall channel.
- L31 f452 (wrong↑, AUROC 0.475) fires on generic domain words: "engineering", "textbook", "music". Surface-level pattern matching: the model is confusing domains, a pre-error signal.
- L55 f2897 (95% fire rate on generic tokens "a", "the", "not") is effectively a noise-floor feature, not informative.
These semantic patterns hold across layers regardless of classifier performance. They demonstrate that the SAEs are learning coherent, interpretable structure from the dense 27B residual stream.
Files
qwen36-27b-sae-multilayer/
├── sae_L11_n4096_k32.pt (160 MB, SAE state dict)
├── sae_L31_n4096_k32.pt (160 MB)
├── sae_L55_n4096_k32.pt (160 MB)
├── activations.safetensors (5.8 GB, 200k × 3 layers × 5120 dim, fp16)
├── labels.npy (per-token correctness labels)
├── feature_stats.json (AUROC + fire rate for all 4096 features per layer)
├── feature_characterization.json (top-activating tokens for each top feature)
├── final_summary.json (training summary)
└── 01-05_*.png (analysis charts)
Usage
Load SAE for feature analysis
import torch
import torch.nn as nn
from huggingface_hub import snapshot_download

sae_dir = snapshot_download('caiovicentino1/qwen36-27b-sae-multilayer',
                            allow_patterns=['sae_L*.pt'])

class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=4096):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_in, n))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.W_dec = nn.Parameter(torch.zeros(n, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x, k=32):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(k, dim=-1)
        z = torch.zeros_like(pre).scatter_(-1, top_i, torch.relu(top_v))
        return z, top_i, top_v

saes = {}
for L in [11, 31, 55]:
    sae = TopKSAE()
    sae.load_state_dict(torch.load(f'{sae_dir}/sae_L{L}_n4096_k32.pt',
                                   weights_only=True, map_location='cpu'))
    saes[L] = sae
Hook Qwen3.6-27B, encode residuals
from transformers import AutoTokenizer, AutoModelForImageTextToText

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3.6-27B', trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    'Qwen/Qwen3.6-27B', dtype=torch.bfloat16, device_map='cuda',
    attn_implementation='sdpa', trust_remote_code=True)

# Hook L31 residual
chunks = []
def hook(mod, inp, out):
    h = out[0] if isinstance(out, tuple) else out
    chunks.append(h[:, -1, :].detach().float().cpu())

layer = model.model.language_model.layers[31]
layer.register_forward_hook(hook)

# Generate, then inspect L31 features
prompt = "The main cause of the Self-Strengthening Movement in China was"
ids = tok(prompt, return_tensors='pt').input_ids.cuda()
with torch.no_grad():
    model.generate(ids, max_new_tokens=200, do_sample=False,
                   pad_token_id=tok.pad_token_id, use_cache=True)

# Analyze features at last prompt token
sae = saes[31].cuda()
z, top_i, top_v = sae.encode(chunks[0].cuda())
top_features = top_i[0].tolist()
print(f'Top-10 active L31 features: {top_features[:10]}')
# e.g., check if f3383 ("domain terminology") activates on this history prompt
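To follow up on the f3383 check suggested in the last comment, the sketch below inspects the encoded activations `z` rather than `top_i`, because the top-k indices can include features whose pre-activation was negative and was clipped to zero by the ReLU. It runs on a toy tensor; the helper names `active_features` and `fires` are hypothetical, and with real data `z` would come from `sae.encode(torch.cat(chunks).cuda())[0]`.

```python
import torch

def active_features(z_row, top=10):
    """(feature_id, activation) pairs with strictly positive activation, strongest first."""
    vals, idx = z_row.sort(descending=True)
    keep = vals > 0
    return list(zip(idx[keep][:top].tolist(), vals[keep][:top].tolist()))

def fires(z, feature_id):
    """Boolean mask over tokens: True where the feature is active."""
    return z[:, feature_id] > 0

# Toy stand-in for encoded activations: 3 tokens, 8 features.
z = torch.zeros(3, 8)
z[0, 5] = 2.0
z[0, 2] = 0.5
z[2, 5] = 1.0

first_token = active_features(z[0])
mask = fires(z, 5).tolist()
```

With `use_cache=True`, the first entry of `chunks` corresponds to the last prompt token and each later entry to one generated token, so the mask indexes positions in that order.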
Limitations
- Per-feature AUROC is weak (max 0.548). Features are interpretable semantically but not strong standalone classifiers. The repo's value is in interpretability primitives, not in providing a correctness predictor.
- Training sample is small (150 rollouts, capped at 200k tokens). A larger training set may yield cleaner features and more discriminative signal.
- Single domain: SuperGPQA STEM+humanities MCQ. Out-of-distribution behavior (code, pure math, creative writing) is untested.
- No causal validation: we report feature correlations with correctness, not ablation effects. Features that correlate may not causally drive reasoning.
- Inference-time steering interventions were explored but are not reported here. The SAE artifacts stand independently of any downstream steering results.
Related work
| reference | relation |
|---|---|
| caiovicentino1/qwen36-feature-circuits | Prior Qwen3.6-35B-A3B MoE SAE study (negative across 4 substrates). Dense 27B here has much cleaner features (+16.5 pp var_expl at L11). |
| adamkarvonen/qwen3-*-saes | Dense Qwen3 SAEs (1.7B–32B). Reference for standard Qwen SAE methodology. |
| Gemma Scope (arXiv:2408.05147) | Canonical large-scale SAE library on Gemma. Our L11 var_expl (93.5%) is on par with their best layers. |
| TopK SAE (arXiv:2406.04093) | Makhzani/OpenAI TopK architecture, the recipe used here. |
Citation
@misc{vicentino2026qwen27bsae,
title = {Qwen3.6-27B Multi-Layer TopK SAE (First SAE on Qwen3.6 Family)},
author = {Vicentino, Caio},
year = {2026},
howpublished = {\url{https://huggingface.co/caiovicentino1/qwen36-27b-sae-multilayer}}
}
License
Apache-2.0 for SAE weights, analysis artifacts, and code. Base model under Apache-2.0 at Qwen/Qwen3.6-27B.



