Qwen3.6-27B Multi-Layer TopK SAE

First public Sparse Autoencoder on the Qwen3.6 family. Three TopK SAEs on the residual stream at layers L11 / L31 / L55 (early / mid / late Gated-Attention positions) of Qwen/Qwen3.6-27B.

TL;DR

property	value
Base model	Qwen/Qwen3.6-27B (dense, 64 layers, hidden 5120)
Layers covered	L11, L31, L55 (all Gated-Attention positions)
Architecture	TopK SAE (recipe from arXiv:2406.04093)
Features per SAE	4096
Active per token (k)	32
Training tokens	200,000 response tokens per layer
Variance explained	L11 93.5% / L31 86.4% / L55 77.9%
License	Apache-2.0

Released 2 days after Qwen3.6-27B (2026-04-21). First SAE on any member of the Qwen3.6 family.

First-in-class claim

Verified via HuggingFace search (April 2026):

0 existing Qwen3.6 family SAEs before this repo (145 Qwen3.6 repos exist — all quantizations, fine-tunes, or base weights; none are SAEs)
Prior Qwen SAE work covers only older dense variants: adamkarvonen's Qwen3 (1.7B–32B), hippotoday/gzxiong Qwen3-4B, kroonen-ai Qwen3.5-9B, and our own Qwen3.5-4B.
This is the first SAE on a hybrid GDN+Gated-Attention architecture with dense FFN.

Training details

Data collection

150 greedy rollouts on SuperGPQA multiple-choice questions (seed 100, from MCR Stage B corpus)
Per-token residuals captured at L11/L31/L55 via forward hooks during generation
Total response tokens: 355,038; capped to 200,000 per layer (random sample) for SAE training
Rollout correctness rate: 67.3% (101/150) — labels stored for downstream feature-correctness analysis
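
The 200,000-token cap described above can be sketched as a uniform random subsample without replacement. This is only an illustration: the function name `sample_token_indices` and the seed are hypothetical, and applying the same indices to all three layers (so rows stay aligned with the labels) is an assumption, not something the card states.

```python
import numpy as np

def sample_token_indices(n_total: int, cap: int, seed: int = 0) -> np.ndarray:
    """Pick at most `cap` token positions uniformly at random, without replacement."""
    rng = np.random.default_rng(seed)  # hypothetical seed, not the one used for the release
    if n_total <= cap:
        return np.arange(n_total)
    # Sort so the kept tokens stay in their original order.
    return np.sort(rng.choice(n_total, size=cap, replace=False))

# 355,038 response tokens were collected; 200,000 are kept per layer.
idx = sample_token_indices(n_total=355_038, cap=200_000)
print(idx.shape)  # (200000,)
```

Indexing the per-layer activation matrices and the label vector with the same `idx` would keep every token's residuals and correctness label in the same row.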

SAE architecture

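import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=4096, k=32):
        super().__init__()
        self.k = k  # number of features kept active per token
        # Zero placeholders; the training-time initialization is not shown in this card.
        self.W_enc = nn.Parameter(torch.zeros(d_in, n))  # [d_in, n]
        self.b_enc = nn.Parameter(torch.zeros(n))        # [n]
        self.W_dec = nn.Parameter(torch.zeros(n, d_in))  # [n, d_in]
        self.b_dec = nn.Parameter(torch.zeros(d_in))     # [d_in]

    def encode(self, x):
        # Center on the decoder bias, project, then keep the k largest pre-activations.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(self.k, dim=-1)
        # ReLU zeroes any negative values among the selected top-k.
        z = torch.zeros_like(pre).scatter_(-1, top_i, F.relu(top_v))
        return z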
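
The class above only shows the encoder. A minimal sketch of the matching decoder and of a variance-explained metric follows. Both are assumptions: the decoder is taken to be the usual linear map `z @ W_dec + b_dec`, and variance explained is taken to be one minus the residual sum of squares over the total sum of squares around the batch mean. The card does not spell out either definition.

```python
import torch

def decode(z, W_dec, b_dec):
    """Linear decoder (assumed form): map sparse codes back to residual space."""
    return z @ W_dec + b_dec

def variance_explained(x, x_hat):
    """1 - (residual sum of squares) / (total sum of squares around the batch mean)."""
    resid = (x - x_hat).pow(2).sum()
    total = (x - x.mean(dim=0, keepdim=True)).pow(2).sum()
    return 1.0 - (resid / total).item()

# Toy check with random data (toy sizes, not the real 5120-dim residuals).
torch.manual_seed(0)
x = torch.randn(16, 8)
mean_only = x.mean(dim=0, keepdim=True).expand_as(x)
print(variance_explained(x, x))          # perfect reconstruction gives 1.0
print(variance_explained(x, mean_only))  # predicting the mean gives 0.0
```

Under this definition, a value such as 0.935 would mean the reconstruction leaves 6.5% of the residual-stream variance unexplained.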

Hyperparameters

param	value
d_model (residual width)	5120
n_features	4096 (0.8× expansion)
k (active features / token)	32
epochs	20
batch size	4096
optimizer	Adam, lr=3e-4
decoder constraint	row-norm projection after each step
dead-feature revival	every 5 epochs if no fire in ≥500 steps
precision	bf16 activations, fp32 SAE weights
hardware	single NVIDIA RTX PRO 6000 Blackwell (96 GB)
wall-clock	~1.5 h for 3 SAEs
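
As a rough illustration of how the hyperparameters above fit together, here is a single training step on toy-sized tensors. Several details are assumptions rather than facts from this card: the loss is plain reconstruction MSE, the "row-norm projection" is read as rescaling every decoder row to unit L2 norm, and the random initialization is arbitrary.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, n, k = 16, 32, 4  # toy sizes; the card uses 5120 / 4096 / 32

W_enc = torch.nn.Parameter(torch.randn(d_in, n) * 0.1)
b_enc = torch.nn.Parameter(torch.zeros(n))
W_dec = torch.nn.Parameter(torch.randn(n, d_in) * 0.1)
b_dec = torch.nn.Parameter(torch.zeros(d_in))
opt = torch.optim.Adam([W_enc, b_enc, W_dec, b_dec], lr=3e-4)

def train_step(x):
    # TopK encode, as in the architecture section.
    pre = (x - b_dec) @ W_enc + b_enc
    top_v, top_i = pre.topk(k, dim=-1)
    z = torch.zeros_like(pre).scatter(-1, top_i, F.relu(top_v))
    # Linear decode and reconstruction loss (assumed: plain MSE).
    x_hat = z @ W_dec + b_dec
    loss = F.mse_loss(x_hat, x)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Decoder constraint: project each row back to unit norm after the step.
    with torch.no_grad():
        W_dec.div_(W_dec.norm(dim=1, keepdim=True).clamp_min(1e-8))
    return loss.item()

loss = train_step(torch.randn(64, d_in))
print(f"loss after one step: {loss:.4f}")
```

Projecting after every optimizer step keeps feature activations comparable in scale, since the decoder cannot shrink a row and compensate with larger activations.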

Dead feature revival counts

During training, revival reinitializes unused decoder rows with random unit vectors:

layer	revived in 20 epochs
L11	5,636 (features churned through many initializations before settling)
L31	894
L55	54

Final healthy features (fire rate ∈ [0.5%, 30%]) per layer: ~750 (L11), ~2,900 (L31), ~3,800 (L55).
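
The revival rule and the "healthy" fire-rate band can be sketched as below. The helper names, the idle-step counter, and the choice to reset the counter after revival are hypothetical; whether encoder weights are also reset is not stated in this card, so the sketch leaves them alone.

```python
import torch

def revive_dead_rows(W_dec, steps_since_fired, threshold=500):
    """Replace decoder rows of features idle for >= threshold steps with random unit vectors."""
    dead = steps_since_fired >= threshold
    n_dead = int(dead.sum())
    if n_dead:
        fresh = torch.randn(n_dead, W_dec.shape[1])
        fresh = fresh / fresh.norm(dim=1, keepdim=True)
        with torch.no_grad():
            W_dec[dead] = fresh
        steps_since_fired[dead] = 0  # assumption: revived features start counting again
    return n_dead

def count_healthy(z, lo=0.005, hi=0.30):
    """Count features whose fire rate (fraction of tokens with z > 0) lies in [lo, hi]."""
    fire_rate = (z > 0).float().mean(dim=0)
    return int(((fire_rate >= lo) & (fire_rate <= hi)).sum())

# Toy example: 6 features, 3 of which have been idle for at least 500 steps.
torch.manual_seed(0)
W_dec = torch.zeros(6, 4)
idle = torch.tensor([0, 600, 10, 500, 499, 1000])
print(revive_dead_rows(W_dec, idle))  # 3
```

Running `count_healthy` on the encoded training tokens of one layer would give a count comparable to the per-layer figures above, assuming the same [0.5%, 30%] band.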

Dense vs MoE — cleaner features

Compared to our prior SAE work on Qwen3.6-35B-A3B MoE (caiovicentino1/qwen36-feature-circuits):

layer	MoE var_expl (35B-A3B)	dense var_expl (27B, this)	Δ
early (L11)	0.77	0.935	+16.5 pp
mid (L17 MoE / L31 dense)	0.82	0.864	+4.4 pp
late (L23 MoE / L55 dense)	0.77	0.779	+0.9 pp

Dense 27B residual streams are more compactly represented by TopK SAE at early layers. The MoE case likely fragments features across 256 experts (9 active per token), whereas dense FFN contributes a single unified computation per layer.

This is consistent with the hypothesis that the apparent "illusory circuits" reported for MoE in our prior 35B paper stem from expert-routing fragmentation rather than from the hybrid architecture itself.

Feature analysis

Per-feature correctness AUROC

For each SAE, we encoded the 200k training tokens and computed per-feature binary AUROC for correctness (whether the token's rollout was correct or wrong). The maximum AUROC per layer is modest, and every AUROC reported in this section falls within 0.47–0.55:

layer	max AUROC	feature	fire rate	direction
L11	0.535	f4053	12%	correct↑
L31	0.548	f3383	52%	correct↑
L55	0.509	f590	2.2%	correct↑

Individual features are weakly predictive on their own — none function as a standalone classifier. The value is in interpretable token-activation semantics below.
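
A per-feature AUROC of this kind can be computed with a rank-based (Mann-Whitney) estimator. The sketch below is one way to do it, not necessarily the code used for this release. It uses average ranks for ties, which matters here because most SAE activations are exactly zero. The function names are hypothetical.

```python
import numpy as np

def average_ranks(a):
    """1-based ranks of `a`, with tied values sharing their average rank."""
    order = np.argsort(a, kind="mergesort")
    sorted_a = a[order]
    _, first, counts = np.unique(sorted_a, return_index=True, return_counts=True)
    avg = first + (counts + 1) / 2.0  # average rank of each tie group
    ranks = np.empty(len(a), dtype=float)
    ranks[order] = np.repeat(avg, counts)
    return ranks

def feature_auroc(activation, label):
    """AUROC of one feature's activation as a score for label == 1 (correct rollout)."""
    label = np.asarray(label).astype(bool)
    n_pos, n_neg = label.sum(), (~label).sum()
    ranks = average_ranks(np.asarray(activation, dtype=float))
    u = ranks[label].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

# Toy data: a feature that fires only on tokens from correct rollouts.
print(feature_auroc([0.0, 0.0, 1.0, 2.0], [0, 0, 1, 1]))  # 1.0
# A feature that never fires carries no signal.
print(feature_auroc([0.0, 0.0, 0.0, 0.0], [0, 0, 1, 1]))  # 0.5
```

With this convention, values above 0.5 correspond to the correct↑ direction in the table and values below 0.5 to wrong↑.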

Semantic discoveries

Tracing top-activating tokens per feature yields coherent semantic categories:

L11 f2503 (wrong↑, AUROC 0.494) — fires on: "perfectly", "plausible", "matches perfectly". Overconfidence marker — when the model writes these words, it often precedes a wrong answer. Analogous to human "cognitive ease" biases.
L31 f3383 (correct↑, AUROC 0.548) — fires on: "Fleet", "Arsenal", "AHP", "Zeng Guofan", "Zuo Zongtang". Specific domain terminology recognition. When the model retrieves domain-specific names or technical terms, reasoning aligns with correct answers.
L31 f3078 (correct↑) — fires on historical named entities: "Guofan", "Zongtang". Factual recall channel.
L31 f452 (wrong↑, AUROC 0.475) — fires on generic domain words: "engineering", "textbook", "music". Surface-level pattern matching — model is confusing domains, pre-error signal.
L55 f2897 (95% fire rate on generic tokens "a", "the", "not") — effectively a noise-floor feature, not informative.

These semantic patterns are visible at every layer even though classifier performance is weak. They suggest that the SAEs learn coherent, interpretable structure from the dense 27B residual stream.
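
Tracing top-activating tokens amounts to sorting one feature's activations and reading off the corresponding token strings. A minimal sketch follows. The helper name is hypothetical and the activation values are made up for illustration; only the token strings are taken from the f3383 description above.

```python
import numpy as np

def top_activating_tokens(z_feature, tokens, top_n=5):
    """Return (token, activation) pairs for the strongest activations of one feature."""
    order = np.argsort(-z_feature)[:top_n]
    return [(tokens[i], float(z_feature[i])) for i in order if z_feature[i] > 0]

tokens = ["the", "Fleet", "a", "Arsenal", "AHP"]
acts = np.array([0.0, 3.2, 0.0, 2.5, 1.1])  # illustrative values, not real activations
print(top_activating_tokens(acts, tokens, top_n=3))
# [('Fleet', 3.2), ('Arsenal', 2.5), ('AHP', 1.1)]
```

In practice `z_feature` would be one column of the encoded activations for a layer, with `tokens` holding the decoded token string at each row.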

Files

qwen36-27b-sae-multilayer/
├── sae_L11_n4096_k32.pt          (160 MB, SAE state dict)
├── sae_L31_n4096_k32.pt          (160 MB)
├── sae_L55_n4096_k32.pt          (160 MB)
├── activations.safetensors       (5.8 GB, 200k × 3 layers × 5120 dim, fp16)
├── labels.npy                    (per-token correctness labels)
├── feature_stats.json            (AUROC + fire rate for all 4096 features per layer)
├── feature_characterization.json (top-activating tokens for each top feature)
├── final_summary.json            (training summary)
└── 01-05_*.png                   (analysis charts)

Usage

Load SAE for feature analysis

import torch
import torch.nn as nn
from huggingface_hub import snapshot_download

sae_dir = snapshot_download('caiovicentino1/qwen36-27b-sae-multilayer',
                             allow_patterns=['sae_L*.pt'])

class TopKSAE(nn.Module):
    def __init__(self, d_in=5120, n=4096):
        super().__init__()
        self.W_enc = nn.Parameter(torch.zeros(d_in, n))
        self.b_enc = nn.Parameter(torch.zeros(n))
        self.W_dec = nn.Parameter(torch.zeros(n, d_in))
        self.b_dec = nn.Parameter(torch.zeros(d_in))

    def encode(self, x, k=32):
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        top_v, top_i = pre.topk(k, dim=-1)
        z = torch.zeros_like(pre).scatter_(-1, top_i, torch.relu(top_v))
        return z, top_i, top_v

saes = {}
for L in [11, 31, 55]:
    sae = TopKSAE()
    sae.load_state_dict(torch.load(f'{sae_dir}/sae_L{L}_n4096_k32.pt',
                                     weights_only=True, map_location='cpu'))
    saes[L] = sae

Hook Qwen3.6-27B, encode residuals

from transformers import AutoTokenizer, AutoModelForImageTextToText

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3.6-27B', trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    'Qwen/Qwen3.6-27B', dtype=torch.bfloat16, device_map='cuda',
    attn_implementation='sdpa', trust_remote_code=True)

# Hook the L31 residual: keep the hidden state at the last position of every forward pass
chunks = []
def hook(mod, inp, out):
    h = out[0] if isinstance(out, tuple) else out
    chunks.append(h[:, -1, :].detach().float().cpu())
layer = model.model.language_model.layers[31]
handle = layer.register_forward_hook(hook)

# Generate, then inspect L31 features
prompt = "The main cause of the Self-Strengthening Movement in China was"
ids = tok(prompt, return_tensors='pt').input_ids.cuda()
with torch.no_grad():
    model.generate(ids, max_new_tokens=200, do_sample=False,
                    pad_token_id=tok.pad_token_id, use_cache=True)
handle.remove()  # detach the hook so later forward passes are not recorded

# Analyze features at last prompt token
sae = saes[31].cuda()
z, top_i, top_v = sae.encode(chunks[0].cuda())
top_features = top_i[0].tolist()
print(f'Top-10 active L31 features: {top_features[:10]}')
# e.g., check if f3383 ("domain terminology") activates on this history prompt
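
The hook pattern above can be tried without loading the 27B model. The sketch below uses a small `nn.Sequential` as a stand-in (it is not Qwen and has no attention or cache) to show what the hook records and that removing the handle stops the recording.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Stand-in for a transformer layer stack; shapes are [batch, seq, hidden].
toy = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))

captured = []
def hook(mod, inp, out):
    h = out[0] if isinstance(out, tuple) else out
    captured.append(h[:, -1, :].detach().float().cpu())  # last position only

handle = toy[0].register_forward_hook(hook)
with torch.no_grad():
    toy(torch.randn(2, 5, 8))  # recorded
handle.remove()
with torch.no_grad():
    toy(torch.randn(2, 5, 8))  # not recorded, the hook is gone

print(len(captured), captured[0].shape)  # 1 torch.Size([2, 8])
```

With the real model and `use_cache=True`, the first captured entry comes from the prompt pass and each later entry from one generated token, so the list index maps onto generation steps.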

Limitations

Per-feature AUROC is weak (max 0.548). Features are interpretable semantically but not strong standalone classifiers. The repo's value is in interpretability primitives, not in providing a correctness predictor.
Training sample is small (150 rollouts, capped at 200k tokens per layer). A larger training set may yield cleaner features and a more discriminative signal.
Single domain — SuperGPQA STEM+humanities MCQ. Out-of-distribution behavior (code, pure math, creative writing) untested.
No causal validation — we report feature correlations with correctness, not ablation effects. Features that correlate may not causally drive reasoning.
Inference-time steering interventions were explored but are not reported here. The SAE artifacts stand independently of any downstream steering results.

Related work

reference	relation
caiovicentino1/qwen36-feature-circuits	Prior Qwen3.6-35B-A3B MoE SAE study (negative across 4 substrates). Dense 27B here has much cleaner features (+16.5 pp var_expl at L11).
adamkarvonen/qwen3-*-saes	Dense Qwen3 SAEs (1.7B–32B). Reference for standard Qwen SAE methodology.
Gemma Scope (arXiv:2408.05147)	Canonical large-scale SAE library on Gemma. Our L11 var_expl (93.5%) is on par with their best layers.
TopK SAE (arXiv:2406.04093)	Makhzani/OpenAI TopK architecture — the recipe used here.

Citation

@misc{vicentino2026qwen27bsae,
  title   = {Qwen3.6-27B Multi-Layer TopK SAE (First SAE on Qwen3.6 Family)},
  author  = {Vicentino, Caio},
  year    = {2026},
  howpublished = {\url{https://huggingface.co/caiovicentino1/qwen36-27b-sae-multilayer}}
}

License

Apache-2.0 for SAE weights, analysis artifacts, and code. Base model under Apache-2.0 at Qwen/Qwen3.6-27B.

Papers

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2 (arXiv:2408.05147, published Aug 9, 2024)
Scaling and evaluating sparse autoencoders (arXiv:2406.04093, published Jun 6, 2024)