# 4-Layer 8-Head Attention-Only Transformer

This is a simplified transformer with 4 attention layers, 8 attention heads, and a hidden size of 128, designed for studying attention mechanisms in isolation.
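
As a rough sanity check on the size these numbers imply, the sketch below counts parameters. It rests on several assumptions not stated here: each layer has four bias-free `hidden_size x hidden_size` projections (query, key, value, output), the heads split the hidden size evenly, and the token embedding and LM head are separate (untied) matrices. The helper name and the `vocab_size` value are placeholders, since the vocabulary size is not given in this card.

```python
def attention_only_param_count(vocab_size: int, hidden_size: int = 128,
                               num_layers: int = 4, num_heads: int = 8) -> dict:
    """Count parameters under the assumptions above (hypothetical helper)."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    head_dim = hidden_size // num_heads  # 128 // 8 = 16
    # Assumption: four bias-free hidden x hidden projections (Q, K, V, O) per layer.
    attention = num_layers * 4 * hidden_size * hidden_size
    # Assumption: token embedding and LM head are untied.
    embedding = vocab_size * hidden_size
    lm_head = hidden_size * vocab_size
    return {
        "head_dim": head_dim,
        "attention": attention,
        "embedding": embedding,
        "lm_head": lm_head,
        "total": attention + embedding + lm_head,
    }


counts = attention_only_param_count(vocab_size=4096)  # 4096 is a placeholder
print(counts)
```

Under these assumptions the attention weights contribute 262,144 parameters regardless of vocabulary size, so the embedding and LM head dominate the total for any realistic vocabulary.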

## Architecture Differences from Vanilla Transformer

**Removed Components:**

- **No MLP/Feed-Forward layers** - Only attention layers
- **No Layer Normalization** - No LayerNorm before/after attention
- **No positional encoding** - No position embeddings of any kind

**Kept Components:**

- Token embeddings
- Multi-head self-attention with causal masking
- Residual connections around attention layers
- Language modeling head (linear projection to vocabulary)

This minimal architecture isolates the attention mechanism, making it useful for mechanistic interpretability research as described in [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html).
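
To make the component list concrete, here is a minimal PyTorch sketch of what one such block could look like: causal multi-head self-attention wrapped in a residual connection, with no LayerNorm, no MLP, and no positional information. `AttentionLayerSketch` is a hypothetical stand-in and not the model's actual `AttentionLayer`; the bias-free projections and the even split of the hidden size across heads are assumptions.

```python
import math

import torch
import torch.nn as nn


class AttentionLayerSketch(nn.Module):
    """Hypothetical attention-only block computing x + Attn(x)."""

    def __init__(self, hidden_size: int = 128, num_heads: int = 8):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Assumption: bias-free projections, as in Llama-style attention.
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, hidden = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, hidden) -> (batch, heads, seq, head_dim)
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product scores; no rotary or learned positions are applied.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        # Causal mask: position i may not attend to positions j > i.
        future = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        weights = scores.masked_fill(future, float("-inf")).softmax(dim=-1)

        mixed = (weights @ v).transpose(1, 2).reshape(batch, seq_len, hidden)
        # Residual connection, with no LayerNorm on either side.
        return x + self.o_proj(mixed)


torch.manual_seed(0)
layer = AttentionLayerSketch()
x = torch.randn(2, 10, 128)
with torch.no_grad():
    out = layer(x)
print(out.shape)  # torch.Size([2, 10, 128])
```

Because of the causal mask, changing the last token's input leaves the outputs at all earlier positions unchanged, which is a quick way to check that the mask is applied correctly.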
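
## Usage

```python
import torch.nn as nn
from transformers import LlamaConfig, PreTrainedModel


# Excerpt of the model class; AttentionLayer and forward() are not shown here.
# The PreTrainedModel base class is assumed from config_class and from_pretrained.
class AttentionOnlyTransformer(PreTrainedModel):
    config_class = LlamaConfig

    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([AttentionLayer(config) for _ in range(config.num_hidden_layers)])
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)


# Load the pretrained weights from the Hugging Face Hub.
model = AttentionOnlyTransformer.from_pretrained('Butanium/simple-stories-4L8H128D-attention-only-toy-transformer')
```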

## Training Data

The model is trained on the [SimpleStories dataset](https://huggingface.co/datasets/SimpleStories/SimpleStories) for next-token prediction.
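
For readers unfamiliar with the objective, next-token prediction is usually implemented as cross-entropy between the logits at position `t` and the token at position `t + 1`. The sketch below illustrates that shift; it is not the training code used for this model, and the function name, batch shape, and vocabulary size of 50 are placeholders.

```python
import math

import torch
import torch.nn.functional as F


def next_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Mean cross-entropy between the logits at position t and token t + 1."""
    vocab_size = logits.size(-1)
    # Drop the last position's logits (it has no next token to predict)
    # and the first token (nothing precedes it, so it is never a target).
    shifted_logits = logits[:, :-1, :].reshape(-1, vocab_size)
    targets = input_ids[:, 1:].reshape(-1)
    return F.cross_entropy(shifted_logits, targets)


torch.manual_seed(0)
input_ids = torch.randint(0, 50, (2, 8))  # toy batch of token ids
logits = torch.zeros(2, 8, 50)            # uniform predictions over 50 tokens
loss = next_token_loss(logits, input_ids)
print(loss.item(), math.log(50))
```

With uniform logits the loss equals the log of the vocabulary size, which gives a useful baseline when reading training curves.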