# 4-Layer 8-Head Attention-Only Transformer

This is a simplified transformer model with 4 attention layers, 8 attention heads, and a hidden size of 128, designed for studying attention mechanisms in isolation.

## Architecture Differences from Vanilla Transformer

**Removed Components:**
- **No MLP/Feed-Forward layers** - Only attention layers
- **No Layer Normalization** - No LayerNorm before/after attention
- **No positional encoding** - No position embeddings of any kind

**Kept Components:**
- Token embeddings
- Multi-head self-attention with causal masking
- Residual connections around attention layers
- Language modeling head (linear projection to vocabulary)

This minimal architecture isolates the attention mechanism, making it useful for mechanistic interpretability research as described in [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html).

## Usage

```python
import torch.nn as nn
from transformers import LlamaConfig, PreTrainedModel

# Excerpt of the model class. The base class is assumed to be PreTrainedModel;
# `AttentionLayer` and the forward pass are not shown in this README.
class AttentionOnlyTransformer(PreTrainedModel):
    config_class = LlamaConfig

    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([AttentionLayer(config) for _ in range(config.num_hidden_layers)])
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

# Load the pretrained weights from the Hugging Face Hub.
model = AttentionOnlyTransformer.from_pretrained('Butanium/simple-stories-4L8H128D-attention-only-toy-transformer')
```

## Training Data

The model is trained on the [SimpleStories dataset](https://huggingface.co/datasets/SimpleStories/SimpleStories) for next-token prediction.
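
## Architecture Sketch

To make the components listed under "Kept Components" concrete, here is a minimal, dependency-free sketch of a single attention-only layer: multi-head causal self-attention followed by a residual add, with no LayerNorm, MLP, or positional encoding. It is an illustration, not this model's actual `AttentionLayer`. The helper names (`matmul`, `causal_head`, `attention_layer`, `rand_matrix`), the bias-free Q/K/V/O projections, and the `1/sqrt(head_dim)` scaling are assumptions. The toy sizes (hidden size 8, 2 heads) stand in for the real 128 and 8, which would give a head dimension of 16.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_head(x, w_q, w_k, w_v):
    """One attention head in which position i attends only to positions 0..i."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))  # assumed standard scaling
    out = []
    for i, q_i in enumerate(q):
        # Causal mask: only keys at positions <= i are scored at all.
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j])) for j in range(i + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(len(v[0]))])
    return out


def attention_layer(x, heads, w_o):
    """Attention-only layer: x + concat(head outputs) @ W_O, nothing else."""
    head_outs = [causal_head(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    concat = [sum((h[i] for h in head_outs), []) for i in range(len(x))]
    attn = matmul(concat, w_o)
    # Residual connection around the attention block.
    return [[xi + ai for xi, ai in zip(x_row, a_row)] for x_row, a_row in zip(x, attn)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1] (toy weights, not trained ones)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_size, num_heads, seq_len = 8, 2, 3  # toy sizes for illustration
head_dim = hidden_size // num_heads
heads = [tuple(rand_matrix(hidden_size, head_dim, rng) for _ in range(3)) for _ in range(num_heads)]
w_o = rand_matrix(hidden_size, hidden_size, rng)
x = rand_matrix(seq_len, hidden_size, rng)  # stand-in for token embeddings
y = attention_layer(x, heads, w_o)

# Replace the last token's embedding and rerun the layer.
x_changed = [row[:] for row in x]
x_changed[-1] = rand_matrix(1, hidden_size, rng)[0]
y_changed = attention_layer(x_changed, heads, w_o)

# Causal masking: outputs at earlier positions do not depend on the last token.
assert y[:-1] == y_changed[:-1]
```

The final assertion checks the defining property of causal masking: changing a later token leaves the outputs at all earlier positions unchanged. Stacking four such layers between a token embedding and a linear projection to the vocabulary would mirror the overall structure described above.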