Butanium's picture
Upload README.md with huggingface_hub
0160fd9 verified
# 4-Layer 8-Head Attention-Only Transformer
This is a simplified transformer model with 4 attention layer(s) and 8 attention head(s), hidden size 128, designed for studying attention mechanisms in isolation.
## Architecture Differences from Vanilla Transformer
**Removed Components:**
- **No MLP/Feed-Forward layers** - Only attention layers
- **No Layer Normalization** - No LayerNorm before/after attention
- **No positional encoding** - No position embeddings of any kind
**Kept Components:**
- Token embeddings
- Multi-head self-attention with causal masking
- Residual connections around attention layers
- Language modeling head (linear projection to vocabulary)
This minimal architecture isolates the attention mechanism, making it useful for mechanistic interpretability research as described in [A Mathematical Framework for Transformer Circuits](https://transformer-circuits.pub/2021/framework/index.html).
## Usage
```python
config_class = LlamaConfig
def __init__(self, config: LlamaConfig):
super().__init__(config)
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
self.layers = nn.ModuleList([AttentionLayer(config) for _ in range(config.num_hidden_layers)])
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
model = AttentionOnlyTransformer.from_pretrained('Butanium/simple-stories-4L8H128D-attention-only-toy-transformer')
```
## Training Data
The model is trained on the [SimpleStories dataset](https://huggingface.co/datasets/SimpleStories/SimpleStories) for next-token prediction.