MiniTransformer v3
A small educational transformer model trained from scratch for text generation tasks.
Model Description
MiniTransformer is a compact transformer architecture designed for educational purposes and experimentation. The model is trained on question-answer pairs with various system prompts to demonstrate fundamental transformer capabilities.
This is an educational model: it is intended to help you understand transformer architectures and training processes, and is not suitable for production use.
Architecture
- Parameters: 43.9M
- Architecture: Decoder-only transformer
- Embedding Dimension: 512
- Attention Heads: 4
- Layers: 4
- Context Length: 128 tokens
- Vocabulary: BERT tokenizer (bert-base-uncased, 30,522 tokens)
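For orientation, here is a minimal sketch of how these hyperparameters might be grouped into a configuration object. The class and field names are illustrative and are not the model's actual configuration API.

```python
from dataclasses import dataclass

@dataclass
class MiniTransformerConfig:
    """Hypothetical config mirroring the numbers listed above."""
    vocab_size: int = 30522      # BERT uncased tokenizer vocabulary
    embed_dim: int = 512         # embedding dimension
    num_heads: int = 4           # attention heads
    num_layers: int = 4          # decoder-only transformer blocks
    context_length: int = 128    # maximum sequence length in tokens
```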
Training Details
Training Data
- Generic question-answer pairs with diverse system prompts
- Trained using a sliding-window approach with a stride of 32 tokens (see the chunking sketch after this list)
- Train/test split: 90/10
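A rough sketch of the sliding-window chunking described above, assuming a pre-tokenized sequence of token IDs, the 128-token context length, and a stride of 32. The helper name is illustrative and is not the model's actual data pipeline.

```python
def sliding_window_chunks(token_ids, context_length=128, stride=32):
    """Split a long token sequence into overlapping fixed-length windows."""
    chunks = []
    last_start = max(len(token_ids) - context_length, 0)
    for start in range(0, last_start + 1, stride):
        window = token_ids[start:start + context_length]
        if len(window) == context_length:   # drop a trailing short window
            chunks.append(window)
    return chunks

# Illustrative 90/10 train/test split over the resulting windows:
# split = int(0.9 * len(chunks))
# train_chunks, test_chunks = chunks[:split], chunks[split:]
```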
Training Procedure
- Optimizer: AdamW (fused, learning rate: 3e-4)
- Batch Size: 128
- Epochs: 50
- Mixed Precision: FP16 (AMP enabled; see the training-loop sketch after this list)
- Hardware: NVIDIA A10 GPU
- Final Train Loss: 0.0024
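The following is a minimal sketch of a training loop matching the settings above (fused AdamW, learning rate 3e-4, FP16 autocast with gradient scaling). It assumes `model` and `train_loader` already exist; it is not the actual training script.

```python
import torch
import torch.nn.functional as F

device = "cuda"
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)
scaler = torch.cuda.amp.GradScaler()

for epoch in range(50):
    for input_ids, targets in train_loader:
        input_ids, targets = input_ids.to(device), targets.to(device)
        optimizer.zero_grad(set_to_none=True)
        # FP16 autocast for the forward pass and loss computation
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(input_ids)                       # (batch, seq, vocab)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                                   targets.view(-1))
        # Scaled backward pass and optimizer step
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```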
Framework
- PyTorch 2.0+ with torch.compile() optimization
- Transformers library tokenizer
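As a sketch, compiling the model with PyTorch 2.0's torch.compile looks like this (assuming `model` is an instantiated MiniTransformer):

```python
import torch

compiled_model = torch.compile(model)  # JIT-compiles the forward pass on first call
```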
Usage
import torch
from transformers import AutoTokenizer
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Load model (you'll need to download the checkpoint)
# model = MiniTransformer(...)
# model.load_state_dict(torch.load("checkpoint.pt"))
# Generate text
input_text = "Your prompt here"
input_ids = tokenizer.encode(input_text, return_tensors="pt")
# Generation: see the greedy-decoding sketch below
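To complete the example, here is a hedged greedy-decoding sketch. It assumes the loaded model maps a batch of token IDs to next-token logits of shape (batch, seq_len, vocab_size) and that the BERT [SEP] token serves as an end-of-sequence marker; the repository's own generation utilities may differ.

```python
# Greedy decoding sketch (illustrative; assumes model(input_ids) -> logits)
model.eval()
generated = input_ids
with torch.no_grad():
    for _ in range(50):                           # generate up to 50 new tokens
        context = generated[:, -128:]             # stay within the 128-token context
        logits = model(context)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=1)
        if next_token.item() == tokenizer.sep_token_id:   # stop on [SEP], if used as EOS
            break

print(tokenizer.decode(generated[0], skip_special_tokens=True))
```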