KL3M 170M, 6th Gen Model, 33K Checkpoint
A 170M parameter language model trained on legal agreements using the Muon optimizer with spectral clamping.
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 181.7M (170M non-embedding)
- Training Steps: 33,000
- Sequence Length: 4,096 tokens
- Precision: BF16
- Optimizer: Muon with spectral regularization (max condition: 2000)
Model Architecture
- Hidden Size: 576
- Layers: 30
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536
- Vocabulary: 131,072 tokens
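As a sanity check on the figures above, the total parameter count can be reproduced from these dimensions. The sketch below assumes a standard Llama-style layer (Q/K/V/O attention projections, gate/up/down MLP projections, two RMSNorms per layer) with tied input/output embeddings; tying is an assumption, but it is what makes the total land at 181.7M.

```python
# Rough parameter count from the architecture table above.
# Assumes a Llama-style block and tied input/output embeddings.
hidden, layers, heads, kv_heads = 576, 30, 9, 3
intermediate, vocab = 1536, 131072
head_dim = hidden // heads  # 64

embed = vocab * hidden  # shared with the LM head if embeddings are tied
attn = (hidden * hidden                      # Q projection
        + 2 * hidden * (kv_heads * head_dim)  # K and V (GQA: 3 KV heads)
        + hidden * hidden)                   # O projection
mlp = 3 * hidden * intermediate              # gate, up, down projections
norms = 2 * hidden                           # two RMSNorm weight vectors
per_layer = attn + mlp + norms

total = embed + layers * per_layer + hidden  # + final RMSNorm
print(f"{total / 1e6:.1f}M")  # 181.7M
```

With untied embeddings the total would be roughly 257M, so the 181.7M figure implies weight tying.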
Training Configuration
- Dataset: Legal agreements (EDGAR filings)
- Optimizer: Muon with momentum 0.95
- Muon Learning Rate: 8e-5 (depth-scaled)
- Auxiliary Learning Rate: 4e-5
- Batch Size: 1 per device (effective 4 with gradient accumulation)
- Gradient Accumulation Steps: 4
- Warmup Steps: 10,000
- LR Scheduler: Cosine with warmup
- Weight Decay: Muon 1e-5, Auxiliary 0.001
- Spectral Clamping: Enabled (max condition 2000, sigma floor 6e-4, every 10 steps)
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled
- Additional Regularization:
  - Entropy bonus weight: 0.001 (target: 6.5 bits)
  - Activation norm weight: 0.001
  - Loss chunk tokens: 1024
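The spectral clamping entry above can be illustrated with a small sketch. This is a hypothetical reconstruction, not the actual training code: every 10 steps, each weight matrix's singular values would be floored so that the condition number (σ_max / σ_min) stays at or below 2000 and no singular value falls under 6e-4.

```python
def clamp_spectrum(sigmas, max_condition=2000.0, sigma_floor=6e-4):
    """Floor singular values so that cond = s_max / s_min <= max_condition
    and no singular value falls below sigma_floor.

    Illustrative only; names and signature are assumptions, not the
    training code. In practice the clamped spectrum would be used to
    reassemble the weight as W = U @ diag(s) @ V^T after an SVD.
    """
    s_max = max(sigmas)
    lower = max(s_max / max_condition, sigma_floor)
    return [max(s, lower) for s in sigmas]

# A spectrum with condition number 1e6 is pulled back to exactly 2000
clamped = clamp_spectrum([2.0, 1.0, 2e-6])
print(max(clamped) / min(clamped))  # 2000.0
```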
Spectral Health (Step 33K)
- Attention Layers: Median condition number 237.75 ✓ EXCELLENT
- MLP Layers: Median condition number 4.58 ✓ EXCELLENT
- Max Attention Condition: 2208.16 (slightly above the clamp ceiling of 2000; clamping is applied every 10 steps, so transient overshoot between clamps is expected)
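The condition numbers reported above can be measured directly from a checkpoint's weight matrices. A minimal NumPy sketch follows; the diagnostic tooling behind this card is not published, so the function name and the random example matrix are illustrative.

```python
import numpy as np

def condition_number(weight):
    """Ratio of largest to smallest singular value of a 2-D weight matrix."""
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    return float(s[0] / s[-1])

# Example on a random matrix shaped like a KL3M attention projection (576 x 576)
rng = np.random.default_rng(0)
w = rng.standard_normal((576, 576)) / 576**0.5
print(f"cond = {condition_number(w):.1f}")
```

The reported medians would come from applying this per layer and taking the median separately over attention and MLP projections.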
Generation Quality
Generates coherent, fluent legal text with no observed repetition issues at this checkpoint. Best suited for legal and contractual content.
Usage
```python
from transformers import pipeline

# Create a text-generation pipeline for this checkpoint
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-33000",
    torch_dtype="auto",
    device_map="auto",
)

# Generate a legal-text continuation
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
)
print(outputs[0]["generated_text"])
```
Citation
For technical details, see the paper: https://arxiv.org/abs/2504.07854
```bibtex
@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer and spectral clamping}
}
```
License
Apache 2.0