KL3M 170M, 6th Gen Model, 33K Checkpoint

A 170M parameter language model trained on legal agreements using the Muon optimizer with spectral clamping.

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 181.7M (170M non-embedding)
  • Training Steps: 33,000
  • Sequence Length: 4,096 tokens
  • Precision: BF16
  • Optimizer: Muon with spectral regularization (max condition: 2000)

Model Architecture

  • Hidden Size: 576
  • Layers: 30
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536
  • Vocabulary: 131,072 tokens
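Given standard Llama layer shapes, the parameter total above can be checked with a quick back-of-the-envelope count. This is a sketch that assumes tied input/output embeddings and standard Llama projections (the card does not state these explicitly):

```python
# Back-of-the-envelope parameter count for the architecture above,
# assuming standard Llama layers with tied input/output embeddings.
hidden, layers, heads, kv_heads = 576, 30, 9, 3
inter, vocab = 1536, 131072
head_dim = hidden // heads                  # 64

attn = hidden * hidden * 2                  # q_proj + o_proj
attn += hidden * (kv_heads * head_dim) * 2  # k_proj + v_proj (GQA)
mlp = hidden * inter * 3                    # gate, up, down projections
norms = 2 * hidden                          # two RMSNorms per layer

embed = vocab * hidden                      # shared with lm_head if tied
total = layers * (attn + mlp + norms) + hidden + embed
print(f"{total / 1e6:.1f}M parameters")     # → 181.7M
```

This reproduces the 181.7M total reported under Model Details.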

Training Configuration

  • Dataset: Legal agreements (EDGAR filings)
  • Optimizer: Muon with momentum 0.95
  • Muon Learning Rate: 8e-5 (depth-scaled)
  • Auxiliary Learning Rate: 4e-5
  • Batch Size: 1 per device (effective 4 with gradient accumulation)
  • Gradient Accumulation Steps: 4
  • Warmup Steps: 10,000
  • LR Scheduler: Cosine with warmup
  • Weight Decay: Muon 1e-5, Auxiliary 0.001
  • Spectral Clamping: Enabled (max condition 2000, sigma floor 6e-4, every 10 steps)
  • Mixed Precision: BF16
  • Gradient Checkpointing: Enabled
  • Additional Regularization:
    • Entropy bonus weight: 0.001 (target: 6.5 bits)
    • Activation norm weight: 0.001
    • Loss chunk tokens: 1024

Spectral Health (Step 33K)

  • Attention Layers: Median condition number 237.75 ✓ EXCELLENT
  • MLP Layers: Median condition number 4.58 ✓ EXCELLENT
  • Max Attention Condition: 2208.16 (at spectral clamp ceiling)
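The health numbers above are condition numbers of the checkpoint's weight matrices (ratio of largest to smallest singular value), reported as medians per layer group. A minimal sketch of the measurement, using random stand-in matrices since loading the actual q/k/v/o projections is omitted here:

```python
import numpy as np

def condition_number(W):
    """Ratio of largest to smallest singular value of W."""
    s = np.linalg.svd(W, compute_uv=False)
    return s.max() / s.min()

# Toy stand-ins; in practice these would be the checkpoint's
# attention projection weights, one matrix per layer.
rng = np.random.default_rng(0)
attn_weights = [rng.standard_normal((576, 576)) for _ in range(4)]
med = np.median([condition_number(W) for W in attn_weights])
print(f"median attention condition: {med:.2f}")
```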

Generation Quality

The model generates coherent, fluent legal text with no observed repetition issues. It is best suited to legal and contractual content.

Usage

from transformers import pipeline

# Create text generation pipeline
generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-006-170m-checkpoint-33000",
    torch_dtype="auto",
    device_map="auto"
)

# Generate text
outputs = generator(
    "This Agreement is entered into as of",
    max_new_tokens=100,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15
)

print(outputs[0]['generated_text'])

Citation

For technical details, see the paper: https://arxiv.org/abs/2504.07854

@misc{kl3m2025,
  title={KL3M: Knowledge-Guided Language Model Training},
  author={Alea Institute},
  year={2025},
  url={https://arxiv.org/abs/2504.07854},
  note={Trained with Muon optimizer and spectral clamping}
}

License

Apache 2.0
