KL3M 500M, 7th Gen Model, Checkpoint 15500 (4x Stacked G_stack Training)
A 500M parameter language model trained on multi-domain legal text using G_stack depth expansion (4x cyclic layer duplication from 170M baseline). This checkpoint represents 15,500 steps of continued training on the 120-layer architecture with Muon optimizer and Phase A improvements.
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 500.3M (487M non-embedding)
- Layers: 120 (4x stacked from 30)
- Source: alea-institute/kl3m-007-500m-step0 (stacked from kl3m-006-170m-checkpoint-63000)
- Training Steps: 15,500 (post-stacking)
- Tokens Processed: 2.16 billion (~139K tokens/step at 4,096 context)
- Sequence Length: 4,096 tokens
- Precision: BF16
Model Architecture
- Hidden Size: 576 (unchanged from 170M source)
- Layers: 120 (4× cyclic duplication from 30)
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536 (unchanged from source)
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
- Parameter Growth: 181.7M → 500.3M (2.75× increase)
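For reference, the architecture above can be expressed as a stock transformers LlamaConfig. This is a sketch for readers who want to inspect the shapes; the field names are the standard transformers ones, and max_position_embeddings is an assumption based on the 4,096-token context length rather than a value copied from the released config.

```python
from transformers import LlamaConfig

# Sketch of the architecture listed above using standard LlamaConfig fields.
config = LlamaConfig(
    hidden_size=576,
    num_hidden_layers=120,
    num_attention_heads=9,
    num_key_value_heads=3,          # GQA: 3 KV heads
    intermediate_size=1536,
    vocab_size=131072,
    rope_theta=100000.0,
    max_position_embeddings=4096,   # assumption: matches the 4,096-token context
)
```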
Training Progress
Training Metrics at Step 15500
- Loss: 2.24 (instantaneous)
- Loss (10-step average): 2.50
- Loss (100-step average): 2.62
- Learning Rate: 0.000254 (cosine decay from depth-scaled 0.000365)
- Gradient Norm: 1.51 (mean: 2.27, max: 15.74 in window 15400-15600)
- Gradient Clipping Events: 0% (stable training)
- Cumulative Tokens: 2,160,107,520
- Cumulative Samples: 527,370
Loss Trajectory
| Step Range | Loss (100-avg) | Notes |
|---|---|---|
| 1-100 | ~7.66 | Baseline (stacked initialization) |
| 5000-5100 | ~2.63 | Layers diverging, specializing |
| 11000-12000 | ~2.52 | Continued improvement |
| 14000-15000 | ~2.63 | Stabilizing |
| 15400-15600 | 2.62-2.81 | Current |
Overall improvement: ~65% reduction in loss from initialization to step 15500.
G_stack Training Configuration
Stacking Method
Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):
Stacking Pattern (G_stack = G_direct depthwise):
Source (30 layers): [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
└── cyclic repetition 4 times (G_stack / G_direct)
Method: Direct duplication in depthwise manner (G_stack operator from paper)
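A minimal sketch of the cyclic duplication, assuming a Llama-style model whose decoder layers live at model.model.layers; the actual stacking script used to produce kl3m-007-500m-step0 is not included in this card.

```python
import copy
import torch.nn as nn

def g_stack(model, growth_factor: int = 4):
    """Cyclic depthwise duplication (G_stack): repeat the source layer list
    growth_factor times, deep-copying so each copy trains independently."""
    source_layers = list(model.model.layers)          # e.g. 30 source layers
    stacked = [
        copy.deepcopy(layer)
        for _ in range(growth_factor)
        for layer in source_layers
    ]                                                 # [0-29, 0-29, 0-29, 0-29]
    model.model.layers = nn.ModuleList(stacked)       # 120 layers for 4x growth
    model.config.num_hidden_layers = len(stacked)
    return model
```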
Training Philosophy:
- Start with proven 170M model (63K steps, 15.83B tokens)
- Stack to 4× depth (120 layers) using cyclic duplication
- Continue training with depth-scaled hyperparameters
- Expected benefit: ~54.6% fewer tokens to reach target loss vs training from scratch (per paper)
Optimizer Configuration (Muon with Phase A)
Muon Learning Rate: 0.000365 (depth-scaled)
- Base LR: 0.001
- Depth scaling: √(16/120) = 0.3651
- Scaled LR: 0.001 × 0.3651 = 0.000365
Auxiliary Learning Rate: 0.0005
Muon Parameters:
- Weight Decay: 1e-5
- Momentum: 0.95
- NS Steps: 3
- Nesterov: Enabled
Per-Layer Learning Rate Multipliers (Phase A), as sketched below:
- self_attn.q_proj: 0.7× (slower for better conditioning)
- self_attn.o_proj: 0.7× (slower for better conditioning)
- self_attn.k_proj: 0.9×
- self_attn.v_proj: 0.9×
- mlp.*: 1.0× (baseline)
- lm_head: 0.85×
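A hedged sketch of how the depth-scaled base rate and the per-projection multipliers above could be turned into optimizer parameter groups. The build_param_groups helper and its matching logic are illustrative; the actual Muon integration used in training is not shown here.

```python
import math

BASE_LR = 0.001
DEPTH_SCALED_LR = BASE_LR * math.sqrt(16 / 120)   # ≈ 0.000365, as listed above

# Per-layer multipliers from the Phase A table above; first matching key wins.
LR_MULTIPLIERS = {
    "self_attn.q_proj": 0.7,
    "self_attn.o_proj": 0.7,
    "self_attn.k_proj": 0.9,
    "self_attn.v_proj": 0.9,
    "mlp.": 1.0,
    "lm_head": 0.85,
}

def build_param_groups(model):
    """Assign each parameter a learning rate of DEPTH_SCALED_LR times its multiplier."""
    groups = []
    for name, param in model.named_parameters():
        mult = next((m for key, m in LR_MULTIPLIERS.items() if key in name), 1.0)
        groups.append({"params": [param], "lr": DEPTH_SCALED_LR * mult})
    return groups
```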
Layer-Selective Spectral Clamping (Phase A)
Adjusted for 120-layer model with more frequent attention clamping:
Attention layers (q_proj, k_proj, v_proj, o_proj):
- Frequency: Every 10 steps (higher frequency for deeper model)
- Max condition: 2500
- Sigma floor: 1e-4
MLP layers (gate_proj, up_proj, down_proj):
- Frequency: Every 50 steps
- Max condition: 3000
- Sigma floor: 5e-5
LM head:
- Frequency: Every 50 steps
- Max condition: 2000
- Sigma floor: 1e-4
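One plausible implementation of the clamping described above, as a sketch: every N steps, run an SVD on the weight, lift small singular values so the condition number stays under the cap, and enforce an absolute sigma floor. The exact clamping rule used in training is not documented in this card, so treat the function below as an assumption; defaults shown are the attention-layer settings.

```python
import torch

@torch.no_grad()
def clamp_spectrum(weight: torch.Tensor, max_condition: float = 2500.0,
                   sigma_floor: float = 1e-4) -> torch.Tensor:
    """Limit the condition number of a 2-D weight matrix by lifting its
    smallest singular values, then apply an absolute floor."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    min_allowed = torch.clamp(S.max() / max_condition, min=sigma_floor)
    S = torch.maximum(S, min_allowed)                 # raise small singular values
    return (U @ torch.diag(S) @ Vh).to(weight.dtype)
```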
Batch Configuration
- Micro Batch Size: 5 (reduced from 6 to fit 500M model)
- Gradient Accumulation: 16 steps (increased from 2)
- Effective Batch: 80 samples (5 × 16)
- Tokens per Step: ~327,680 (80 samples × 4096 tokens)
Additional Regularization
- Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0, threshold=64.0)
- Label Smoothing: 0.01
- Entropy Regularization:
- Entropy bonus weight: 0.003
- Entropy target: 6.5 bits (weight: 0.003)
- Activation norm weight: 0.0006
- Loss chunk size: 1024 tokens
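The entropy terms above are listed only as weights and a target; a common way to implement such a bonus is to penalize the gap between the mean per-token output entropy (in bits) and the target. The function below is a hedged sketch under that assumption, not the exact loss used in training.

```python
import math
import torch
import torch.nn.functional as F

def entropy_target_penalty(logits: torch.Tensor, target_bits: float = 6.5,
                           weight: float = 0.003) -> torch.Tensor:
    """Penalize deviation of the mean per-token predictive entropy from target_bits."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy_nats = -(log_probs.exp() * log_probs).sum(dim=-1)   # per-token entropy
    entropy_bits = entropy_nats / math.log(2.0)
    return weight * (entropy_bits.mean() - target_bits).abs()
```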
Training Data
Dataset Composition (Steps 15400-15600)
Source: alea-institute/kl3m-data-sample-004-balanced (streaming with buffer=32)
Document Distribution:
- RECAP (Court filings, briefs, motions): 65.75%
- GovInfo (Federal regulations, govt docs): 16.41%
- EDGAR (SEC filings, contracts): 8.20%
- USPTO (Patents): ~1.4%
- eCFR (Regulations): ~0.7%
- Other (FR, FDLP, CAP, etc.): ~7.5%
Data Characteristics:
- Streaming: Enabled with shuffle buffer=32
- Pack Across Records: Enabled (efficient 4K context filling)
- Format: Multi-document spans per 4K context window
Note: The training data is heavily weighted toward court documents (RECAP), which influences the model's generation style and domain expertise.
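A sketch of the streaming setup described above, assuming the dataset loads with the datasets library and exposes a train split; only the repo id and the shuffle buffer size (32) come from this card.

```python
from datasets import load_dataset

stream = load_dataset(
    "alea-institute/kl3m-data-sample-004-balanced",
    split="train",          # split name assumed
    streaming=True,
).shuffle(buffer_size=32, seed=42)

# Peek at one record to see the available fields.
for example in stream.take(1):
    print(list(example.keys()))
```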
Layer Evolution Analysis
Analysis of Q-projection weight norms across checkpoints (Steps 14600-15000):
Stack-Level Divergence
Divergence Timeline:
- Step 14600: 6.43%
- Step 14700: 6.44%
- Step 14800: 6.45%
- Step 14900: 6.47%
- Step 15000: 6.48%
Steady increase indicates continued layer specialization and differentiation of the 4 duplicated stacks.
Top Evolving Layers (14600 → 15000)
Fastest evolving (most specialization):
- Layer 26 (Stack 1): +0.104 (+0.58%)
- Layer 22 (Stack 1): +0.095 (+0.57%)
- Layer 111 (Stack 4): +0.082 (+0.49%)
- Layer 112 (Stack 4): +0.079 (+0.48%)
- Layer 50 (Stack 2): +0.076 (+0.48%)
Slowest evolving (most stable):
- Layer 64 (Stack 3): +0.003 (+0.02%)
- Layer 91 (Stack 4): +0.003 (+0.02%)
- Layer 1 (Stack 1): +0.003 (+0.02%)
Stack-Level Changes (14600 → 15000):
- Stack 1 (layers 0-29): +0.034 (+0.21%)
- Stack 2 (layers 30-59): +0.018 (+0.12%)
- Stack 3 (layers 60-89): +0.021 (+0.14%)
- Stack 4 (layers 90-119): +0.026 (+0.16%)
Interpretation: The 4 stacked copies are successfully diverging and specializing, with Stack 1 (early layers) showing the most evolution and the middle stacks (Stack 2 in particular) remaining the most stable.
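The divergence figures above come from comparing Q-projection weight norms across the four stacked copies; the exact metric is not specified in this card, so the sketch below shows one plausible definition (relative spread of corresponding q_proj norms, averaged over source-layer positions).

```python
import torch

@torch.no_grad()
def stack_divergence_pct(model, n_source_layers: int = 30, growth: int = 4) -> float:
    """Average relative spread of q_proj weight norms across the stacked copies."""
    layers = model.model.layers
    spreads = []
    for i in range(n_source_layers):
        norms = torch.tensor([
            layers[i + s * n_source_layers].self_attn.q_proj.weight.norm().item()
            for s in range(growth)
        ])
        spreads.append(((norms.max() - norms.min()) / norms.mean()).item())
    return 100.0 * sum(spreads) / len(spreads)
```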
Generation Quality
Recommended Sampling Parameters
temperature = 0.5
top_p = 0.9
repetition_penalty = 1.2
max_new_tokens = 128
Known Characteristics
Strengths:
- Coherent legal text generation
- Proper document structure understanding
- Multi-domain legal knowledge (court, regulatory, corporate)
- Good vocabulary for legal terminology
Limitations:
- Domain mixing: Model occasionally shifts from contract language to court filing format
- Data bias: The training distribution is heavily weighted toward RECAP court documents (65%) versus EDGAR contracts (8%), so the model tends to default to court filing patterns
- Repetition: Benefits significantly from repetition_penalty=1.2
Context Confusion Example:
- Prompt: "GOVERNING LAW. This Agreement shall be governed by"
- Issue: May generate "IN THE UNITED STATES DISTRICT COURT" headers inappropriately
- Cause: Court documents (RECAP) represent 65% of training data
Recommended Use Cases:
- Legal text understanding and analysis
- Court document processing and summarization
- Multi-domain legal corpus search and retrieval
- Research on model stacking and depth scaling
Not Recommended:
- Pure contract generation (use contract-specific fine-tuned models)
- Production legal document drafting without review
- Cases requiring strict separation of document types
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"alea-institute/kl3m-007-500m-checkpoint-15500",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-checkpoint-15500")
# Generate with recommended parameters
inputs = tokenizer(
"<|start|>This Agreement is entered into as of",
return_tensors="pt"
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=128,
temperature=0.5,
top_p=0.9,
repetition_penalty=1.2,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Pipeline Usage
from transformers import pipeline
generator = pipeline(
"text-generation",
model="alea-institute/kl3m-007-500m-checkpoint-15500",
torch_dtype="auto",
device_map="auto"
)
outputs = generator(
"<|start|>WHEREAS, the parties desire to enter into",
max_new_tokens=128,
temperature=0.5,
top_p=0.9,
repetition_penalty=1.2
)
print(outputs[0]['generated_text'])
Training Infrastructure
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled (non-reentrant)
- Flash Attention: Auto-enabled
- TF32 Mode: Auto
- Optimizer State: Saved with checkpoint
- Tracking: Weights & Biases (wandb)
Model Comparison
| Model | Layers | Params | Steps | Tokens | Loss | Use Case |
|---|---|---|---|---|---|---|
| kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | 63K | 15.83B | 3.18 | Production 170M |
| kl3m-007-500m-step0 | 120 | 500.3M | 0 | 0 | ~8.0 | Stacked init |
| kl3m-007-500m-checkpoint-15500 | 120 | 500.3M | 15.5K | 2.16B | 2.62 | Current |
| kl3m-007-500m-checkpoint-* (future) | 120 | 500.3M | 100K+ | 14B+ | <2.5 | Production 500M target |
Training Trajectory & Expectations
Achieved (Step 15500)
- ✓ Successful layer divergence (6.48% cross-stack variance)
- ✓ Stable gradient norms (mean 2.27, no clipping needed)
- ✓ 65% loss reduction from initialization
- ✓ Multi-domain legal text generation capability
- ✓ Proper spectral conditioning maintained
In Progress
- ⚠️ Domain mixing issues (court vs contract language)
- ⚠️ Data distribution imbalance affecting generation style
- ↻ Continued layer specialization (stacks still differentiating)
- ↻ Loss descent toward production target
Expected (Future Checkpoints)
Target Performance (by step 50-100K):
- Loss < 2.5 (matching/exceeding 170M@63K quality)
- Improved domain separation with continued training
- Fuller stack specialization (>10% divergence)
- Production-ready generation quality
G_stack Efficiency Gains (per NeurIPS 2024 paper):
- ~54.6% fewer tokens to reach target loss vs training 120L from scratch
- ~45% computational savings (complementary to token efficiency)
- Expected to match 170M quality with significantly fewer resources
Spectral Health
Inherited from source checkpoint + Phase A regularization:
Attention Layers:
- Max condition: Clamped at 2500 (every 10 steps)
- Median condition: ~2168 (from source)
- Well-conditioned through aggressive clamping frequency
MLP Layers:
- Max condition: Clamped at 3000 (every 50 steps)
- Median: ~5-8 (excellent)
- Inherited stability from 170M source
LM Head:
- Max condition: Clamped at 2000 (every 50 steps)
- Excellent conditioning (~280 from source)
Training Stability: Zero gradient clipping events in recent window indicates healthy optimization landscape.
Next Steps
For Continued Training
- Monitor domain mixing: Track RECAP vs EDGAR influence on generations
- Rebalance data (optional): Consider increasing EDGAR proportion for contract focus
- Target milestones:
- Step 25K: Expect loss ~2.4-2.5
- Step 50K: Expect loss ~2.3-2.4
- Step 100K: Expect production quality (loss <2.3)
For Deployment
- Use with proper sampling parameters: temperature=0.5, top_p=0.9, repetition_penalty=1.2
- Domain-specific prompting: Add context to guide toward contracts vs court documents (see the example after this list)
- Consider fine-tuning: For domain-specific applications (e.g., pure contract generation)
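As an example of the domain-specific prompting suggestion above (reusing the generator pipeline from the Usage section), a contract-style header in the prompt tends to steer the model away from court-filing boilerplate. The prompt text is illustrative only.

```python
# Contract-style context in the prompt; sampling parameters as recommended above.
outputs = generator(
    "<|start|>CONSULTING AGREEMENT\n\n12. GOVERNING LAW. This Agreement shall be governed by",
    max_new_tokens=128,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(outputs[0]["generated_text"])
```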
Stacking Metadata
{
"stacking_method": "G_stack (G_direct depthwise)",
"implementation": "cyclic_duplication",
"paper": "Stacking Your Transformers (NeurIPS 2024, arXiv:2405.15319)",
"authors": "Du et al.",
"source_checkpoint": "checkpoints/muon_170m_phase2/step-00063000",
"growth_factor": 4,
"source_layers": 30,
"target_layers": 120,
"stacking_pattern": "[0-29] repeated 4 times",
"training_steps_post_stack": 15500,
"tokens_processed_post_stack": 2160107520,
"layer_divergence_pct": 6.48,
"expected_efficiency_gain": "54.6% fewer tokens (per paper)"
}
Model Card Authors
Alea Institute
Citation
If you use this model, please cite the G_stack paper:
@inproceedings{gstack2024,
title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
booktitle={NeurIPS},
year={2024},
note={arXiv:2405.15319}
}
License
Apache 2.0