KL3M 500M, 7th Gen Model, Checkpoint 15500 (4x Stacked G_stack Training)
A 500M parameter language model trained on multi-domain legal text using G_stack depth expansion (4x cyclic layer duplication from 170M baseline). This checkpoint represents 15,500 steps of continued training on the 120-layer architecture with Muon optimizer and Phase A improvements.
Model Details
- Architecture: Llama-based with Grouped Query Attention (GQA)
- Parameters: 500.3M (487M non-embedding)
- Layers: 120 (4x stacked from 30)
- Source: alea-institute/kl3m-007-500m-step0 (stacked from kl3m-006-170m-checkpoint-63000)
- Training Steps: 15,500 (post-stacking)
- Tokens Processed: 2.16 billion (~139K tokens/step at 4,096 context)
- Sequence Length: 4,096 tokens
- Precision: BF16
Model Architecture
- Hidden Size: 576 (unchanged from 170M source)
- Layers: 120 (4× cyclic duplication from 30)
- Attention Heads: 9 (3 KV heads with GQA)
- Intermediate Size: 1536 (unchanged from source)
- Vocabulary: 131,072 tokens
- RoPE Theta: 100,000
- Parameter Growth: 181.7M → 500.3M (2.75× increase)
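For reference, the architecture above can be expressed as a stock transformers LlamaConfig. This is a sketch for readers who want to inspect the shapes; the field names are the standard transformers ones, and max_position_embeddings is an assumption based on the 4,096-token context length rather than a value copied from the released config.

```python
from transformers import LlamaConfig

# Sketch of the architecture listed above using standard LlamaConfig fields.
config = LlamaConfig(
    hidden_size=576,
    num_hidden_layers=120,
    num_attention_heads=9,
    num_key_value_heads=3,          # GQA: 3 KV heads
    intermediate_size=1536,
    vocab_size=131072,
    rope_theta=100000.0,
    max_position_embeddings=4096,   # assumption: matches the 4,096-token context
)
```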
Training Progress
Training Metrics at Step 15500
- Loss: 2.24 (instantaneous)
- Loss (10-step average): 2.50
- Loss (100-step average): 2.62
- Learning Rate: 0.000254 (cosine decay from depth-scaled 0.000365)
- Gradient Norm: 1.51 (mean: 2.27, max: 15.74 in window 15400-15600)
- Gradient Clipping Events: 0% (stable training)
- Cumulative Tokens: 2,160,107,520
- Cumulative Samples: 527,370
Loss Trajectory
| Step Range | Loss (100-avg) | Notes |
|---|---|---|
| 1-100 | ~7.66 | Baseline (stacked initialization) |
| 5000-5100 | ~2.63 | Layers diverging, specializing |
| 11000-12000 | ~2.52 | Continued improvement |
| 14000-15000 | ~2.63 | Stabilizing |
| 15400-15600 | 2.62-2.81 | Current |
Overall improvement: ~65% reduction in loss from initialization to step 15500.
G_stack Training Configuration
Stacking Method
Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):
Stacking Pattern (G_stack = G_direct depthwise):
Source (30 layers): [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
└── cyclic repetition 4 times (G_stack / G_direct)
Method: Direct duplication in depthwise manner (G_stack operator from paper)
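A minimal sketch of the cyclic duplication, assuming a Llama-style model whose decoder layers live at model.model.layers; the actual stacking script used to produce kl3m-007-500m-step0 is not included in this card.

```python
import copy
import torch.nn as nn

def g_stack(model, growth_factor: int = 4):
    """Cyclic depthwise duplication (G_stack): repeat the source layer list
    growth_factor times, deep-copying so each copy trains independently."""
    source_layers = list(model.model.layers)          # e.g. 30 source layers
    stacked = [
        copy.deepcopy(layer)
        for _ in range(growth_factor)
        for layer in source_layers
    ]                                                 # [0-29, 0-29, 0-29, 0-29]
    model.model.layers = nn.ModuleList(stacked)       # 120 layers for 4x growth
    model.config.num_hidden_layers = len(stacked)
    return model
```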
Training Philosophy:
- Start with proven 170M model (63K steps, 15.83B tokens)
- Stack to 4× depth (120 layers) using cyclic duplication
- Continue training with depth-scaled hyperparameters
- Expected benefit: ~54.6% fewer tokens to reach target loss vs training from scratch (per paper)
Optimizer Configuration (Muon with Phase A)
Muon Learning Rate: 0.000365 (depth-scaled)
- Base LR: 0.001
- Depth scaling: √(16/120) = 0.3651
- Scaled LR: 0.001 × 0.3651 = 0.000365
Auxiliary Learning Rate: 0.0005
Muon Parameters:
- Weight Decay: 1e-5
- Momentum: 0.95
- NS Steps: 3
- Nesterov: Enabled
Per-Layer Learning Rate Multipliers (Phase A), as sketched below:
- self_attn.q_proj: 0.7× (slower for better conditioning)
- self_attn.o_proj: 0.7× (slower for better conditioning)
- self_attn.k_proj: 0.9×
- self_attn.v_proj: 0.9×
- mlp.*: 1.0× (baseline)
- lm_head: 0.85×
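A hedged sketch of how the depth-scaled base rate and the per-projection multipliers above could be turned into optimizer parameter groups. The build_param_groups helper and its matching logic are illustrative; the actual Muon integration used in training is not shown here.

```python
import math

BASE_LR = 0.001
DEPTH_SCALED_LR = BASE_LR * math.sqrt(16 / 120)   # ≈ 0.000365, as listed above

# Per-layer multipliers from the Phase A table above; first matching key wins.
LR_MULTIPLIERS = {
    "self_attn.q_proj": 0.7,
    "self_attn.o_proj": 0.7,
    "self_attn.k_proj": 0.9,
    "self_attn.v_proj": 0.9,
    "mlp.": 1.0,
    "lm_head": 0.85,
}

def build_param_groups(model):
    """Assign each parameter a learning rate of DEPTH_SCALED_LR times its multiplier."""
    groups = []
    for name, param in model.named_parameters():
        mult = next((m for key, m in LR_MULTIPLIERS.items() if key in name), 1.0)
        groups.append({"params": [param], "lr": DEPTH_SCALED_LR * mult})
    return groups
```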
Layer-Selective Spectral Clamping (Phase A)
Adjusted for 120-layer model with more frequent attention clamping:
Attention layers (q_proj, k_proj, v_proj, o_proj):
- Frequency: Every 10 steps (higher frequency for deeper model)
- Max condition: 2500
- Sigma floor: 1e-4
MLP layers (gate_proj, up_proj, down_proj):
- Frequency: Every 50 steps
- Max condition: 3000
- Sigma floor: 5e-5
LM head:
- Frequency: Every 50 steps
- Max condition: 2000
- Sigma floor: 1e-4
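One plausible implementation of the clamping described above, as a sketch: every N steps, run an SVD on the weight, lift small singular values so the condition number stays under the cap, and enforce an absolute sigma floor. The exact clamping rule used in training is not documented in this card, so treat the function below as an assumption; defaults shown are the attention-layer settings.

```python
import torch

@torch.no_grad()
def clamp_spectrum(weight: torch.Tensor, max_condition: float = 2500.0,
                   sigma_floor: float = 1e-4) -> torch.Tensor:
    """Limit the condition number of a 2-D weight matrix by lifting its
    smallest singular values, then apply an absolute floor."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    min_allowed = torch.clamp(S.max() / max_condition, min=sigma_floor)
    S = torch.maximum(S, min_allowed)                 # raise small singular values
    return (U @ torch.diag(S) @ Vh).to(weight.dtype)
```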
Batch Configuration
- Micro Batch Size: 5 (reduced from 6 to fit 500M model)
- Gradient Accumulation: 16 steps (increased from 2)
- Effective Batch: 80 samples (5 × 16)
- Tokens per Step: ~327,680 (80 samples × 4096 tokens)
Additional Regularization
- Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0, threshold=64.0)
- Label Smoothing: 0.01
- Entropy Regularization:
- Entropy bonus weight: 0.003
- Entropy target: 6.5 bits (weight: 0.003)
- Activation norm weight: 0.0006
- Loss chunk size: 1024 tokens
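The entropy terms above are listed only as weights and a target; a common way to implement such a bonus is to penalize the gap between the mean per-token output entropy (in bits) and the target. The function below is a hedged sketch under that assumption, not the exact loss used in training.

```python
import math
import torch
import torch.nn.functional as F

def entropy_target_penalty(logits: torch.Tensor, target_bits: float = 6.5,
                           weight: float = 0.003) -> torch.Tensor:
    """Penalize deviation of the mean per-token predictive entropy from target_bits."""
    log_probs = F.log_softmax(logits, dim=-1)
    entropy_nats = -(log_probs.exp() * log_probs).sum(dim=-1)   # per-token entropy
    entropy_bits = entropy_nats / math.log(2.0)
    return weight * (entropy_bits.mean() - target_bits).abs()
```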
Training Data
Dataset Composition (Steps 15400-15600)
Source: alea-institute/kl3m-data-sample-004-balanced (streaming with buffer=32)
Document Distribution:
- RECAP (Court filings, briefs, motions): 65.75%
- GovInfo (Federal regulations, govt docs): 16.41%
- EDGAR (SEC filings, contracts): 8.20%
- USPTO (Patents): ~1.4%
- eCFR (Regulations): ~0.7%
- Other (FR, FDLP, CAP, etc.): ~7.5%
Data Characteristics:
- Streaming: Enabled with shuffle buffer=32
- Pack Across Records: Enabled (efficient 4K context filling)
- Format: Multi-document spans per 4K context window
Note: The training data is heavily weighted toward court documents (RECAP), which influences the model's generation style and domain expertise.
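A sketch of the streaming setup described above, assuming the dataset loads with the datasets library and exposes a train split; only the repo id and the shuffle buffer size (32) come from this card.

```python
from datasets import load_dataset

stream = load_dataset(
    "alea-institute/kl3m-data-sample-004-balanced",
    split="train",          # split name assumed
    streaming=True,
).shuffle(buffer_size=32, seed=42)

# Peek at one record to see the available fields.
for example in stream.take(1):
    print(list(example.keys()))
```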
Layer Evolution Analysis
Analysis of Q-projection weight norms across checkpoints (Steps 14600-15000):
Stack-Level Divergence
Divergence Timeline:
- Step 14600: 6.43%
- Step 14700: 6.44%
- Step 14800: 6.45%
- Step 14900: 6.47%
- Step 15000: 6.48%
Steady increase indicates continued layer specialization and differentiation of the 4 duplicated stacks.
Top Evolving Layers (14600 → 15000)
Fastest evolving (most specialization):
- Layer 26 (Stack 1): +0.104 (+0.58%)
- Layer 22 (Stack 1): +0.095 (+0.57%)
- Layer 111 (Stack 4): +0.082 (+0.49%)
- Layer 112 (Stack 4): +0.079 (+0.48%)
- Layer 50 (Stack 2): +0.076 (+0.48%)
Slowest evolving (most stable):
- Layer 64 (Stack 3): +0.003 (+0.02%)
- Layer 91 (Stack 4): +0.003 (+0.02%)
- Layer 1 (Stack 1): +0.003 (+0.02%)
Stack-Level Changes (14600 → 15000):
- Stack 1 (layers 0-29): +0.034 (+0.21%)
- Stack 2 (layers 30-59): +0.018 (+0.12%)
- Stack 3 (layers 60-89): +0.021 (+0.14%)
- Stack 4 (layers 90-119): +0.026 (+0.16%)
Interpretation: The 4 stacked copies are successfully diverging and specializing, with Stack 1 (early layers) showing the most evolution and the middle stacks (Stack 2 in particular) remaining the most stable.
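The divergence figures above come from comparing Q-projection weight norms across the four stacked copies; the exact metric is not specified in this card, so the sketch below shows one plausible definition (relative spread of corresponding q_proj norms, averaged over source-layer positions).

```python
import torch

@torch.no_grad()
def stack_divergence_pct(model, n_source_layers: int = 30, growth: int = 4) -> float:
    """Average relative spread of q_proj weight norms across the stacked copies."""
    layers = model.model.layers
    spreads = []
    for i in range(n_source_layers):
        norms = torch.tensor([
            layers[i + s * n_source_layers].self_attn.q_proj.weight.norm().item()
            for s in range(growth)
        ])
        spreads.append(((norms.max() - norms.min()) / norms.mean()).item())
    return 100.0 * sum(spreads) / len(spreads)
```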
Generation Quality
Recommended Sampling Parameters
temperature = 0.5
top_p = 0.9
repetition_penalty = 1.2
max_new_tokens = 128
Known Characteristics
Strengths:
- Coherent legal text generation
- Proper document structure understanding
- Multi-domain legal knowledge (court, regulatory, corporate)
- Good vocabulary for legal terminology
Limitations:
- Domain mixing: Model occasionally shifts from contract language to court filing format
- Data bias: The training distribution is heavily weighted toward RECAP court documents (65%) versus EDGAR contracts (8%), so the model tends to default to court filing patterns
- Repetition: Benefits significantly from repetition_penalty=1.2
Context Confusion Example:
- Prompt: "GOVERNING LAW. This Agreement shall be governed by"
- Issue: May generate "IN THE UNITED STATES DISTRICT COURT" headers inappropriately
- Cause: Court documents (RECAP) represent 65% of training data
Recommended Use Cases:
- Legal text understanding and analysis
- Court document processing and summarization
- Multi-domain legal corpus search and retrieval
- Research on model stacking and depth scaling
Not Recommended:
- Pure contract generation (use contract-specific fine-tuned models)
- Production legal document drafting without review
- Cases requiring strict separation of document types
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
"alea-institute/kl3m-007-500m-checkpoint-15500",
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-checkpoint-15500")
# Generate with recommended parameters
inputs = tokenizer(
"<|start|>This Agreement is entered into as of",
return_tensors="pt"
).to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=128,
temperature=0.5,
top_p=0.9,
repetition_penalty=1.2,
do_sample=True
)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
Pipeline Usage
from transformers import pipeline
generator = pipeline(
"text-generation",
model="alea-institute/kl3m-007-500m-checkpoint-15500",
torch_dtype="auto",
device_map="auto"
)
outputs = generator(
"<|start|>WHEREAS, the parties desire to enter into",
max_new_tokens=128,
temperature=0.5,
top_p=0.9,
repetition_penalty=1.2
)
print(outputs[0]['generated_text'])
Training Infrastructure
- Mixed Precision: BF16
- Gradient Checkpointing: Enabled (non-reentrant)
- Flash Attention: Auto-enabled
- TF32 Mode: Auto
- Optimizer State: Saved with checkpoint
- Tracking: Weights & Biases (wandb)
Model Comparison
| Model | Layers | Params | Steps | Tokens | Loss | Use Case |
|---|---|---|---|---|---|---|
| kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | 63K | 15.83B | 3.18 | Production 170M |
| kl3m-007-500m-step0 | 120 | 500.3M | 0 | 0 | ~8.0 | Stacked init |
| kl3m-007-500m-checkpoint-15500 | 120 | 500.3M | 15.5K | 2.16B | 2.62 | Current |
| kl3m-007-500m-checkpoint-* (future) | 120 | 500.3M | 100K+ | 14B+ | <2.5 | Production 500M target |
Training Trajectory & Expectations
Achieved (Step 15500)
- ✓ Successful layer divergence (6.48% cross-stack variance)
- ✓ Stable gradient norms (mean 2.27, no clipping needed)
- ✓ 65% loss reduction from initialization
- ✓ Multi-domain legal text generation capability
- ✓ Proper spectral conditioning maintained
In Progress
- ⚠️ Domain mixing issues (court vs contract language)
- ⚠️ Data distribution imbalance affecting generation style
- ↻ Continued layer specialization (stacks still differentiating)
- ↻ Loss descent toward production target
Expected (Future Checkpoints)
Target Performance (by step 50-100K):
- Loss < 2.5 (matching/exceeding 170M@63K quality)
- Improved domain separation with continued training
- Fuller stack specialization (>10% divergence)
- Production-ready generation quality
G_stack Efficiency Gains (per NeurIPS 2024 paper):
- ~54.6% fewer tokens to reach target loss vs training 120L from scratch
- ~45% computational savings (complementary to token efficiency)
- Expected to match 170M quality with significantly fewer resources
Spectral Health
Inherited from source checkpoint + Phase A regularization:
Attention Layers:
- Max condition: Clamped at 2500 (every 10 steps)
- Median condition: ~2168 (from source)
- Well-conditioned through aggressive clamping frequency
MLP Layers:
- Max condition: Clamped at 3000 (every 50 steps)
- Median: ~5-8 (excellent)
- Inherited stability from 170M source
LM Head:
- Max condition: Clamped at 2000 (every 50 steps)
- Excellent conditioning (~280 from source)
Training Stability: Zero gradient clipping events in recent window indicates healthy optimization landscape.
Next Steps
For Continued Training
- Monitor domain mixing: Track RECAP vs EDGAR influence on generations
- Rebalance data (optional): Consider increasing EDGAR proportion for contract focus
- Target milestones:
- Step 25K: Expect loss ~2.4-2.5
- Step 50K: Expect loss ~2.3-2.4
- Step 100K: Expect production quality (loss <2.3)
For Deployment
- Use with proper sampling parameters: temperature=0.5, top_p=0.9, repetition_penalty=1.2
- Domain-specific prompting: Add context to guide toward contracts vs court documents (see the example after this list)
- Consider fine-tuning: For domain-specific applications (e.g., pure contract generation)
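As an example of the domain-specific prompting suggestion above (reusing the generator pipeline from the Usage section), a contract-style header in the prompt tends to steer the model away from court-filing boilerplate. The prompt text is illustrative only.

```python
# Contract-style context in the prompt; sampling parameters as recommended above.
outputs = generator(
    "<|start|>CONSULTING AGREEMENT\n\n12. GOVERNING LAW. This Agreement shall be governed by",
    max_new_tokens=128,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.2,
)
print(outputs[0]["generated_text"])
```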
Stacking Metadata
{
"stacking_method": "G_stack (G_direct depthwise)",
"implementation": "cyclic_duplication",
"paper": "Stacking Your Transformers (NeurIPS 2024, arXiv:2405.15319)",
"authors": "Du et al.",
"source_checkpoint": "checkpoints/muon_170m_phase2/step-00063000",
"growth_factor": 4,
"source_layers": 30,
"target_layers": 120,
"stacking_pattern": "[0-29] repeated 4 times",
"training_steps_post_stack": 15500,
"tokens_processed_post_stack": 2160107520,
"layer_divergence_pct": 6.48,
"expected_efficiency_gain": "54.6% fewer tokens (per paper)"
}
Model Card Authors
Alea Institute
Citation
If you use this model, please cite the G_stack paper:
@inproceedings{gstack2024,
title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
booktitle={NeurIPS},
year={2024},
note={arXiv:2405.15319}
}
License
Apache 2.0