KL3M 500M, 7th Gen Model, Checkpoint 15500 (4x Stacked G_stack Training)

A 500M parameter language model trained on multi-domain legal text using G_stack depth expansion (4x cyclic layer duplication from a 170M baseline). This checkpoint represents 15,500 steps of continued training on the 120-layer architecture with the Muon optimizer and Phase A improvements.

Model Details

  • Architecture: Llama-based with Grouped Query Attention (GQA)
  • Parameters: 500.3M (487M non-embedding)
  • Layers: 120 (4x stacked from 30)
  • Source: alea-institute/kl3m-007-500m-step0 (stacked from kl3m-006-170m-checkpoint-63000)
  • Training Steps: 15,500 (post-stacking)
  • Tokens Processed: 2.16 billion (at 4,096-token context)
  • Sequence Length: 4,096 tokens
  • Precision: BF16

Model Architecture

  • Hidden Size: 576 (unchanged from 170M source)
  • Layers: 120 (4× cyclic duplication from 30)
  • Attention Heads: 9 (3 KV heads with GQA)
  • Intermediate Size: 1536 (unchanged from source)
  • Vocabulary: 131,072 tokens
  • RoPE Theta: 100,000
  • Parameter Growth: 181.7M → 500.3M (2.75× increase)
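
For reference, the architecture above corresponds approximately to the following Hugging Face LlamaConfig. This is a sketch for orientation only; the config.json shipped with the checkpoint is authoritative.

from transformers import LlamaConfig

config = LlamaConfig(
    hidden_size=576,
    intermediate_size=1536,
    num_hidden_layers=120,
    num_attention_heads=9,
    num_key_value_heads=3,        # GQA: 3 KV heads shared across 9 query heads
    vocab_size=131072,
    rope_theta=100000.0,
    max_position_embeddings=4096,
)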

Training Progress

Training Metrics at Step 15500

  • Loss: 2.24 (instantaneous)
  • Loss (10-step average): 2.50
  • Loss (100-step average): 2.62
  • Learning Rate: 0.000254 (cosine decay from depth-scaled 0.000365)
  • Gradient Norm: 1.51 (mean: 2.27, max: 15.74 in window 15400-15600)
  • Gradient Clipping Events: 0% (stable training)
  • Cumulative Tokens: 2,160,107,520
  • Cumulative Samples: 527,370

Loss Trajectory

| Step Range | Loss (100-avg) | Notes |
|---|---|---|
| 1-100 | ~7.66 | Baseline (stacked initialization) |
| 5000-5100 | ~2.63 | Layers diverging, specializing |
| 11000-12000 | ~2.52 | Continued improvement |
| 14000-15000 | ~2.63 | Stabilizing |
| 15400-15600 | 2.62-2.81 | Current |

Overall improvement: ~65% reduction in loss from initialization to step 15500.

G_stack Training Configuration

Stacking Method

Based on "Stacking Your Transformers" (NeurIPS 2024, arXiv:2405.15319):

Stacking Pattern (G_stack = G_direct depthwise):

Source (30 layers):  [0, 1, 2, ..., 28, 29]
Target (120 layers): [0-29, 0-29, 0-29, 0-29]
                     └─ Cyclic repetition 4 times (G_stack/G_direct)

Method: Direct duplication in a depthwise manner (the G_stack operator from the paper)
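
A minimal sketch of this cyclic duplication for a Hugging Face Llama model is shown below. It is illustrative only, not the exact stacking script; the g_stack/growth_factor names and the layer_idx fix-up are assumptions.

import copy
import torch.nn as nn
from transformers import AutoModelForCausalLM

def g_stack(model, growth_factor: int = 4):
    """Cyclically repeat the decoder layers growth_factor times (G_stack/G_direct)."""
    src = model.model.layers                              # 30 layers in the 170M source
    stacked = nn.ModuleList(
        copy.deepcopy(src[i % len(src)])
        for i in range(len(src) * growth_factor)          # [0-29, 0-29, 0-29, 0-29]
    )
    # Re-index attention modules so KV caching sees 120 distinct layer positions
    # (relevant on transformers versions that store layer_idx on the attention module).
    for new_idx, layer in enumerate(stacked):
        if hasattr(layer.self_attn, "layer_idx"):
            layer.self_attn.layer_idx = new_idx
    model.model.layers = stacked
    model.config.num_hidden_layers = len(stacked)         # 120
    return model

# Usage (source repo id assumed from the source listed above):
# model = AutoModelForCausalLM.from_pretrained("alea-institute/kl3m-006-170m-checkpoint-63000")
# model = g_stack(model, growth_factor=4)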

Training Philosophy:

  • Start with proven 170M model (63K steps, 15.83B tokens)
  • Stack to 4× depth (120 layers) using cyclic duplication
  • Continue training with depth-scaled hyperparameters
  • Expected benefit: ~54.6% fewer tokens to reach target loss vs training from scratch (per paper)

Optimizer Configuration (Muon with Phase A)

Muon Learning Rate: 0.000365 (depth-scaled)

  • Base LR: 0.001
  • Depth scaling: √(16/120) = 0.3651
  • Scaled LR: 0.001 × 0.3651 = 0.000365

Auxiliary Learning Rate: 0.0005

Muon Parameters:

  • Weight Decay: 1e-5
  • Momentum: 0.95
  • NS Steps: 3
  • Nesterov: Enabled

Per-Layer Learning Rate Multipliers (Phase A):

  • self_attn.q_proj: 0.7× (slower for better conditioning)
  • self_attn.o_proj: 0.7× (slower for better conditioning)
  • self_attn.k_proj: 0.9×
  • self_attn.v_proj: 0.9×
  • mlp.*: 1.0× (baseline)
  • lm_head: 0.85×
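
A hedged, optimizer-agnostic sketch of how these multipliers could be expressed as parameter groups. The multiplier table mirrors the list above; the grouping helper and the base_lr default are illustrative, not the exact training code.

import torch

LR_MULTIPLIERS = {
    "self_attn.q_proj": 0.7,
    "self_attn.o_proj": 0.7,
    "self_attn.k_proj": 0.9,
    "self_attn.v_proj": 0.9,
    "lm_head": 0.85,
}

def build_param_groups(model: torch.nn.Module, base_lr: float = 3.65e-4):
    """Group parameters by LR multiplier; MLP and anything unmatched stays at 1.0x."""
    by_mult = {}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        mult = next((m for pattern, m in LR_MULTIPLIERS.items() if pattern in name), 1.0)
        by_mult.setdefault(mult, []).append(param)
    return [{"params": params, "lr": base_lr * mult} for mult, params in by_mult.items()]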

Layer-Selective Spectral Clamping (Phase A)

Adjusted for the 120-layer model, with more frequent attention clamping:

Attention layers (q_proj, k_proj, v_proj, o_proj):

  • Frequency: Every 10 steps (higher frequency for deeper model)
  • Max condition: 2500
  • Sigma floor: 1e-4

MLP layers (gate_proj, up_proj, down_proj):

  • Frequency: Every 50 steps
  • Max condition: 3000
  • Sigma floor: 5e-5

LM head:

  • Frequency: Every 50 steps
  • Max condition: 2000
  • Sigma floor: 1e-4
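
A sketch of what such clamping can look like, as one reasonable implementation under the settings above; the exact routine used in training may differ.

import torch

@torch.no_grad()
def clamp_spectrum(weight: torch.Tensor, max_condition: float, sigma_floor: float):
    """Clamp singular values so cond(W) <= max_condition and sigma_min >= sigma_floor."""
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    floor = max(sigma_floor, (S.max() / max_condition).item())  # whichever bound is tighter
    S = S.clamp_min(floor)
    weight.copy_((U * S) @ Vh)                                   # write back in the original dtype

# e.g., attention projections every 10 steps:
# if step % 10 == 0:
#     for layer in model.model.layers:
#         clamp_spectrum(layer.self_attn.q_proj.weight, max_condition=2500, sigma_floor=1e-4)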

Batch Configuration

  • Micro Batch Size: 5 (reduced from 6 to fit the 500M model)
  • Gradient Accumulation: 16 steps (increased from 2)
  • Effective Batch: 80 samples (5 × 16)
  • Tokens per Step: ~327,680 (80 samples × 4,096 tokens)

Additional Regularization

  • Adaptive Gradient Clipping: Enabled (β=0.9, coeff=2.0, threshold=64.0); see the sketch after this list
  • Label Smoothing: 0.01
  • Entropy Regularization:
    • Entropy bonus weight: 0.003
    • Entropy target: 6.5 bits (weight: 0.003)
    • Activation norm weight: 0.0006
    • Loss chunk size: 1024 tokens
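
One plausible reading of the adaptive gradient clipping settings, as a hedged sketch: clip to coeff times an exponential moving average of the gradient norm, with a hard ceiling at threshold. The actual interaction of these knobs in the training loop may differ.

import torch

class AdaptiveGradClipper:
    """Clip to coeff * EMA(grad norm), capped at a fixed threshold."""
    def __init__(self, beta: float = 0.9, coeff: float = 2.0, threshold: float = 64.0):
        self.beta, self.coeff, self.threshold = beta, coeff, threshold
        self.ema = None

    def __call__(self, parameters) -> float:
        params = [p for p in parameters if p.grad is not None]
        limit = self.threshold if self.ema is None else min(self.coeff * self.ema, self.threshold)
        total = torch.nn.utils.clip_grad_norm_(params, max_norm=limit)  # returns pre-clip norm
        norm = float(total)
        self.ema = norm if self.ema is None else self.beta * self.ema + (1.0 - self.beta) * norm
        return norm

# clipper = AdaptiveGradClipper()
# grad_norm = clipper(model.parameters())   # call after backward(), before optimizer.step()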

Training Data

Dataset Composition (Steps 15400-15600)

Source: alea-institute/kl3m-data-sample-004-balanced (streaming with buffer=32)

Document Distribution:

  • RECAP (Court filings, briefs, motions): 65.75%
  • GovInfo (Federal regulations, govt docs): 16.41%
  • EDGAR (SEC filings, contracts): 8.20%
  • USPTO (Patents): ~1.4%
  • eCFR (Regulations): ~0.7%
  • Other (FR, FDLP, CAP, etc.): ~7.5%

Data Characteristics:

  • Streaming: Enabled with shuffle buffer=32
  • Pack Across Records: Enabled (efficient 4K context filling)
  • Format: Multi-document spans per 4K context window
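
A minimal sketch of the streaming setup described above, using the Hugging Face datasets library. Packing multiple documents into 4K-token windows is omitted; buffer_size=32 matches the shuffle buffer listed, and the split name is an assumption.

from datasets import load_dataset

# Stream the balanced sample without downloading the full corpus.
dataset = load_dataset(
    "alea-institute/kl3m-data-sample-004-balanced",
    split="train",          # split name assumed
    streaming=True,
)
dataset = dataset.shuffle(buffer_size=32, seed=42)

for example in dataset.take(2):
    print(list(example.keys()))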

Note: The training data is heavily weighted toward court documents (RECAP), which influences the model's generation style and domain expertise.

Layer Evolution Analysis

Analysis of Q-projection weight norms across checkpoints (Steps 14600-15000):

Stack-Level Divergence

Divergence Timeline:

  • Step 14600: 6.43%
  • Step 14700: 6.44%
  • Step 14800: 6.45%
  • Step 14900: 6.47%
  • Step 15000: 6.48%

The steady increase indicates continued layer specialization and differentiation of the 4 duplicated stacks.

Top Evolving Layers (14600 → 15000)

Fastest evolving (most specialization):

  1. Layer 26 (Stack 1): +0.104 (+0.58%)
  2. Layer 22 (Stack 1): +0.095 (+0.57%)
  3. Layer 111 (Stack 4): +0.082 (+0.49%)
  4. Layer 112 (Stack 4): +0.079 (+0.48%)
  5. Layer 50 (Stack 2): +0.076 (+0.48%)

Slowest evolving (most stable):

  1. Layer 64 (Stack 3): +0.003 (+0.02%)
  2. Layer 91 (Stack 4): +0.003 (+0.02%)
  3. Layer 1 (Stack 1): +0.003 (+0.02%)

Stack-Level Changes (14600 → 15000):

  • Stack 1 (layers 0-29): +0.034 (+0.21%)
  • Stack 2 (layers 30-59): +0.018 (+0.12%)
  • Stack 3 (layers 60-89): +0.021 (+0.14%)
  • Stack 4 (layers 90-119): +0.026 (+0.16%)

Interpretation: The 4 stacked copies are successfully diverging and specializing, with Stack 1 (layers 0-29) showing the most evolution and Stack 2 (layers 30-59) the most stable over this window.
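
The exact divergence metric behind the percentages above is not specified here; the following hedged sketch shows one reasonable variant of this kind of analysis, comparing q_proj weight norms across the 4 copies of each source layer.

import torch
from transformers import AutoModelForCausalLM

def cross_stack_divergence(model, n_source_layers: int = 30, n_stacks: int = 4) -> float:
    """Relative spread of q_proj weight norms across the 4 copies of each source layer."""
    norms = torch.tensor([
        layer.self_attn.q_proj.weight.norm().item()
        for layer in model.model.layers
    ]).reshape(n_stacks, n_source_layers)       # rows: stacks 1-4, cols: source layers 0-29
    spread = norms.std(dim=0) / norms.mean(dim=0)
    return spread.mean().item() * 100.0         # percent

model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-checkpoint-15500", torch_dtype="auto"
)
print(f"cross-stack divergence: {cross_stack_divergence(model):.2f}%")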

Generation Quality

Recommended Sampling Parameters

temperature = 0.5
top_p = 0.9
repetition_penalty = 1.2
max_new_tokens = 128

Known Characteristics

Strengths:

  • Coherent legal text generation
  • Proper document structure understanding
  • Multi-domain legal knowledge (court, regulatory, corporate)
  • Good vocabulary for legal terminology

Limitations:

  • Domain mixing: Model occasionally shifts from contract language to court filing format
  • Data bias: Training is heavily skewed toward RECAP court documents (65%) versus EDGAR contracts (8%), so the model defaults to court-filing patterns
  • Repetition: Benefits significantly from repetition_penalty=1.2

Context Confusion Example:

  • Prompt: "GOVERNING LAW. This Agreement shall be governed by"
  • Issue: May generate "IN THE UNITED STATES DISTRICT COURT" headers inappropriately
  • Cause: Court documents (RECAP) represent 65% of training data

Recommended Use Cases:

  • Legal text understanding and analysis
  • Court document processing and summarization
  • Multi-domain legal corpus search and retrieval
  • Research on model stacking and depth scaling

Not Recommended:

  • Pure contract generation (use contract-specific fine-tuned models)
  • Production legal document drafting without review
  • Cases requiring strict separation of document types

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "alea-institute/kl3m-007-500m-checkpoint-15500",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-007-500m-checkpoint-15500")

# Generate with recommended parameters
inputs = tokenizer(
    "<|start|>This Agreement is entered into as of",
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.2,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=False))

Pipeline Usage

from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="alea-institute/kl3m-007-500m-checkpoint-15500",
    torch_dtype="auto",
    device_map="auto"
)

outputs = generator(
    "<|start|>WHEREAS, the parties desire to enter into",
    max_new_tokens=128,
    temperature=0.5,
    top_p=0.9,
    repetition_penalty=1.2
)

print(outputs[0]['generated_text'])

Training Infrastructure

  • Mixed Precision: BF16
  • Gradient Checkpointing: Enabled (non-reentrant)
  • Flash Attention: Auto-enabled
  • TF32 Mode: Auto
  • Optimizer State: Saved with checkpoint
  • Tracking: Weights & Biases (wandb)

Model Comparison

| Model | Layers | Params | Steps | Tokens | Loss | Use Case |
|---|---|---|---|---|---|---|
| kl3m-006-170m-checkpoint-63000 | 30 | 181.7M | 63K | 15.83B | 3.18 | Production 170M |
| kl3m-007-500m-step0 | 120 | 500.3M | 0 | 0 | ~8.0 | Stacked init |
| kl3m-007-500m-checkpoint-15500 | 120 | 500.3M | 15.5K | 2.16B | 2.62 | Current |
| kl3m-007-500m-checkpoint-* (future) | 120 | 500.3M | 100K+ | 14B+ | <2.5 | Production 500M target |

Training Trajectory & Expectations

Achieved (Step 15500)

✓ Successful layer divergence (6.48% cross-stack variance)
✓ Stable gradient norms (mean 2.27, no clipping needed)
✓ 65% loss reduction from initialization
✓ Multi-domain legal text generation capability
✓ Proper spectral conditioning maintained

In Progress

⚠️ Domain mixing issues (court vs. contract language)
⚠️ Data distribution imbalance affecting generation style
↻ Continued layer specialization (stacks still differentiating)
↻ Loss descent toward production target

Expected (Future Checkpoints)

Target Performance (by step 50-100K):

  • Loss < 2.5 (matching/exceeding 170M@63K quality)
  • Improved domain separation with continued training
  • Fuller stack specialization (>10% divergence)
  • Production-ready generation quality

G_stack Efficiency Gains (per NeurIPS 2024 paper):

  • ~54.6% fewer tokens to reach target loss vs training 120L from scratch
  • ~45% computational savings (complementary to token efficiency)
  • Expected to match 170M quality with significantly fewer resources

Spectral Health

Inherited from source checkpoint + Phase A regularization:

Attention Layers:

  • Max condition: Clamped at 2500 (every 10 steps)
  • Median condition: ~2168 (from source)
  • Well-conditioned through aggressive clamping frequency

MLP Layers:

  • Max condition: Clamped at 3000 (every 50 steps)
  • Median: ~5-8 (excellent)
  • Inherited stability from 170M source

LM Head:

  • Max condition: Clamped at 2000 (every 50 steps)
  • Excellent conditioning (~280 from source)

Training Stability: Zero gradient clipping events in recent window indicates healthy optimization landscape.

Next Steps

For Continued Training

  1. Monitor domain mixing: Track RECAP vs EDGAR influence on generations
  2. Rebalance data (optional): Consider increasing EDGAR proportion for contract focus
  3. Target milestones:
    • Step 25K: Expect loss ~2.4-2.5
    • Step 50K: Expect loss ~2.3-2.4
    • Step 100K: Expect production quality (loss <2.3)

For Deployment

  1. Use with proper sampling parameters: temperature=0.5, top_p=0.9, repetition_penalty=1.2
  2. Domain-specific prompting: Add context to guide toward contracts vs court documents
  3. Consider fine-tuning: For domain-specific applications (e.g., pure contract generation)

Stacking Metadata

{
  "stacking_method": "G_stack (G_direct depthwise)",
  "implementation": "cyclic_duplication",
  "paper": "Stacking Your Transformers (NeurIPS 2024, arXiv:2405.15319)",
  "authors": "Du et al.",
  "source_checkpoint": "checkpoints/muon_170m_phase2/step-00063000",
  "growth_factor": 4,
  "source_layers": 30,
  "target_layers": 120,
  "stacking_pattern": "[0-29] repeated 4 times",
  "training_steps_post_stack": 15500,
  "tokens_processed_post_stack": 2160107520,
  "layer_divergence_pct": 6.48,
  "expected_efficiency_gain": "54.6% fewer tokens (per paper)"
}

Model Card Authors

Alea Institute

Citation

If you use this model, please cite the G_stack paper:

@inproceedings{gstack2024,
  title={Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training},
  author={Du, Wenyu and Luo, Tongxu and Qiu, Zihan and Huang, Zeyu and Shen, Yikang and Cheng, Reynold and Guo, Yike and Fu, Jie},
  booktitle={NeurIPS},
  year={2024},
  note={arXiv:2405.15319}
}

License

Apache 2.0
