
# ToGMAL Next Steps: Adaptive Scoring & Nested CV

Updated: 2025-10-21

This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.


## 🎯 Immediate Goals (This Week)

### 1. Implement Adaptive Uncertainty-Aware Scoring

- **Problem**: The current naive weighted average fails on low-similarity matches.
- **Example failure**: "Prove universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk.
- **Solution**: Add uncertainty penalties (see the sketch below this list) when:
  - Max similarity < 0.7 (weak best match)
  - High variance in k-NN similarities (diverse, unreliable matches)
  - Low average similarity (all matches are weak)
- **File to modify**: `benchmark_vector_db.py::query_similar_questions()`
- **Expected improvement**: 5-15% AUROC gain on low-similarity cases
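
One plausible shape for the penalty logic is sketched below; the function name matches Phase 1, while the default thresholds and weights mirror the hyperparameter grid later in this document and are placeholders until tuned:

```python
import numpy as np

def _compute_adaptive_difficulty(
    similarities: np.ndarray,     # k-NN cosine similarities, shape (k,)
    neighbor_risks: np.ndarray,   # risk/difficulty scores of those neighbors, in [0, 1]
    similarity_threshold: float = 0.7,
    low_sim_penalty: float = 0.5,
    variance_penalty: float = 2.0,
    low_avg_penalty: float = 0.4,
) -> float:
    """Similarity-weighted risk, pushed upward when the matches are unreliable."""
    weights = similarities / similarities.sum()
    base = float(np.dot(weights, neighbor_risks))  # the current naive estimate

    uncertainty = 0.0
    if similarities.max() < similarity_threshold:   # weak best match
        uncertainty += low_sim_penalty * (similarity_threshold - similarities.max())
    uncertainty += variance_penalty * float(np.var(similarities))  # diverse matches
    if similarities.mean() < similarity_threshold:  # all matches weak
        uncertainty += low_avg_penalty * (similarity_threshold - similarities.mean())

    # Unknown territory should raise the estimated risk, never lower it.
    return min(1.0, base + uncertainty)
```

On the failure case above (max similarity ~0.57), both the weak-best-match and low-average penalties would fire, lifting the estimate out of the LOW band instead of trusting unreliable neighbors.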

### 2. Export Database for Evaluation

- Add a `get_all_questions_as_dataframe()` method to export all ~32K questions
- Prepare for train/val/test splitting and nested CV
- **File to modify**: `benchmark_vector_db.py`
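
A minimal sketch of the export, assuming the database wraps a ChromaDB-style collection (if the actual store differs, only the `get` call changes):

```python
import pandas as pd

def get_all_questions_as_dataframe(self) -> pd.DataFrame:
    """Export every indexed question plus its metadata for offline evaluation."""
    records = self.collection.get(include=["documents", "metadatas"])
    df = pd.DataFrame(records["metadatas"])   # e.g. domain, source, difficulty
    df["question"] = records["documents"]
    df["id"] = records["ids"]                 # ids are returned by default
    return df
```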

### 3. Test Adaptive Scoring

- Create a test script with edge cases (sketched below)
- Compare baseline vs. adaptive scoring on known failure modes
- **New file**: `test_adaptive_scoring.py`
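
A possible skeleton for the script; the prompts are illustrative, and the `adaptive=` flag and `risk` result field are assumptions about the updated query API:

```python
# test_adaptive_scoring.py -- compare baseline vs. adaptive on known failure modes
from benchmark_vector_db import BenchmarkVectorDB  # assumed class name

EDGE_CASES = [
    ("Prove the universe is 10,000 years old", "HIGH"),   # the failure case above
    ("What is the capital of France?", "LOW"),            # easy factual control
    ("Write a proof of the Riemann hypothesis", "HIGH"),  # impossible request
]

def main():
    db = BenchmarkVectorDB()
    for prompt, expected in EDGE_CASES:
        baseline = db.query_similar_questions(prompt, adaptive=False)
        adaptive = db.query_similar_questions(prompt, adaptive=True)
        print(f"{prompt[:45]:<45} expected={expected:<5} "
              f"baseline={baseline['risk']:<6} adaptive={adaptive['risk']}")

if __name__ == "__main__":
    main()
```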

## 📊 Evaluation Framework (Next 2-3 Weeks)

### Why Nested Cross-Validation?

**Problems with a simple train/val/test split:**

- A single validation set can be lucky or unlucky (unrepresentative)
- Repeated "peeking" at the validation set during hyperparameter search causes data leakage
- The test set gives only ONE performance estimate (high variance)

**Nested CV advantages:**

- **Outer loop**: 5-fold CV for an unbiased generalization estimate
- **Inner loop**: 3-fold grid search for hyperparameter tuning
- **No leakage**: test folds are never seen during tuning
- **Robust**: multiple performance estimates across 5 different test sets (see the skeleton below)
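
In outline the procedure looks like the sketch below, where `fit_and_score()` is a hypothetical helper that builds a fold-local vector DB from the training indices and returns AUROC on the held-out indices, and `strata` is an array of stratification labels (see Phase 3):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, StratifiedKFold

def nested_cv(X, y, strata, param_grid, outer_k=5, inner_k=3, seed=42):
    """Outer loop estimates generalization; inner loop picks hyperparameters."""
    outer = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=seed)
    inner = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=seed)
    outer_scores, chosen_params = [], []

    for train_idx, test_idx in outer.split(X, strata):
        # Inner grid search: tunes on training-fold data only
        def inner_score(params):
            return np.mean([
                fit_and_score(X, y, train_idx[tr], train_idx[va], params)
                for tr, va in inner.split(train_idx, strata[train_idx])
            ])
        best = max(ParameterGrid(param_grid), key=inner_score)

        # Outer evaluation: this test fold was never seen during tuning
        outer_scores.append(fit_and_score(X, y, train_idx, test_idx, best))
        chosen_params.append(best)

    return outer_scores, chosen_params  # report mean +/- std of outer_scores
```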

### Hyperparameters to Tune

```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```
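
For scale: this grid spans 4 × 3 × 3 × 3 × 3 = 324 combinations, so 3 inner folds imply 972 scoring runs per outer fold. Since all five parameters affect only query-time scoring, not the embeddings, each fold's vector index can be built once and reused across the entire grid.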

### Evaluation Metrics

1. **AUROC** (primary): discriminative ability (0.5 = random, 1.0 = perfect)
2. **FPR@TPR95**: false positive rate when catching 95% of risky prompts
3. **AUPR**: area under the precision-recall curve (good for imbalanced data)
4. **Expected Calibration Error (ECE)**: are predicted probabilities accurate?
5. **Brier Score**: overall probabilistic prediction accuracy
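
All five can be computed from two arrays (binary labels and predicted risk probabilities). A self-contained sketch, with ECE implemented by hand since scikit-learn does not provide it directly:

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             roc_auc_score, roc_curve)

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Mean |observed risk - predicted risk| over equal-width probability bins."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.clip((y_prob * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece

def evaluate(y_true, y_prob):
    """y_true: 1 = risky/hard, 0 = safe/easy; y_prob: predicted risk in [0, 1]."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "fpr@tpr95": fpr[np.searchsorted(tpr, 0.95)],  # first point with TPR >= 0.95
        "aupr": average_precision_score(y_true, y_prob),
        "ece": expected_calibration_error(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
    }
```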

## 🗂️ Implementation Phases

### Phase 1: Adaptive Scoring (This Week)

- ✓ 32K-question vector database with 20 domains, 7 benchmark sources (done)
- Add a `_compute_adaptive_difficulty()` method
- Integrate uncertainty penalties into scoring
- Test on known failure cases
- Update `togmal_mcp.py` to use adaptive scoring

### Phase 2: Data Export & Baseline (Week 2)

- Add the `get_all_questions_as_dataframe()` export method
- Create a simple 70/15/15 train/val/test split (sketched below this list)
- Run the current ToGMAL (baseline) on the test set
- Compute baseline metrics:
  - AUROC
  - FPR@TPR95
  - Expected Calibration Error
  - Brier Score
- Document failure modes (low similarity, cross-domain, etc.)
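
The split itself is a few lines, stratified so each part mirrors the domain/difficulty mix (the `domain` and `difficulty` column names are assumptions about the exported DataFrame):

```python
from sklearn.model_selection import train_test_split

df = db.get_all_questions_as_dataframe()
strata = df["domain"] + "|" + df["difficulty"].astype(str)

# 70% train, then split the remaining 30% evenly into val and test
train_df, holdout = train_test_split(
    df, test_size=0.30, stratify=strata, random_state=42
)
val_df, test_df = train_test_split(
    holdout, test_size=0.50, stratify=strata.loc[holdout.index], random_state=42
)
```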

### Phase 3: Nested CV Implementation (Week 3)

- Implement the `NestedCVEvaluator` class
- Outer CV: 5-fold, stratified by (domain × difficulty); the stratification key is sketched below
- Inner CV: 3-fold grid search over hyperparameters
- Build a temporary vector DB per fold from training questions only, so held-out questions can never match themselves
- Compute metrics on each outer fold
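
The stratification key and per-fold DB might look as follows; the column names and `BenchmarkVectorDB` construction are assumptions, and rare domain × difficulty combos are merged so `StratifiedKFold` has enough members per class:

```python
# Stratify outer folds on the (domain x difficulty) cross so every fold
# sees the full mix; collapse combos rarer than the fold count.
strata = df["domain"] + "|" + df["difficulty"].astype(str)
strata = strata.where(strata.map(strata.value_counts()) >= 5, other="rare")

def build_fold_db(df, train_idx):
    """Index only training rows, so held-out questions can never
    retrieve themselves as nearest neighbors."""
    fold_db = BenchmarkVectorDB(persist_directory=None)  # assumed constructor
    fold_db.add_questions(df.iloc[train_idx])            # assumed method
    return fold_db
```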

### Phase 4: Hyperparameter Tuning (Week 4)

- Run the full nested CV (5 outer × 3 inner = 15 train/test fits per hyperparameter configuration)
- Collect the best hyperparameters per fold
- Identify the most common optimal parameters
- Compute mean ± std generalization performance
- Compare to baseline

### Phase 5: Final Model & Deployment (Week 5)

- Train the final model on ALL 32K questions with the best hyperparameters
- Re-index the full vector database
- Deploy to the MCP server and HTTP facade
- Test with Claude Desktop

### Phase 6: OOD Testing (Week 6)

- Create OOD test sets:
  - **Adversarial**: "prove a false premise" prompts, jailbreaks
  - **Domain shift**: creative writing, coding, real user queries
  - **Temporal**: new benchmarks (2024+)
- Evaluate on each OOD set
- Analyze performance degradation vs. in-distribution

### Phase 7: Iteration & Documentation (Week 7)

- Analyze failures on the OOD sets
- Add new heuristics for missed patterns
- Re-run nested CV with the updated features
- Generate calibration plots (reliability diagrams)
- Write a technical report

## 📈 Expected Improvements

Based on OOD detection literature and nested CV best practices:

1. **Adaptive scoring**: +5-15% AUROC on low-similarity cases
   - Baseline: ~0.75 AUROC (naive weighted average)
   - Target: ~0.85+ AUROC (adaptive with uncertainty)
2. **Nested CV**: honest, robust performance estimates
   - Simple split: a single point estimate (could be lucky/unlucky)
   - Nested CV: mean ± std across 5 folds
3. **Domain calibration**: 10-20% fewer false positives
   - Expected: FPR@TPR95 drops from ~0.25 to ~0.15
4. **Multi-signal fusion**: better edge-case detection
   - Combines vector similarity with rule-based heuristics
   - Improved recall on adversarial examples
5. **Calibration**: ECE < 0.05
   - Better alignment between predicted risk and actual difficulty

## ✅ Validation Checklist (Before Production Deploy)

- [ ] Nested CV completed with no data leakage
- [ ] Hyperparameters tuned on inner CV folds only
- [ ] Generalization performance estimated on outer CV folds
- [ ] OOD sets tested (adversarial, domain-shift, temporal)
- [ ] Calibration error within acceptable range (ECE < 0.1)
- [ ] Failure modes documented with specific examples
- [ ] Ablation studies show each component contributes
- [ ] Performance: adaptive > baseline on all metrics
- [ ] Real-world testing with user queries

## 🚀 Quick Start

See `togmal_improvement_plan.md` for full implementation details, including:

- Complete code for the `NestedCVEvaluator` class
- The adaptive scoring implementation
- All evaluation metrics with examples
- A detailed roadmap with weekly milestones

**Next Action**: Implement adaptive scoring in `benchmark_vector_db.py` and test it against the edge cases above.