ToGMAL Next Steps: Adaptive Scoring & Nested CV
Updated: 2025-10-21
This document outlines the immediate next steps to improve ToGMAL's difficulty assessment accuracy and establish a rigorous evaluation framework.
Immediate Goals (This Week)
1. Implement Adaptive Uncertainty-Aware Scoring
- Problem: Current naive weighted average fails on low-similarity matches
- Example Failure: "Prove universe is 10,000 years old" → matched to factual recall (similarity ~0.57) → incorrectly rated LOW risk
- Solution: Add uncertainty penalties when:
- Max similarity < 0.7 (weak best match)
- High variance in k-NN similarities (diverse, unreliable matches)
- Low average similarity (all matches are weak)
- File to modify: benchmark_vector_db.py::query_similar_questions()
- Expected improvement: 5-15% AUROC gain on low-similarity cases (a minimal sketch of the penalty logic follows below)
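A minimal sketch of the penalty logic described above. The function name, signature, and default penalty weights are illustrative assumptions, not the existing benchmark_vector_db.py API; it assumes query_similar_questions() can supply the k-NN similarities and the difficulty labels of the matched questions:

```python
import numpy as np

def adaptive_difficulty(similarities, difficulties,
                        similarity_threshold=0.7,
                        low_sim_penalty=0.5,
                        variance_penalty=2.0,
                        low_avg_penalty=0.4):
    """Similarity-weighted difficulty plus uncertainty penalties (illustrative).

    similarities : k-NN cosine similarities to the query (higher = closer match)
    difficulties : difficulty labels of the matched questions, scaled to [0, 1]
    Returns a risk score in [0, 1]; higher = treat the prompt as harder/riskier.
    """
    sims = np.asarray(similarities, dtype=float)
    diffs = np.asarray(difficulties, dtype=float)

    # Baseline: the current naive similarity-weighted average
    weights = sims / max(sims.sum(), 1e-9)
    score = float(weights @ diffs)

    # Uncertainty grows when the neighbor evidence is weak or inconsistent
    uncertainty = 0.0
    if sims.max() < similarity_threshold:                 # weak best match
        uncertainty += low_sim_penalty * (similarity_threshold - sims.max())
    uncertainty += variance_penalty * float(sims.var())   # diverse, unreliable matches
    if sims.mean() < similarity_threshold:                # all matches are weak
        uncertainty += low_avg_penalty * (similarity_threshold - sims.mean())

    # Pull the score toward "unknown / high risk" as uncertainty rises
    return min(1.0, score + uncertainty * (1.0 - score))
```

With this shape, the "Prove universe is 10,000 years old" case (max similarity ~0.57) picks up an uncertainty penalty instead of inheriting the LOW risk of its weakly matched neighbors.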
2. Export Database for Evaluation
- Add get_all_questions_as_dataframe() method to export all 32K questions (see the sketch below)
- Prepare for train/val/test splitting and nested CV
- File to modify: benchmark_vector_db.py
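One possible shape for the export, assuming each indexed question carries its text, domain, source benchmark, and difficulty label; the attribute names below are assumptions about the internal schema, not the actual one:

```python
import pandas as pd

def get_all_questions_as_dataframe(self) -> pd.DataFrame:
    """Export every indexed question as one DataFrame row (hypothetical schema)."""
    records = [
        {
            "question_id": q.id,
            "text": q.text,
            "domain": q.domain,            # used later to stratify CV folds
            "source": q.source_benchmark,  # one of the 7 benchmark sources
            "difficulty": q.difficulty,    # label the difficulty assessor predicts
        }
        for q in self.questions           # assumed in-memory list of indexed questions
    ]
    return pd.DataFrame.from_records(records)
```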
3. Test Adaptive Scoring
- Create test script with edge cases
- Compare baseline vs. adaptive on known failure modes
- New file: test_adaptive_scoring.py (a sketch of one such edge-case test follows below)
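The sketch below exercises the penalty logic from the adaptive_difficulty sketch above directly, using hand-written k-NN results that mimic the known failure; naive_score stands in for the current weighted-average behavior:

```python
# test_adaptive_scoring.py (sketch)
import numpy as np

def naive_score(sims, diffs):
    """Current behavior: similarity-weighted average with no uncertainty term."""
    sims, diffs = np.asarray(sims, dtype=float), np.asarray(diffs, dtype=float)
    return float((sims / sims.sum()) @ diffs)

def test_weak_matches_are_rated_riskier():
    # k-NN results resembling the known failure: best match only ~0.57 similar,
    # and the matched questions all carry low (easy) difficulty labels.
    sims = [0.57, 0.51, 0.48, 0.40, 0.35]
    diffs = [0.20, 0.30, 0.10, 0.25, 0.20]
    baseline = naive_score(sims, diffs)
    adaptive = adaptive_difficulty(sims, diffs)  # sketch from step 1 above
    assert adaptive > baseline  # the uncertainty penalty must raise the risk
```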
Evaluation Framework (Next 2-3 Weeks)
Why Nested Cross-Validation?
Problem with simple train/val/test split:
- Single validation set can be lucky/unlucky (unrepresentative)
- Repeated "peeking" at validation during hyperparameter search causes data leakage
- Test set gives only ONE performance estimate (high variance)
Nested CV advantages:
- Outer loop: 5-fold CV for unbiased generalization estimate
- Inner loop: 3-fold grid search for hyperparameter tuning
- No leakage: Test folds never seen during tuning
- Robust: Multiple performance estimates across 5 different test sets
Hyperparameters to Tune
```python
param_grid = {
    'k_neighbors': [3, 5, 7, 10],
    'similarity_threshold': [0.6, 0.7, 0.8],
    'low_sim_penalty': [0.3, 0.5, 0.7],
    'variance_penalty': [1.0, 2.0, 3.0],
    'low_avg_penalty': [0.2, 0.4, 0.6]
}
```
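A skeleton of how the outer and inner loops could fit together using scikit-learn utilities and the grid above. build_index, predict_risk, and the is_hard label column are assumptions about the eventual NestedCVEvaluator interface, not existing code:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, ParameterGrid
from sklearn.metrics import roc_auc_score

def nested_cv(df, param_grid, outer_folds=5, inner_folds=3, seed=42):
    """Outer loop estimates generalization; inner loop picks hyperparameters."""
    y = df["is_hard"].to_numpy()                                          # assumed binary label
    strata = df["domain"].astype(str) + "_" + df["is_hard"].astype(str)   # domain x difficulty
    outer = StratifiedKFold(outer_folds, shuffle=True, random_state=seed)
    outer_scores, chosen_params = [], []

    for train_idx, test_idx in outer.split(df, strata):
        train_df = df.iloc[train_idx]
        inner = StratifiedKFold(inner_folds, shuffle=True, random_state=seed)
        best_params, best_auroc = None, -np.inf

        # Inner grid search: tuning only ever sees the outer-training data
        for params in ParameterGrid(param_grid):
            fold_aurocs = []
            for fit_idx, val_idx in inner.split(train_df, strata.iloc[train_idx]):
                index = build_index(train_df.iloc[fit_idx])   # hypothetical: temp vector DB per fold
                preds = predict_risk(index, train_df.iloc[val_idx], **params)
                fold_aurocs.append(
                    roc_auc_score(train_df.iloc[val_idx]["is_hard"], preds))
            if np.mean(fold_aurocs) > best_auroc:
                best_auroc, best_params = np.mean(fold_aurocs), params

        # Evaluate the tuned configuration once on the held-out outer fold
        index = build_index(train_df)
        preds = predict_risk(index, df.iloc[test_idx], **best_params)
        outer_scores.append(roc_auc_score(y[test_idx], preds))
        chosen_params.append(best_params)

    return float(np.mean(outer_scores)), float(np.std(outer_scores)), chosen_params
```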
Evaluation Metrics
- AUROC (primary): Discriminative ability (0.5=random, 1.0=perfect)
- FPR@TPR95: False positive rate when catching 95% of risky prompts
- AUPR: Area under precision-recall curve (good for imbalanced data)
- Expected Calibration Error (ECE): Are predicted probabilities accurate?
- Brier Score: Overall probabilistic prediction accuracy
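Most of these come straight from scikit-learn; FPR@TPR95 and ECE need a few lines of glue. The 10-bin ECE below is a standard implementation written here for illustration:

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, roc_curve,
                             average_precision_score, brier_score_loss)

def evaluate(y_true, y_prob):
    """y_true: 1 = risky/hard prompt, 0 = benign; y_prob: predicted risk in [0, 1]."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)

    # FPR@TPR95: false-positive rate at the first threshold that catches 95% of risky prompts
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    fpr_at_tpr95 = float(fpr[np.searchsorted(tpr, 0.95)])

    # Expected Calibration Error over 10 equal-width probability bins
    bins = np.digitize(y_prob, np.linspace(0.1, 0.9, 9))
    ece = 0.0
    for b in range(10):
        mask = bins == b
        if mask.any():
            # bin weight * |observed risky fraction - mean predicted risk|
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())

    return {
        "AUROC": roc_auc_score(y_true, y_prob),
        "FPR@TPR95": fpr_at_tpr95,
        "AUPR": average_precision_score(y_true, y_prob),
        "ECE": float(ece),
        "Brier": brier_score_loss(y_true, y_prob),
    }
```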
Implementation Phases
Phase 1: Adaptive Scoring (This Week)
- 32K vector database with 20 domains, 7 benchmark sources (already built)
- Add _compute_adaptive_difficulty() method
- Integrate uncertainty penalties into scoring
- Test on known failure cases
- Update togmal_mcp.py to use adaptive scoring
Phase 2: Data Export & Baseline (Week 2)
- Add get_all_questions_as_dataframe() export method
- Create simple 70/15/15 train/val/test split (sketched after this list)
- Run current ToGMAL (baseline) on test set
- Compute baseline metrics:
- AUROC
- FPR@TPR95
- Expected Calibration Error
- Brier Score
- Document failure modes (low similarity, cross-domain, etc.)
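The 70/15/15 split can be stratified the same way as the CV folds; this assumes the exported DataFrame has categorical domain and difficulty columns (a continuous difficulty score would need binning first):

```python
from sklearn.model_selection import train_test_split

def split_dataset(df, seed=42):
    """70/15/15 train/val/test split, stratified by domain x difficulty."""
    strata = df["domain"].astype(str) + "_" + df["difficulty"].astype(str)
    train_df, rest = train_test_split(
        df, test_size=0.30, stratify=strata, random_state=seed)
    val_df, test_df = train_test_split(
        rest, test_size=0.50, stratify=strata.loc[rest.index], random_state=seed)
    return train_df, val_df, test_df
```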
Phase 3: Nested CV Implementation (Week 3)
- Implement NestedCVEvaluator class
- Outer CV: 5-fold stratified by (domain × difficulty)
- Inner CV: 3-fold grid search over hyperparameters
- Temporary vector DB creation per fold
- Metrics computation on each outer fold
Phase 4: Hyperparameter Tuning (Week 4)
- Run full nested CV (5 outer × 3 inner folds = 15 inner train/validation splits, each searched over the full parameter grid)
- Collect best hyperparameters per fold
- Identify most common optimal parameters
- Compute mean ± std generalization performance (aggregation sketched after this list)
- Compare to baseline
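Aggregating the outer folds could be as simple as the following, reusing the return values of the nested_cv sketch above:

```python
from collections import Counter

mean_auroc, std_auroc, chosen_params = nested_cv(df, param_grid)

# Most common optimal value per hyperparameter across the 5 outer folds
consensus = {
    name: Counter(p[name] for p in chosen_params).most_common(1)[0][0]
    for name in chosen_params[0]
}
print(f"Generalization AUROC: {mean_auroc:.3f} ± {std_auroc:.3f}")
print("Consensus hyperparameters:", consensus)
```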
Phase 5: Final Model & Deployment (Week 5)
- Train final model on ALL 32K questions with best hyperparameters
- Re-index full vector database
- Deploy to MCP server and HTTP facade
- Test with Claude Desktop
Phase 6: OOD Testing (Week 6)
- Create OOD test sets:
- Adversarial: "Prove false premises", jailbreaks
- Domain Shift: Creative writing, coding, real user queries
- Temporal: New benchmarks (2024+)
- Evaluate on each OOD set
- Analyze performance degradation vs. in-distribution
Phase 7: Iteration & Documentation (Week 7)
- Analyze failures on OOD sets
- Add new heuristics for missed patterns
- Re-run nested CV with updated features
- Generate calibration plots (reliability diagrams; sketched after this list)
- Write technical report
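A reliability diagram can be drawn directly from held-out predictions with matplotlib, using the same 10-bin scheme as the ECE computation above:

```python
import numpy as np
import matplotlib.pyplot as plt

def reliability_diagram(y_true, y_prob, n_bins=10, path="reliability.png"):
    """Plot mean predicted risk vs. observed risky fraction per probability bin."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)

    confidence, accuracy = [], []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            confidence.append(y_prob[mask].mean())   # mean predicted risk in the bin
            accuracy.append(y_true[mask].mean())     # observed fraction actually risky

    plt.figure(figsize=(4, 4))
    plt.plot([0, 1], [0, 1], "k--", label="perfect calibration")
    plt.plot(confidence, accuracy, "o-", label="ToGMAL")
    plt.xlabel("Predicted risk")
    plt.ylabel("Observed fraction risky")
    plt.legend()
    plt.savefig(path, bbox_inches="tight")
```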
Expected Improvements
Based on OOD detection literature and nested CV best practices:
Adaptive scoring: +5-15% AUROC on low-similarity cases
- Baseline: ~0.75 AUROC (naive weighted average)
- Target: ~0.85+ AUROC (adaptive with uncertainty)
Nested CV: Honest, robust performance estimates
- Simple split: Single point estimate (could be lucky/unlucky)
- Nested CV: Mean ± std across 5 folds
Domain calibration: 10-20% fewer false positives
- Expected: FPR@TPR95 drops from ~0.25 to ~0.15
Multi-signal fusion: Better edge case detection
- Combine vector similarity + rule-based heuristics
- Improved recall on adversarial examples
Calibration: ECE < 0.05
- Better alignment between predicted risk and actual difficulty
Validation Checklist (Before Production Deploy)
- Nested CV completed with no data leakage
- Hyperparameters tuned on inner CV folds only
- Generalization performance estimated on outer CV folds
- OOD sets tested (adversarial, domain-shift, temporal)
- Calibration error within acceptable range (ECE < 0.1)
- Failure modes documented with specific examples
- Ablation studies show each component contributes
- Performance: adaptive > baseline on all metrics
- Real-world testing with user queries
Quick Start Command
See togmal_improvement_plan.md for full implementation details including:
- Complete code for the NestedCVEvaluator class
- Adaptive scoring implementation
- All evaluation metrics with examples
- Detailed roadmap with weekly milestones
Next Action: Implement adaptive scoring in benchmark_vector_db.py and test with edge cases.