Domain: K12 math difficulty calibration and intelligent exam matching
Researched: 2026-04-16
Confidence: HIGH (codebase analysis); MEDIUM (platform feature comparisons based on web research)
Features the system must have. Without these, difficulty calibration is meaningless and exam assembly produces poor matches.
| # | Feature | Why Expected | Complexity | Notes |
|---|---|---|---|---|
| TS-1 | Historical data backtesting with temporal split | Every validated adaptive system (ALEKS, Khan Academy, IXL) validates its parameter estimation against held-out data before production use. Without this, the entire calibration pipeline is an unverified hypothesis. | Medium | QuestionDifficultyCalibrationAnalyzer already computes per-question stats and Pearson correlation. Needs: temporal train/test split, aggregated Brier score, MAE, PASS/FAIL gate. The data (paper_questions, questions, papers) already exists. |
| TS-2 | Difficulty scale normalization (0-1 vs 0-5 unification) | The codebase has normalizeDifficultyValue() in QuestionDifficultyCalibrationService for calibration input, but LearningAnalyticsService loads raw questions.difficulty without normalization during assembly. This is a correctness bug, not a nice-to-have: a value of 0.4 on the 0-5 scale denotes a very easy question (0.08 normalized), but read as a 0-1 value it becomes 40% difficulty, a fivefold distortion. | Low | Extract and centralize the existing normalization logic. Apply it at the question-loading boundary in LearningAnalyticsService. One new service class, no algorithm changes. |
| TS-3 | Calibrated difficulty used in exam assembly | The calibration algorithm (stratified_residual_eb_v2) runs on every grading event (updateOnlineFromPaper), writes to question_difficulty_calibrations, and QuestionDifficultyResolver.applyCalibratedDifficulty() exists -- but it is not called in the main assembly path (LearningAnalyticsService.generateIntelligentExam). The calibration loop is complete but disconnected. | Low | Wire the existing QuestionDifficultyResolver call into LearningAnalyticsService.selectQuestions(). The resolver already handles the fallback chain (calibrated > normalized original). Requires the TS-1 PASS gate to be open first. |
| TS-4 | Difficulty distribution enabled by default | enable_difficulty_distribution defaults to false in LearningAnalyticsService (line 1554); only ExamTypeStrategy sets it to true, so the main API path never activates distribution. Without it, all questions are selected by raw difficulty matching, bypassing the tiered low/medium/high bucket strategy. | Low | Change the default to true once TS-1 and TS-2 are complete. The DifficultyDistributionService logic is fully implemented and tested. |
| TS-5 | Calibration coverage metrics | Teachers and admins need to know what percentage of questions have calibrated values. ALEKS reports item parameter coverage; IXL shows diagnostic completeness. If only 20% of questions have calibration data, the system should not claim to use calibrated difficulty. | Low | Count question_difficulty_calibrations rows vs total active questions. Add to the existing AnalyzeQuestionDifficultyCalibrationCommand output. No new infrastructure needed. |
| TS-6 | Per-question calibration confidence indicator | IRT systems report a standard error of estimation per item. The existing algorithm_meta stores recent_events with per-event Brier/log-loss, and the analyzer computes calibration_effective_attempts (a time-decay-weighted sample size). These need to surface as a single per-question confidence metric. | Low | Derive from existing data: effective_attempts >= 10 = HIGH confidence, 5-9 = MEDIUM, < 5 = LOW. Already computed in the analyzer; it only needs a reusable service method and API exposure. |
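The TS-1 gate and the TS-2 normalization boundary reduce to a small amount of arithmetic once the answer events are loaded. A minimal Python sketch of that logic (the production code is a PHP/Laravel service, so this is illustrative only; the `backtest` signature, the scale-detection heuristic, and the PASS/FAIL thresholds are assumptions, not the project's real gate values):

```python
from collections import defaultdict
from statistics import mean

def normalize_difficulty(value, scale_max=5.0):
    """TS-2 sketch: map a raw difficulty onto [0, 1].

    Values already in [0, 1] pass through; anything larger is assumed
    to be on the 0-5 scale and rescaled. Heuristic only -- the real
    normalizeDifficultyValue() may rely on explicit scale metadata.
    """
    if 0.0 <= value <= 1.0:
        return value
    return max(0.0, min(1.0, value / scale_max))

def backtest(events, predicted, split_ts, pass_brier=0.25, pass_mae=0.15):
    """TS-1 sketch: score predictions on a temporal hold-out.

    events:    (timestamp, question_id, was_wrong) tuples
    predicted: question_id -> predicted p(wrong), i.e. calibrated difficulty
    Events at or after split_ts form the test slice; earlier events
    would feed the calibrator. Gate thresholds here are illustrative.
    """
    held_out = [(q, y) for ts, q, y in events if ts >= split_ts and q in predicted]
    # Event-level Brier score: mean squared gap between p(wrong) and outcome.
    brier = mean((predicted[q] - y) ** 2 for q, y in held_out)
    # Question-level MAE: predicted difficulty vs observed wrong-rate.
    outcomes = defaultdict(list)
    for q, y in held_out:
        outcomes[q].append(y)
    mae = mean(abs(predicted[q] - mean(ys)) for q, ys in outcomes.items())
    verdict = "PASS" if brier <= pass_brier and mae <= pass_mae else "FAIL"
    return {"brier": round(brier, 4), "mae": round(mae, 4), "verdict": verdict}
```

The same aggregation could back the AnalyzeQuestionDifficultyCalibrationCommand output once a temporal split is added.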
Features that set this platform apart from basic question banks. These create the "answer -> calibrate -> precise exam -> re-answer" closed loop.
| # | Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|---|
| D-1 | Mastery-based difficulty category auto-recommendation | Currently difficulty_category (0-4) is passed from the external API call with no validation against student level. ALEKS achieves 90%+ success on "ready-to-learn" concepts by matching difficulty to learner state. Auto-recommending the category from mastery spares the teacher from guessing which tier to assign. | Medium | New DifficultyCategoryRecommender service. Maps mastery (from MasteryCalculator, already per-knowledge-point) to a category via thresholds. Must handle multi-kp exams by weighting weaker knowledge points more heavily. Depends on TS-3 (calibrated difficulty in assembly). |
| D-2 | Shadow mode comparison logging | Before fully activating calibrated difficulty, run both paths (raw vs calibrated) in parallel and log comparisons. This is standard in production ML systems. Reveals real-world impact of calibration without risking exam quality. | Low | Log raw vs calibrated average difficulty, delta distribution, coverage percentage. No behavioral change. Can activate immediately after TS-1 passes. |
| D-3 | Calibration health monitoring dashboard | The existing getHealthScaleForType() monitors Brier/log-loss delta and auto-scales step size (0.45-1.0x), but this data is only in logs. A dashboard showing: overall health scale, per-type Brier trend, coverage percentage, drift alerts, top outlier questions. IXL's "Trouble Spots" report is analogous for student-level analytics. |
Medium | New calibration_health_snapshots table (daily aggregation). Scheduled artisan command to compute daily. Admin panel to visualize. The computation logic already exists in getHealthScaleForType(). |
| D-4 | Calibration drift detection and alerting | Over time, question difficulty can shift (curriculum changes, student population changes). The existing algorithm has time decay (45-day half-life) but no explicit drift detection. Detect when calibrated values systematically diverge from recent observations. | Medium | Compare rolling 7-day vs 30-day average Brier score. If 7-day is significantly worse (>20% increase), flag as drift. The data is already in algorithm_meta.recent_events. Needs: scheduled check, notification mechanism. |
| D-5 | Outlier quarantine with human review | When a question's calibrated difficulty differs from original by more than a threshold (e.g., delta > 0.30), quarantine it for human review rather than silently using the extreme value. This catches edge cases: wrong answer keys, ambiguous questions, data entry errors. | Low | CalibrationVerificationGate checks delta after each calibration update. Flagged questions still get calibrated values but are marked for review. Admin UI shows quarantine list. |
| D-6 | Per-knowledge-point calibration accuracy breakdown | The existing analyzer computes Pearson correlation globally. Breaking this down by knowledge point reveals where calibration works well and where it fails. Some knowledge points may have too few questions or too little variance for calibration to be meaningful. | Medium | Extend QuestionDifficultyCalibrationAnalyzer to group by kp_code. Requires joining questions -> knowledge_points which already exists. Surface in CLI report and API. |
| D-7 | Student-facing difficulty confidence indicator | Show students a visual indicator of how well-matched the exam is to their level. Analogous to Khan Academy's mastery progress bars. "This exam is calibrated for your current level" vs "This exam covers new difficulty territory." | Low | Simple badge based on: was calibrated difficulty used? (difficulty_source='calibrated') AND coverage > 60%? Client-side rendering, backend just provides metadata. |
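The drift rule in D-4 and the quarantine threshold in D-5 are both simple comparisons over data the system already stores (algorithm_meta.recent_events and question_difficulty_calibrations). A hedged Python sketch of the two checks (function names are hypothetical; the 1.20 ratio encodes the ">20% increase" rule from the table and the 0.30 delta is the example threshold):

```python
from statistics import mean

DAY = 86_400  # seconds per day

def detect_drift(brier_events, now, drift_ratio=1.20):
    """D-4 sketch: flag drift when the rolling 7-day mean Brier score
    is more than 20% worse than the 30-day mean.

    brier_events: (timestamp, brier) pairs, e.g. extracted from
    algorithm_meta.recent_events. Tune drift_ratio via backtesting.
    """
    last_7 = [b for ts, b in brier_events if now - ts <= 7 * DAY]
    last_30 = [b for ts, b in brier_events if now - ts <= 30 * DAY]
    if not last_7 or not last_30:
        return False  # too little recent data to judge drift
    return mean(last_7) > drift_ratio * mean(last_30)

def needs_quarantine(original, calibrated, max_delta=0.30):
    """D-5 sketch: flag a question for human review when calibration
    moves its difficulty by more than max_delta on the 0-1 scale."""
    return abs(calibrated - original) > max_delta
```

Both fit naturally in a scheduled command: run the drift check per question type daily, and the quarantine check after each calibration update.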
Features to explicitly NOT build. These would harm the system or waste effort.
| # | Anti-Feature | Why Avoid | What to Do Instead |
|---|---|---|---|
| AF-1 | Write calibrated values back to questions.difficulty | Destroys the immutable original reference value, makes debugging impossible, and violates the explicit project constraint. It also creates data-integrity risk: if calibration goes wrong, the original is gone. | Keep the dual-table design. questions.difficulty stays read-only; question_difficulty_calibrations is the overlay; QuestionDifficultyResolver handles priority. |
| AF-2 | Full IRT 3PL model with discrimination and guessing parameters | The system has a working algorithm (stratified_residual_eb_v2) designed for this specific use case. Switching to 3PL would require (a) reimplementing the entire calibration engine, (b) far more data per question to estimate three parameters, and (c) a guessing parameter that is meaningless for K12 math, where students rarely guess systematically. | Validate the existing algorithm first (TS-1). Only consider algorithm changes if backtesting reveals systematic failure. The existing stratified baseline + residual + Bayesian shrinkage is well-suited. |
| AF-3 | Real-time adaptive testing (CAT) within a single exam | CAT requires: calibrated item pool (we are building this), real-time ability estimation during exam, item selection algorithm, exposure control. This is a fundamentally different product (computer-based testing) from the current paper/worksheet assembly model. Massive scope expansion. | Focus on accurate pre-exam difficulty matching. The exam is assembled once, not adapted mid-session. The "adaptive" element is between exams (mastery changes -> different difficulty category next time). |
| AF-4 | Global difficulty category per student | A student may be category 2 (intermediate) in algebra but category 0 (zero-foundation) in geometry. Assigning one global category produces mismatched exams -- too hard in geometry, too easy in algebra. | Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation (D-1). Aggregate with weakness weighting for multi-kp exams. |
| AF-5 | Calibration without minimum sample size enforcement | Calibrating a question from 1-2 answers produces wildly unreliable estimates. The algorithm already has SHRINKAGE_M0_MIN = 8.0 as a prior strength, but the assembly path should not use calibrated values for questions below a sample threshold. | Enforce minimum effective attempts (e.g., >= 5) before using a calibrated value in assembly; below the threshold, fall back to the normalized original. QuestionDifficultyResolver already has the data to implement this. |
| AF-6 | Automatic algorithm parameter tuning | Automatically adjusting alpha, max_step, and half_life_days based on backtest results sounds appealing, but it risks overfitting to historical data and removes human oversight from a critical algorithmic decision. | Provide backtest reports with parameter sensitivity analysis (run the backtest at 3-5 parameter settings) and let humans make the final tuning decision. |
| AF-7 | Student-level difficulty prediction (this student will get this question wrong) | The calibration system predicts population-level difficulty (what fraction of students will answer incorrectly). Individual prediction requires a student ability model (like IRT theta), which is a separate system. Conflating the two produces unreliable results. | Use mastery (from MasteryCalculator) for individual-level assessment. Use calibration for question-level difficulty estimation. Combine them at exam assembly time, not at prediction time. |
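AF-5's minimum-sample rule and the TS-6 confidence tiers combine naturally in the resolver's fallback chain. A Python sketch of that combined logic (the real QuestionDifficultyResolver API may differ; this signature is hypothetical, and the thresholds come straight from the tables above):

```python
def confidence_tier(effective_attempts):
    """TS-6 sketch: tiers from the table
    (>= 10 -> HIGH, 5-9 -> MEDIUM, < 5 -> LOW)."""
    if effective_attempts >= 10:
        return "HIGH"
    if effective_attempts >= 5:
        return "MEDIUM"
    return "LOW"

def resolve_difficulty(original, calibrated=None, effective_attempts=0.0,
                       min_attempts=5.0, scale_max=5.0):
    """Fallback-chain sketch (TS-3 + AF-5): the calibrated value wins
    only when it exists AND carries enough time-decayed sample weight;
    otherwise fall back to the normalized original. Returns
    (difficulty, source) so callers can log difficulty_source.
    """
    if calibrated is not None and effective_attempts >= min_attempts:
        return calibrated, "calibrated"
    # Normalize 0-5 originals onto the 0-1 scale (TS-2 boundary rule).
    if 0.0 <= original <= 1.0:
        return original, "original_normalized"
    return max(0.0, min(1.0, original / scale_max)), "original_normalized"
```

Returning the source alongside the value is what lets D-2's shadow-mode logging and D-7's student-facing badge work without extra queries.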
```
TS-1 (Backtesting)
  |
  +---> TS-3 (Calibrated difficulty in assembly) [requires PASS gate]
  |       |
  |       +---> TS-4 (Distribution enabled by default) [safe after calibrated assembly]
  |       |
  |       +---> D-1 (Auto difficulty category) [needs calibrated assembly working]
  |       |
  |       +---> D-7 (Student confidence indicator) [needs calibrated assembly]
  |
  +---> D-2 (Shadow mode) [can run in parallel with TS-3]

TS-2 (Scale normalization)
  |
  +---> TS-3 (Calibrated difficulty in assembly) [needs consistent scale]

TS-5 (Coverage metrics) [independent]
TS-6 (Confidence indicator) [independent, uses existing data]
D-3 (Health dashboard) ---> D-4 (Drift detection) [dashboard provides baseline]
D-5 (Outlier quarantine) [independent, can build anytime]
D-6 (Per-kp breakdown) [independent, extends existing analyzer]
```
Critical path: TS-1 and TS-2 (independent) -> TS-3 -> TS-4 -> D-1
This is the minimum path to the closed loop described in the project vision ("answer -> calibrate -> precise exam -> re-answer").
Phase 1 -- Validation and Foundation (must ship first)
Phase 2 -- Production Integration (ship after Phase 1 PASS)
Phase 3 -- Adaptive Intelligence (ship after Phase 2 stable)
Defer:
getHealthScaleForType() provides the raw data; a dashboard is a presentation layer.

How this system's features compare to established platforms:
| Feature | ALEKS | IXL | Khan Academy | This System |
|---|---|---|---|---|
| Difficulty calibration method | Knowledge Space Theory (probabilistic knowledge states) | Proprietary adaptive algorithm | IRT-based mastery | Stratified residual empirical Bayes |
| Validation approach | Billions of data points, research papers | Real-time diagnostic validation | A/B testing, mastery prediction accuracy | Backtesting (to build) |
| Difficulty granularity | Per-topic per-student | Per-skill per-student | Per-exercise per-student | Per-question (calibrated), per-kp per-student (mastery) |
| Adaptive exam assembly | Full CAT (selects next item based on current ability estimate) | Adaptive practice (not exam) | Mastery-based progression | Static assembly with difficulty distribution (pre-exam) |
| Health monitoring | Implicit (built into KST algorithm) | Real-Time Diagnostic, Trouble Spots | Internal accuracy metrics | Health scale (exists), Dashboard (to build) |
| Student-facing feedback | "Ready to learn" / "Not ready" indicators | Diagnostic strand analysis, progress reports | Mastery progress bars, streak counters | Calibration badge (to build) |
| Admin/teacher analytics | Learning progress, knowledge pie chart | Live Classroom, Trouble Spots, reports | Coach dashboard, assignment analytics | CLI report (exists), Dashboard (to build) |
Key insight: This system's unique advantage is the dual-difficulty model (original + calibrated overlay) with an explicit validation gate. No other platform exposes calibration confidence or separates original vs calibrated difficulty. This transparency is a differentiator for institutional customers who need to audit and trust the algorithm.
QuestionDifficultyCalibrationService (974 lines), QuestionDifficultyCalibrationAnalyzer, QuestionDifficultyResolver, DifficultyDistributionService, MasteryCalculator, IntelligentExamController, ExamAnswerAnalysisService, LearningAnalyticsService (HIGH confidence)