Domain: K12 math difficulty calibration and intelligent exam matching
Researched: 2026-04-16
Confidence: HIGH (based on direct codebase analysis); MEDIUM (general adaptive testing patterns from training data)
ANSWER FLOW (already working)
============================
Student answers exam
|
v
ExamAnswerAnalysisService.analyzeExamAnswers()
|
+---> MasteryCalculator (knowledge point mastery)
+---> KnowledgeMasteryService (persist mastery)
+---> LocalAIAnalysisService (update mastery)
+---> MistakeBookService (add to mistake book)
+---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
|
v
question_difficulty_calibrations table (upsert)
ASSEMBLY FLOW (partially working)
================================
POST /api/intelligent-exam
|
v
IntelligentExamController.store()
|
v
AssembleExamTaskJob (queued)
|
v
LearningAnalyticsService.generateIntelligentExam()
|
+---> selectQuestions() -- uses raw questions.difficulty
+---> applyTypeAwareDifficultyDistribution()
| |
| v
| DifficultyDistributionService
| (only if enable_difficulty_distribution=true,
| which defaults to FALSE)
|
v
QuestionDifficultyResolver.applyCalibratedDifficulty()
(exists but NOT called in the main assembly path)
| Gap | Location | Impact |
|---|---|---|
| Calibration values not used in assembly | LearningAnalyticsService selects questions using raw questions.difficulty | Assembled exams use uncalibrated difficulty |
| enable_difficulty_distribution defaults false | LearningAnalyticsService line 1554 | Distribution strategy never activates unless the caller explicitly enables it |
| No auto difficulty_category recommendation | No service maps mastery to category | Teachers must manually pick a tier; no student-level adaptation |
| No backtesting validation | QuestionDifficultyCalibrationAnalyzer reports but does not validate | Algorithm accuracy unknown before production use |
| Dual difficulty scale (0-1 vs 0-5) | normalizeDifficultyValue() divides by 5 if > 1.0 | Inconsistent source data enters calibration |
+===================================================================================+
| TARGET ARCHITECTURE |
+===================================================================================+
LAYER 1: VALIDATION (must complete before anything else)
+--------------------------------------------------------+
| |
| CalibrationBacktestService |
| |-- backtestAgainstHistory(cutoffDate) |
| |-- computeBrierScores(questionIds) |
| |-- computePearsonCorrelation() |
| |-- produceValidationReport() |
| +-- PASS/FAIL gate: algo accuracy threshold |
| |
| Data: reads paper_questions + questions (historical) |
| Writes: backtest_results table (or JSON export) |
+--------------------------------------------------------+
|
| PASS gate opens production use
v
LAYER 2: CALIBRATION FEEDBACK LOOP (enhance existing)
+--------------------------------------------------------+
| |
| ExamAnswerAnalysisService |
| |-- (existing) analyzeExamAnswers() |
| +-- (existing) recalibrateQuestionDifficulty() |
| |
| QuestionDifficultyCalibrationService |
| |-- updateOnlineFromPaper() [existing, per-paper] |
| |-- recalibrateQuestionIds() [existing, batch] |
| +-- getHealthScaleForType() [existing, monitoring]|
| |
| NEW: CalibrationVerificationGate |
| |-- validateCalibratedRange(questionIds) |
| |-- flagOutliers(threshold) |
| +-- quarantineBadCalibrations() |
| |
| Data: answer --> calibrate --> verify --> use |
+--------------------------------------------------------+
|
v
LAYER 3: ASSEMBLY INTEGRATION (connect calibration to exam)
+--------------------------------------------------------+
| |
| DifficultyNormalizationService [NEW] |
| |-- normalize(questionId) -> float [0,1] |
| |-- batchNormalize(questionIds) -> map |
| +-- resolves 0-1 vs 0-5 ambiguity at read time |
| |
| QuestionDifficultyResolver [existing, expand usage] |
| |-- applyCalibratedDifficulty(questions) -> arr |
| +-- MUST be called in assembly path |
| |
| LearningAnalyticsService |
| |-- generateIntelligentExam() |
| | +-- CALL DifficultyNormalizationService first |
| | +-- CALL QuestionDifficultyResolver second |
| | +-- SET enable_difficulty_distribution = true |
| +-- remove hard-coded default false |
| |
| DifficultyDistributionService [existing] |
| |-- calculateDistribution(category, total) |
| +-- groupQuestionsByDifficultyRange() |
+--------------------------------------------------------+
|
v
LAYER 4: ADAPTIVE MATCHING (mastery-based difficulty selection)
+--------------------------------------------------------+
| |
| DifficultyCategoryRecommender [NEW] |
| |-- recommendForStudent(studentId, kpCodes) -> cat |
| |-- recommendForKnowledgePoint(studentId, kp) -> cat|
| +-- uses MasteryCalculator + calibration data |
| |
| MasteryCalculator [existing] |
| |-- calculateMasteryLevel(studentId, kpCode) |
| +-- returns mastery [0,1] + confidence + trend |
| |
| Mapping logic: |
| mastery [0.0, 0.30) -> category 0 (zero-foundation) |
| mastery [0.30, 0.50) -> category 1 (foundation) |
| mastery [0.50, 0.70) -> category 2 (intermediate) |
| mastery [0.70, 0.85) -> category 3 (advanced) |
| mastery [0.85, 1.00) -> category 4 (competition) |
+--------------------------------------------------------+
|
v
LAYER 5: HEALTH MONITORING (continuous)
+--------------------------------------------------------+
| |
| CalibrationHealthMonitor [NEW] |
| |-- detectDrift(windowDays) -> drift report |
| |-- accuracyTrend(days) -> accuracy over time |
| |-- calibrationCoverage() -> % questions calibrated |
| +-- scheduled artisan command (daily/weekly) |
| |
| Existing health mechanisms: |
| |-- getHealthScaleForType() in CalibrationService |
| +-- recent_events in algorithm_meta (per-question) |
| |
| NEW: calibration_health_snapshots table |
| |-- date, total_calibrated, avg_brier, |
| | coverage_pct, drift_flag, action |
+--------------------------------------------------------+
| Component | Responsibility | Communicates With | New/Existing |
|---|---|---|---|
| CalibrationBacktestService | Validate algorithm accuracy against historical data | Reads paper_questions, questions, papers. Writes report output. | NEW |
| QuestionDifficultyCalibrationService | Core calibration algorithm (stratified_residual_eb_v2) | Called by ExamAnswerAnalysisService, CalibrationBacktestService | EXISTING |
| CalibrationVerificationGate | Post-calibration sanity checks (range, outlier detection) | Reads question_difficulty_calibrations. Flags problematic entries. | NEW |
| DifficultyNormalizationService | Unify 0-1 / 0-5 scale at read boundary | Called by LearningAnalyticsService during question loading | NEW |
| QuestionDifficultyResolver | Apply calibrated difficulty to question arrays, calibrated-first | Called in assembly path by LearningAnalyticsService | EXISTING (needs wiring) |
| DifficultyDistributionService | Calculate difficulty buckets per category | Called by LearningAnalyticsService when distribution enabled | EXISTING |
| DifficultyCategoryRecommender | Map student mastery to recommended difficulty category | Reads MasteryCalculator. Used by IntelligentExamController | NEW |
| MasteryCalculator | Calculate per-knowledge-point mastery levels | Existing, unchanged | EXISTING |
| CalibrationHealthMonitor | Detect calibration drift, coverage gaps, accuracy degradation | Reads question_difficulty_calibrations. Writes health snapshots. | NEW |
| LearningAnalyticsService | Orchestrate question selection and difficulty distribution | Must call normalization + resolver + distribution | EXISTING (needs modification) |
QUESTION DIFFICULTY LIFECYCLE
=============================
questions.difficulty (original, immutable)
|
v
DifficultyNormalizationService.normalize()
| (resolves 0-1 vs 0-5, stores original_difficulty)
v
question_difficulty_calibrations.original_difficulty
|
+---[calibration loop]---> calibrated_difficulty
| |
v v
QuestionDifficultyResolver.applyCalibratedDifficulty()
|
| Returns: calibrated if exists, else original (normalized)
v
DifficultyDistributionService.groupQuestionsByDifficultyRange()
|
v
Selected questions for exam assembly
ANSWER-TO-CALIBRATION FEEDBACK LOOP
====================================
Student submits exam answers
|
v
ExamAnswerAnalysisService.analyzeExamAnswers()
|
+---> MasteryCalculator (update knowledge mastery)
+---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
|
+-- per-question: compute residual, apply shrinkage, clamp
+-- upsert to question_difficulty_calibrations
+-- append to recent_events in algorithm_meta
|
v
CalibrationVerificationGate (NEW)
|
+-- check calibrated_difficulty in [0.01, 0.99]
+-- flag if delta > 0.30 from original
+-- quarantine if Brier score deteriorating
|
v
Health monitor caches invalidated
(getHealthScaleForType will recompute next call)
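A minimal sketch of the NEW CalibrationVerificationGate, assuming the check thresholds above ([0.01, 0.99] range, 0.30 delta) and an already-computed Brier trend; the method shape and names are illustrative, not final:

class CalibrationVerificationGate
{
    private const MIN_DIFFICULTY = 0.01;
    private const MAX_DIFFICULTY = 0.99;
    private const MAX_DELTA      = 0.30;

    // Returns 'ok', 'flagged', or 'quarantined' for one upserted calibration.
    public function verify(float $calibrated, float $original, ?float $brierTrend): string
    {
        // Hard range check: out-of-range values never reach assembly.
        if ($calibrated < self::MIN_DIFFICULTY || $calibrated > self::MAX_DIFFICULTY) {
            return 'quarantined';
        }
        // A worsening Brier trend (positive slope) means predictions are
        // degrading; quarantine until a human reviews.
        if ($brierTrend !== null && $brierTrend > 0) {
            return 'quarantined';
        }
        // A large jump from the original value is suspicious but not fatal.
        if (abs($calibrated - $original) > self::MAX_DELTA) {
            return 'flagged';
        }
        return 'ok';
    }
}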
BACKTESTING VALIDATION FLOW
============================
CalibrationBacktestService.backtestAgainstHistory(cutoffDate)
|
v
1. Load all questions with >= N attempts before cutoffDate
2. Split: training set (before cutoff) vs test set (after cutoff)
3. Run calibration on training data only
4. For each question in test set:
- predicted = calibrated_difficulty from training
- actual = observed error rate in test period
- Brier score = (predicted - actual)^2
5. Aggregate metrics:
- Mean Brier score (lower = better; < 0.15 is the target)
- Pearson correlation (predicted vs actual; > 0.4 is the target)
- Calibration coverage (% of questions with enough data)
- MAE (mean absolute error; < 0.15 is the target)
6. PASS gate (minimum bar, looser than the targets above):
- Pearson > 0.3 AND Mean Brier < 0.20
- If FAIL: the algorithm needs tuning; do NOT enable in production
|
v
Report: JSON/CSV output + PASS/FAIL verdict
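A sketch of the metric aggregation in step 5, under the assumption that predicted difficulties and observed test-period error rates arrive as parallel non-empty arrays (helper name hypothetical):

function backtestMetrics(array $predicted, array $actual): array
{
    $n = count($predicted);
    $brier = 0.0;
    $mae = 0.0;
    foreach ($predicted as $i => $p) {
        $brier += ($p - $actual[$i]) ** 2;  // per-question Brier score
        $mae   += abs($p - $actual[$i]);    // per-question absolute error
    }
    $brier /= $n;
    $mae   /= $n;

    // Pearson correlation between predicted difficulty and observed error rate.
    $meanP = array_sum($predicted) / $n;
    $meanA = array_sum($actual) / $n;
    $cov = $varP = $varA = 0.0;
    foreach ($predicted as $i => $p) {
        $cov  += ($p - $meanP) * ($actual[$i] - $meanA);
        $varP += ($p - $meanP) ** 2;
        $varA += ($actual[$i] - $meanA) ** 2;
    }
    $pearson = $cov / max(sqrt($varP * $varA), 1e-9);

    return [
        'mean_brier' => $brier,
        'mae'        => $mae,
        'pearson'    => $pearson,
        'pass'       => $pearson > 0.3 && $brier < 0.20, // minimum bar from step 6
    ];
}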
MASTERY-TO-DIFFICULTY MATCHING FLOW
====================================
Exam request (student_id + kp_codes)
|
v
DifficultyCategoryRecommender.recommendForStudent()
|
+-- for each kp_code:
| MasteryCalculator.calculateMasteryLevel(studentId, kp)
| -> mastery [0,1], confidence, trend
|
+-- aggregate mastery across kp_codes (weighted average)
+-- map to category:
| mastery -> category via threshold table
| adjust for trend: trending up -> +0.5 category push
| floor at 0, cap at 4
|
+-- return: recommended category + confidence + reasoning
|
v
IntelligentExamController uses recommended category
|
v
LearningAnalyticsService with enable_difficulty_distribution=true
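A sketch of the threshold mapping inside DifficultyCategoryRecommender, using the Layer 4 boundaries and the +0.5 trend push described above (function name hypothetical):

function mapMasteryToCategory(float $mastery, int $trend): int
{
    $category = match (true) {
        $mastery < 0.30 => 0,  // zero-foundation
        $mastery < 0.50 => 1,  // foundation
        $mastery < 0.70 => 2,  // intermediate
        $mastery < 0.85 => 3,  // advanced
        default         => 4,  // competition
    };
    // Trending up pushes half a category; round, then floor at 0 / cap at 4.
    $adjusted = $category + ($trend > 0 ? 0.5 : 0.0);
    return (int) max(0, min(4, round($adjusted)));
}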
What: A calibration value must pass validation before it can influence production behavior. Once validated, components progressively unlock.
When: Any system where unvalidated statistical estimates would harm user experience.
Why this matters here: The project explicitly requires "validation before production use." The current code already has calibration running, but it is not connected to assembly. This is correct; the backtest gate formalizes the transition.
Gate states:
LOCKED -- calibration runs, values stored, NOT used in assembly
TESTED -- backtest passed, enable for shadow mode (log but don't act)
ACTIVE -- fully enabled in production assembly path
Example implementation:
// config/exam.php (file name illustrative) or a database-backed setting
'calibration_gate' => 'locked', // locked | tested | active

// In LearningAnalyticsService, during assembly:
if (config('exam.calibration_gate') === 'active') {
    $questions = $resolver->applyCalibratedDifficulty($questions);
}
What: When multiple difficulty values exist for a question, follow a deterministic priority chain rather than ad-hoc logic.
When: Any lookup where calibrated, original, and estimated values coexist.
Why: The current QuestionDifficultyResolver already implements this pattern correctly (calibrated > original). It just needs to be consistently called.
// Priority chain (already implemented in QuestionDifficultyResolver):
// 1. calibrated_difficulty (from question_difficulty_calibrations)
// 2. normalized questions.difficulty (0-1 scale, divide-by-5 if needed)
// 3. fallback 0.5 (moderate default)
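Restated as a minimal sketch (the real implementation lives in QuestionDifficultyResolver; this only illustrates the three steps):

function resolveDifficulty(array $question, array $calibrations): float
{
    $id = $question['id'];

    // 1. Calibrated value wins when one exists.
    if (isset($calibrations[$id]['calibrated_difficulty'])) {
        return (float) $calibrations[$id]['calibrated_difficulty'];
    }
    // 2. Normalized original: values on the 0-5 scale are divided by 5.
    if (isset($question['difficulty'])) {
        $d = (float) $question['difficulty'];
        return $d > 1.0 ? $d / 5.0 : $d;
    }
    // 3. Moderate default when nothing is known.
    return 0.5;
}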
What: Before enabling calibrated difficulty in actual exam assembly, run both paths in parallel and compare results without affecting output.
When: Connecting a validated but previously-disconnected statistical system to production.
Why: Even with backtest validation, real-time behavior may differ from the historical backtest. Shadow mode catches integration bugs.
// In LearningAnalyticsService assembly:
$rawQuestions = $selectedQuestions; // current behavior
$calibratedQuestions = $resolver->applyCalibratedDifficulty($rawQuestions);

// Log the comparison without using calibrated values yet
Log::info('Shadow mode difficulty comparison', [
    'raw_avg'        => collect($rawQuestions)->avg('difficulty'),
    'calibrated_avg' => collect($calibratedQuestions)->avg('difficulty'),
    'diff_count'     => count(array_filter($calibratedQuestions, fn ($q) =>
        ($q['difficulty_source'] ?? '') === 'calibrated'
    )),
]);

// Use raw (unchanged behavior) until the gate opens
$selectedQuestions = $rawQuestions;
What: Already implemented in the codebase. The calibration algorithm computes expected error rates per (question_type, difficulty_category) stratum, then adjusts based on the residual (observed - expected).
When: This is the core calibration algorithm. No changes needed to the algorithm itself per project scope.
The existing algorithm is well-structured:
- buildGlobalBaselines() computes per-stratum error rates
- estimateOnlineBySingleOutcome() processes one answer event
- estimateByStratifiedResidual() processes historical data
- getHealthScaleForType() auto-reduces step size when degrading

What: Weight recent observations more heavily than old ones using exponential decay. Already implemented with a 45-day half-life.
When: Any aggregation of student performance or calibration data.
Why: K12 students improve; old responses are less predictive. The existing 45-day half-life is reasonable.
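A sketch of the 45-day half-life weighting described above (function names illustrative; the real logic lives in the calibration service):

function decayWeight(int $ageDays, float $halfLifeDays = 45.0): float
{
    // Weight halves every 45 days: w = 0.5^(age / half_life).
    return pow(0.5, $ageDays / $halfLifeDays);
}

// Decay-weighted error rate over a set of answer events.
function weightedErrorRate(array $events): float
{
    $num = $den = 0.0;
    foreach ($events as $e) {
        $w = decayWeight($e['age_days']);
        $num += $w * ($e['correct'] ? 0.0 : 1.0); // 1 = error
        $den += $w;
    }
    return $den > 0 ? $num / $den : 0.5; // moderate default when empty
}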
What: Writing calibrated values back to questions.difficulty.
Why bad: Destroys the original reference value, makes debugging impossible, and violates the project constraint.
Instead: Keep the dual-table design. questions.difficulty stays immutable; question_difficulty_calibrations is the mutable overlay.
What: Computing one difficulty_category for a student across all knowledge points and applying it everywhere.
Why bad: A student may be advanced in algebra but a beginner in geometry. A global category creates mismatched exams.
Instead: Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation. Aggregate only when an exam spans multiple knowledge points.
What: Wiring calibration directly into the assembly path without the backtest validation step.
Why bad: If the algorithm has systematic bias (e.g., it always overestimates difficulty for certain question types), it makes exams worse, not better.
Instead: Backtest first. The backtest is a prerequisite gate, not an optional report.
What: Mixing 0-5 scale difficulty values with 0-1 scale values in the same computation.
Why bad: A 0-5 value of 0.4 (easy) gets treated as 0.4 on 0-1 scale (hard), producing inverted difficulty estimates.
Instead: Normalize at the read boundary. The existing normalizeDifficultyValue() in QuestionDifficultyCalibrationService handles this for calibration input, but LearningAnalyticsService does not normalize when loading questions for assembly. This must be fixed.
What: Allowing calibration to update on every single answer without any dampening.
Why bad: A single anomalous cohort (e.g., a class that all guesses randomly) can corrupt calibration values.
Instead: The existing algorithm handles this well with shrinkage (the prior pulls toward the original), step limits, minimum sample thresholds, and health scaling. Do not remove these safeguards.
| Concern | At 100 questions | At 10K questions | At 100K questions |
|---|---|---|---|
| Calibration table size | Negligible | ~10K rows, fast with index on question_bank_id | ~100K rows; add composite index on (calibrated_difficulty, updated_at) |
| Backtest computation | < 1 second | 5-30 seconds depending on attempt count | Minutes; run as queued job, cache results |
| Per-answer calibration | < 10ms (single upsert) | < 10ms | < 10ms (indexed lookup + single upsert) |
| Health monitoring | Negligible (scans recent rows) | 1-5 seconds (parsing algorithm_meta JSON) | 5-30 seconds; extract health metrics to dedicated columns |
| Mastery-to-category recommendation | < 50ms (1 mastery lookup) | < 50ms | < 100ms (batch mastery lookup for multiple kp_codes) |
| applyCalibratedDifficulty batch | < 5ms | < 20ms (WHERE IN) | < 100ms; add chunking for > 1000 IDs |
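A sketch of the chunking suggested in the last row, assuming an Eloquent model named QuestionDifficultyCalibration over question_difficulty_calibrations (model name is an assumption):

$calibrations = [];
foreach (array_chunk($questionIds, 1000) as $chunk) { // keep WHERE IN lists bounded
    $calibrations += QuestionDifficultyCalibration::query()
        ->whereIn('question_bank_id', $chunk)
        ->pluck('calibrated_difficulty', 'question_bank_id')
        ->all(); // array union preserves question IDs as keys
}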
Phase 1: VALIDATION (must be first -- blocks everything else)
1.1 CalibrationBacktestService
- Reads historical data
- Computes Pearson, Brier, MAE
- Produces PASS/FAIL report
DEPENDS ON: existing QuestionDifficultyCalibrationService, existing data
BLOCKS: Phase 2, 3, 4
Phase 2: DIFFICULTY STANDARDIZATION (no behavioral change)
2.1 DifficultyNormalizationService
- Extract and centralize normalizeDifficultyValue() logic
- Apply at question-loading boundary in LearningAnalyticsService
DEPENDS ON: nothing new
BLOCKS: Phase 3 (need consistent scale before using calibration)
Phase 3: ASSEMBLY INTEGRATION (wires calibration into production)
3.1 Wire QuestionDifficultyResolver into LearningAnalyticsService
- Call applyCalibratedDifficulty() in the assembly path
- Enable difficulty_distribution by default
- Add shadow mode logging first, then activate
3.2 CalibrationVerificationGate
- Post-upsert sanity checks
- Outlier quarantine
DEPENDS ON: Phase 1 (PASS gate), Phase 2 (consistent scale)
BLOCKS: Phase 4
Phase 4: ADAPTIVE MATCHING (new feature)
4.1 DifficultyCategoryRecommender
- Map mastery -> category
- Per-kp and aggregate recommendations
4.2 Wire into IntelligentExamController
- Auto-fill difficulty_category when not specified
DEPENDS ON: Phase 3 (calibrated assembly working)
Phase 5: HEALTH MONITORING (ongoing)
5.1 CalibrationHealthMonitor
- Scheduled artisan command
- Drift detection, coverage tracking
- calibration_health_snapshots table
5.2 Alert logic
- Flag when Brier degrades, coverage drops, drift detected
DEPENDS ON: Phase 3 (need production calibration data flowing)
Phase 1 (Validation) ---------------+
                                    |
                                    v
Phase 2 (Standardization) ----> Phase 3 (Assembly Integration)
                                    |
                                    v
                                Phase 4 (Adaptive Matching)
                                    |
                                    v
                                Phase 5 (Health Monitoring)
Phase 2 can run in parallel with Phase 1 since it does not depend on validation results. It only centralizes existing normalization logic. Phase 3 requires both Phase 1 PASS and Phase 2 completion. Phases 4 and 5 are sequential after Phase 3.
calibration_health_snapshots:

CREATE TABLE calibration_health_snapshots (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
snapshot_date DATE NOT NULL,
total_questions INT UNSIGNED DEFAULT 0,
calibrated_count INT UNSIGNED DEFAULT 0,
coverage_pct DECIMAL(5,2) DEFAULT 0,
avg_brier_score DECIMAL(8,6) DEFAULT NULL,
avg_logloss DECIMAL(8,6) DEFAULT NULL,
pearson_correlation DECIMAL(8,4) DEFAULT NULL,
mean_abs_residual DECIMAL(8,4) DEFAULT NULL,
health_scale_avg DECIMAL(5,3) DEFAULT NULL,
drift_flag TINYINT(1) DEFAULT 0,
drift_details JSON DEFAULT NULL,
action VARCHAR(32) DEFAULT 'none',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY idx_snapshot_date (snapshot_date)
);
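The daily snapshot can be driven from the Laravel scheduler; a sketch assuming a command named calibration:health-snapshot (name not final):

// app/Console/Kernel.php
use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    // Writes one calibration_health_snapshots row per day.
    $schedule->command('calibration:health-snapshot')->dailyAt('02:00');
}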
backtest_results:

CREATE TABLE backtest_results (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
run_id VARCHAR(64) NOT NULL,
cutoff_date DATE NOT NULL,
question_bank_id BIGINT UNSIGNED NOT NULL,
training_attempts INT UNSIGNED DEFAULT 0,
test_attempts INT UNSIGNED DEFAULT 0,
predicted_difficulty DECIMAL(6,4) DEFAULT NULL,
observed_error_rate DECIMAL(6,4) DEFAULT NULL,
brier_score DECIMAL(8,6) DEFAULT NULL,
absolute_error DECIMAL(6,4) DEFAULT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_run_id (run_id),
INDEX idx_cutoff (cutoff_date)
);
No changes needed to existing tables. The question_difficulty_calibrations table schema is sufficient.
The calibrated difficulty is an overlay on top of the original difficulty. The QuestionDifficultyResolver already implements this correctly: calibrated value takes priority, original value is the fallback. This must remain the design. Never write calibrated values back to questions.difficulty.
Use a deterministic gate (backtest PASS/FAIL) rather than a manual feature flag to enable calibration in production. The gate should be an artisan command that sets a config value or database flag after validation passes. This prevents human error from enabling an unvalidated algorithm.
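A hypothetical artisan command implementing this; the command name, service namespace, report shape, and the setting() helper are all assumptions to adapt to the codebase:

use App\Services\CalibrationBacktestService;
use Illuminate\Console\Command;

class OpenCalibrationGate extends Command
{
    protected $signature = 'calibration:open-gate {--cutoff=}';
    protected $description = 'Run the backtest and unlock calibration only on PASS';

    public function handle(CalibrationBacktestService $backtest): int
    {
        $report = $backtest->backtestAgainstHistory($this->option('cutoff'));

        if (empty($report['pass'])) {
            $this->error('Backtest FAILED; gate stays LOCKED.');
            return self::FAILURE;
        }
        // Persist the gate state (settings table or cached config assumed).
        setting(['calibration_gate' => 'tested']); // shadow mode before ACTIVE
        $this->info('Backtest PASSED; gate moved to TESTED (shadow mode).');
        return self::SUCCESS;
    }
}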
When recommending difficulty_category for a student, compute per-knowledge-point mastery and map each to a category. If an exam covers multiple knowledge points, use the weighted average of their recommended categories, weighted by the student's weakness level (weaker knowledge points get more weight to avoid overwhelming the student).
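A sketch of that weakness-weighted aggregation, assuming per-knowledge-point inputs of the form ['mastery' => float, 'category' => int]:

function aggregateRecommendedCategory(array $perKp): int
{
    $weightedSum = 0.0;
    $weightTotal = 0.0;
    foreach ($perKp as $kp) {
        $weight = 1.0 - $kp['mastery'];          // weaker point => heavier weight
        $weightedSum += $kp['category'] * $weight;
        $weightTotal += $weight;
    }
    if ($weightTotal <= 0.0) {
        return 4; // every point fully mastered; top tier applies
    }
    return (int) max(0, min(4, round($weightedSum / $weightTotal)));
}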
The existing getHealthScaleForType() already provides inline health adjustment. The new CalibrationHealthMonitor serves a different purpose: longitudinal tracking and alerting. It should NOT modify calibration behavior directly; instead, it produces reports that humans review to decide if algorithm parameters need adjustment.
When validating the calibration algorithm, split data by time (cutoff date) rather than randomly. This is critical because:
- It mirrors production: at assembly time, calibration can only ever use past answers to predict future performance.
- A random split leaks future information into training, inflating the Brier and Pearson metrics and defeating the PASS gate.
Source: question_difficulty_calibrations migration schema (HIGH confidence)