
Project Research Summary

Project: Math CMS -- Difficulty Calibration & Intelligent Exam
Domain: K12 educational assessment -- difficulty calibration validation, pipeline wiring, adaptive exam assembly
Researched: 2026-04-16
Confidence: HIGH

Executive Summary

This project is a difficulty calibration refinement for an existing K12 math learning platform built on Laravel. The system already has a working calibration algorithm (stratified_residual_eb_v2), a calibration data pipeline, and an exam assembly engine -- but the calibration loop is disconnected from production. The calibrated difficulty values exist in a separate table and are never used when assembling exams. This means the entire calibration effort runs in isolation, producing values that have zero impact on the exams students receive.

The recommended approach follows a strict gate-based progression: (1) validate the calibration algorithm against held-out historical data using temporal walk-forward backtesting, (2) fix the dual difficulty scale bug (0-1 vs 0-5 values mixed in the same column), (3) wire validated calibration into the exam assembly pipeline with shadow mode comparison before full activation, and (4) build mastery-based difficulty category recommendation on top of the validated foundation. The only new dependency is markrogoyski/math-php ^2.13 for statistical validation metrics (Pearson/Spearman correlation, Brier score decomposition, confidence intervals). Everything else is built on existing code and infrastructure.

The key risks are: circular validation (testing on training data), the dual difficulty scale silently corrupting calibration baselines, and low-sample-size questions dominating results. The research identifies 16 pitfalls total, with 8 rated critical or high severity. The most important mitigation is the temporal train/test split for validation -- without this, the entire effort is built on unverified metrics. A secondary risk is the health monitor degeneracy spiral, where bad initial calibration causes the system to become too cautious to self-correct.

Key Findings

Recommended Stack

The project uses the existing Laravel/PHP/MySQL/Redis stack. The sole addition is markrogoyski/math-php ^2.13, a pure-PHP statistics library providing correlation functions, significance testing, probability distributions, and descriptive statistics needed for backtest validation. No external services, no Python bridges, no IRT frameworks.

Core technologies:

  • Laravel ^12.0 / PHP ^8.2: Existing backend -- no framework changes needed
  • MySQL: All calibration data lives in questions, question_difficulty_calibrations, paper_questions, papers tables
  • math-php ^2.13 (NEW): Statistical validation -- Pearson/Spearman correlation, Brier score decomposition, Beta distribution for credible intervals, ANOVA for cross-type comparison (see the sketch after this list)
  • PHPUnit ^11.5.3: Automated regression tests for calibration accuracy in CI
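
A minimal sketch of how math-php covers the backtest metrics. The input arrays and variable names are invented for illustration; the Brier score is computed directly, since it is just the mean squared error of predicted probabilities against 0/1 outcomes.

```php
<?php

require 'vendor/autoload.php';

use MathPHP\Statistics\Average;
use MathPHP\Statistics\Correlation;

// Illustrative inputs: calibrated difficulty per question vs the observed
// failure rate for the same questions on held-out attempts.
$calibrated   = [0.25, 0.40, 0.55, 0.70, 0.85];
$observedFail = [0.20, 0.35, 0.60, 0.65, 0.90];

// Pearson and Spearman correlation between predicted and observed difficulty.
$pearson  = Correlation::r($calibrated, $observedFail);
$spearman = Correlation::spearmansRho($calibrated, $observedFail);

// Brier score: mean squared error of predicted P(incorrect) against outcomes.
$predicted = [0.25, 0.25, 0.70, 0.70]; // predicted P(incorrect) per attempt
$outcomes  = [0, 1, 1, 0];             // 1 = answered incorrectly
$brier = Average::mean(array_map(
    fn (float $p, int $o): float => ($p - $o) ** 2,
    $predicted,
    $outcomes
));

printf("Pearson %.3f, Spearman %.3f, Brier %.3f\n", $pearson, $spearman, $brier);
```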

Expected Features

Must have (table stakes):

  • TS-1: Historical data backtesting with temporal split -- validates the algorithm before production use; without this, everything else is unverified
  • TS-2: Difficulty scale normalization -- fixes a correctness bug where 0-5 scale values get treated as 0-1 scale values, inverting difficulty for affected questions
  • TS-3: Calibrated difficulty wired into exam assembly -- QuestionDifficultyResolver exists but is never called in the main assembly path
  • TS-4: Difficulty distribution enabled by default -- DifficultyDistributionService is fully implemented but enable_difficulty_distribution defaults to false
  • TS-5: Calibration coverage metrics -- shows what percentage of questions have calibrated values (sketched after this list)
  • TS-6: Per-question calibration confidence indicator -- surfaces sample-size-based confidence levels
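
A minimal sketch of the TS-5 coverage metric, assuming the table names from the Core Technologies list; the join column and the convention that a calibration row's presence means a calibrated value exist only as assumptions about the schema.

```php
use Illuminate\Support\Facades\DB;

// Percentage of questions with at least one calibrated difficulty value.
// Table names follow the Core Technologies section; columns are assumed.
$row = DB::selectOne(
    'SELECT
        COUNT(*) AS total,
        SUM(EXISTS (
            SELECT 1 FROM question_difficulty_calibrations c
            WHERE c.question_id = q.id
        )) AS calibrated
     FROM questions q'
);

$coveragePct = $row->total > 0 ? 100.0 * $row->calibrated / $row->total : 0.0;
```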

Should have (differentiators):

  • D-1: Mastery-based difficulty category auto-recommendation -- maps per-knowledge-point mastery to difficulty category, eliminating manual tier selection
  • D-2: Shadow mode comparison logging -- runs both raw and calibrated paths in parallel without affecting output
  • D-5: Outlier quarantine with human review -- flags questions whose calibrated value deviates from the original by more than 0.30

Defer (v2+):

  • D-3 (Health dashboard): Needs 30+ days of accumulated snapshot data
  • D-4 (Drift detection): Requires health dashboard running first
  • D-6 (Per-knowledge-point breakdown): Nice-to-have analytics
  • Full CAT (computerized adaptive testing): Fundamentally different architecture, not this scope

Architecture Approach

The target architecture adds five new service classes as layers on top of the existing system, following a strict dependency chain. The calibration algorithm itself is NOT modified -- the project validates and connects it rather than rewriting it.

Major components:

  1. CalibrationBacktestService -- temporal walk-forward validation producing PASS/FAIL gate; blocks all downstream work
  2. DifficultyNormalizationService -- centralizes the existing normalizeDifficultyValue() logic at the question-loading boundary
  3. QuestionDifficultyResolver (existing, needs wiring) -- applies the calibrated-over-original priority chain in the assembly path (see the sketch after this list)
  4. CalibrationVerificationGate -- post-calibration sanity checks: range validation, outlier flagging, quarantine
  5. DifficultyCategoryRecommender -- maps MasteryCalculator output (per-knowledge-point mastery) to difficulty category via empirical thresholds
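
A rough sketch of component 3's priority chain, combined with the minimum-sample guard recommended under Pitfall 3 below; the class body, field names, and threshold constant are illustrative, not the actual QuestionDifficultyResolver API.

```php
final class DifficultyResolverSketch
{
    // Pitfall 3 recommends a 15-20 minimum before trusting calibration.
    private const MIN_WEIGHTED_ATTEMPTS = 15;

    /**
     * Prefer the calibrated value when it is backed by enough attempts;
     * otherwise fall back to the question's original (normalized) difficulty.
     */
    public function resolve(object $question, ?object $calibration): float
    {
        if ($calibration !== null
            && $calibration->weighted_attempts >= self::MIN_WEIGHTED_ATTEMPTS
        ) {
            return (float) $calibration->calibrated_difficulty;
        }

        return (float) $question->difficulty;
    }
}
```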

Critical Pitfalls

  1. Circular validation (Pitfall 1) -- Testing calibration on the same data that produced it gives falsely good metrics. Prevention: strict temporal train/test split, walk-forward validation.
  2. Dual difficulty scale corruption (Pitfall 2) -- The > 1.0 heuristic for distinguishing 0-1 from 0-5 scale fails for values in the 0.0-1.0 overlap zone. Prevention: audit distribution, add difficulty_scale flag column, or one-time migration (sketched after this list).
  3. Low sample size dominance (Pitfall 3) -- Online calibration mode has no minimum sample threshold; a single answer can shift difficulty by 0.11. Prevention: enforce minimum weighted_attempts threshold (15-20) in the resolver before using calibrated values.
  4. Health monitor degeneracy (Pitfall 8) -- Bad initial calibration triggers health monitor caution, which slows correction, which keeps predictions bad, which triggers more caution. Prevention: raise health_scale floor to 0.6, add reset mechanism after 14 days below 0.7.
  5. Stratified baseline self-reference (Pitfall 4) -- Baselines computed from the same responses being calibrated create regression-to-the-mean artifacts. Prevention: leave-one-out baseline computation, audit stratum sizes.
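
To make the overlap-zone failure in Pitfall 2 concrete, here is a sketch contrasting the heuristic with the flag-based fix; normalizeDifficultyValue() exists in the codebase, but these bodies and the difficulty_scale column are illustrative.

```php
// Heuristic (illustrative): values above 1.0 are assumed to be 0-5 scale.
// A 0-5 value of 0.5 ("very easy") slips through unconverted and is read
// as 0.5 on the 0-1 scale ("medium") -- the overlap-zone failure.
function normalizeByHeuristic(float $value): float
{
    return $value > 1.0 ? $value / 5.0 : $value;
}

// Safer: record each question's scale explicitly (one-time migration or a
// difficulty_scale flag column), then convert by flag instead of guessing.
function normalizeByFlag(float $value, string $scale): float
{
    return $scale === '0-5' ? $value / 5.0 : $value;
}
```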

Implications for Roadmap

Phase 1: Validation and Data Audit

Rationale: Everything downstream depends on knowing whether the calibration algorithm actually works. This phase must complete before any production wiring. Scale normalization must happen first because inconsistent input data produces meaningless validation results.

Delivers: PASS/FAIL verdict on calibration accuracy, data quality audit revealing scale and coverage issues, consistent 0-1 difficulty scale across all questions

Addresses: TS-1 (backtesting), TS-2 (scale normalization), TS-5 (coverage metrics), TS-6 (confidence indicators), D-2 (shadow mode logging -- can activate immediately after PASS)

Avoids: Pitfall 1 (circular validation), Pitfall 2 (dual scale), Pitfall 3 (low sample size -- audit reveals extent), Pitfall 7 (time decay -- test with and without), Pitfall 8 (health monitor degeneracy -- audit behavior during validation), Pitfall 16 (default 0.5 contamination -- count during audit)
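
A minimal sketch of the walk-forward split at the heart of this phase, assuming attempt rows carry an answered_at DateTimeImmutable; the window length is illustrative.

```php
/**
 * Walk-forward validation: for each cutoff, train on everything strictly
 * before it and test on the window that follows, so test data never leaks
 * into training (the Pitfall 1 mitigation).
 */
function walkForwardSplits(
    array $attempts,           // rows with an 'answered_at' DateTimeImmutable
    DateTimeImmutable $start,
    DateTimeImmutable $end,
    DateInterval $window       // e.g. new DateInterval('P30D')
): iterable {
    for ($cutoff = $start; $cutoff < $end; $cutoff = $cutoff->add($window)) {
        $testEnd = $cutoff->add($window);
        yield [
            'train' => array_filter($attempts, fn ($a) => $a['answered_at'] < $cutoff),
            'test'  => array_filter($attempts, fn ($a) => $a['answered_at'] >= $cutoff
                                                       && $a['answered_at'] < $testEnd),
        ];
    }
}
```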

Phase 2: Assembly Integration

Rationale: Requires Phase 1 PASS gate open. Wires the validated calibration into the production exam assembly path using the gate-based activation pattern. Shadow mode runs first, then per-assembly-type rollout with feature flags.

Delivers: Calibrated difficulty used in all exam assembly paths, difficulty distribution enabled by default, confidence-based fallback for low-sample questions

Addresses: TS-3 (calibrated difficulty in assembly), TS-4 (distribution enabled), AF-5 enforcement (minimum sample size)

Avoids: Pitfall 5 (no A/B testing -- shadow mode + per-type flags), Pitfall 6 (boundary effects -- soft boundaries), Pitfall 9 (mode inconsistency -- align thresholds), Pitfall 10 (pool exhaustion -- pre-analyze coverage), Pitfall 15 (race condition -- row-level locks)

Uses: math-php for ongoing monitoring metrics, CalibrationVerificationGate for outlier quarantine
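
Shadow mode can be as small as the sketch below, which reuses the resolver sketch from the Architecture section: both paths run, only the raw value affects output, and the disagreement is logged for offline comparison. The log channel and field names are assumptions.

```php
use Illuminate\Support\Facades\Log;

/**
 * Shadow mode: resolve difficulty both ways, log the delta, and keep
 * serving the raw value so production output is unaffected.
 */
function shadowResolve(
    object $question,
    ?object $calibration,
    DifficultyResolverSketch $resolver
): float {
    $raw        = (float) $question->difficulty;
    $calibrated = $resolver->resolve($question, $calibration);

    Log::info('calibration.shadow', [
        'question_id' => $question->id,
        'raw'         => $raw,
        'calibrated'  => $calibrated,
        'delta'       => abs($calibrated - $raw),
    ]);

    return $raw;
}
```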

Phase 3: Adaptive Matching

Rationale: Requires calibrated assembly working reliably in production (Phase 2 stable). Builds the intelligence layer that maps student mastery to recommended difficulty, closing the "answer -> calibrate -> precise exam -> re-answer" loop described in the project vision.

Delivers: Per-knowledge-point difficulty category recommendation, automatic exam tier selection, student-facing confidence indicator

Addresses: D-1 (mastery-based auto-recommendation), D-5 (outlier quarantine), D-7 (student confidence indicator)

Avoids: Pitfall 11 (question type heterogeneity -- guessing correction for choice questions), Pitfall 12 (mastery mapping without ground truth -- use historical data to find optimal mapping), Pitfall 13 (feedback loop divergence -- track drift, periodic expert anchoring)
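
Using the starting-point thresholds listed under Gaps to Address (mastery 0.30 -> category 1, 0.50 -> category 2), the recommender's core mapping might look like this sketch; the upper cut points and category numbering are assumptions pending the empirical fitting described above.

```php
/**
 * Map per-knowledge-point mastery (0-1) to a difficulty category.
 * The 0.30/0.50 cut points are the unvalidated starting points from the
 * Gaps section; Phase 3 should refit all of them against historical score
 * data so recommended exams land in the 60-75% correct target zone.
 */
function recommendCategory(float $mastery): int
{
    return match (true) {
        $mastery < 0.30 => 1,
        $mastery < 0.50 => 2,
        $mastery < 0.70 => 3, // assumed cut point, not from the research
        default         => 4, // assumed top category
    };
}
```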

Phase 4: Health Monitoring (Ongoing)

Rationale: Needs 30+ days of production calibration data flowing through the wired pipeline. Builds longitudinal tracking and drift detection infrastructure.

Delivers: Daily health snapshots, drift detection alerts, calibration coverage tracking over time

Addresses: D-3 (health dashboard), D-4 (drift detection)

Avoids: Pitfall 13 (feedback loop divergence -- provides early warning), Pitfall 14 (JSON bloat -- extract health metrics to dedicated columns)
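
A sketch of the deferred drift check, using the 7-day vs 30-day rolling-window comparison and >20% flag cited (at low confidence) under Sources; both parameters should be tuned once snapshot data accumulates.

```php
/**
 * Flag drift when the recent 7-day mean Brier score exceeds the trailing
 * 30-day mean by more than 20%. Window lengths and the 20% threshold are
 * the low-confidence defaults from the tertiary sources.
 */
function isDrifting(array $brierLast7Days, array $brierLast30Days): bool
{
    if ($brierLast7Days === [] || $brierLast30Days === []) {
        return false; // not enough data to judge
    }

    $recent   = array_sum($brierLast7Days) / count($brierLast7Days);
    $baseline = array_sum($brierLast30Days) / count($brierLast30Days);

    return $baseline > 0 && ($recent - $baseline) / $baseline > 0.20;
}
```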

Phase Ordering Rationale

  • Phase 1 must come first because connecting unvalidated calibration to production could make exams worse, not better. The PASS/FAIL gate is a hard prerequisite.
  • Scale normalization (TS-2) must precede or coincide with backtesting (TS-1) because inconsistent input data produces meaningless validation results.
  • Phase 2 (assembly wiring) is separate from Phase 1 (validation) because they have different risk profiles: validation is read-only analysis, wiring modifies production behavior.
  • Phase 3 (adaptive matching) requires Phase 2 because the category recommender needs calibrated difficulty actively working in the assembly pipeline to validate its recommendations.
  • Phase 4 (health monitoring) is last because it needs accumulated production data from the wired pipeline to be meaningful.
  • The critical path is: TS-2 -> TS-1 -> TS-3 -> TS-4 -> D-1 -- scale normalization before backtesting, consistent with the ordering constraint above and the feature dependencies in FEATURES.md.

Research Flags

Phases likely needing deeper research during planning:

  • Phase 1 (Validation): The backtesting approach (walk-forward with Brier score) is well-documented, but the specific PASS/FAIL thresholds (Pearson > 0.3, Mean Brier < 0.20) need validation against actual data distributions. The dual scale audit may reveal a worse problem than expected, requiring a data migration strategy. The stratified baseline self-reference issue (Pitfall 4) may require leave-one-out computation, which has performance implications for large datasets.
  • Phase 3 (Adaptive Matching): The mastery-to-category threshold mapping needs empirical grounding using historical (mastery, difficulty_category, score) data. This is not a standard library problem -- it requires domain-specific analysis. Question type heterogeneity (choice vs fill-in vs open-ended) may require separate mappings.

Phases with standard patterns (skip research-phase):

  • Phase 2 (Assembly Integration): Wiring existing services together, adding feature flags, shadow mode logging. All architectural patterns are well-defined in ARCHITECTURE.md.
  • Phase 4 (Health Monitoring): Standard scheduled job + dashboard pattern. The computation logic already exists in getHealthScaleForType().

Confidence Assessment

| Area | Confidence | Notes |
| --- | --- | --- |
| Stack | HIGH | Direct codebase analysis + Packagist verification. Only one new dependency (math-php). Existing stack well-understood. |
| Features | HIGH | Derived from codebase gap analysis (what exists vs what is disconnected). Competitive comparison (ALEKS, IXL, Khan Academy) at MEDIUM confidence for specific feature claims. |
| Architecture | HIGH | Derived from existing component structure. All new services follow established Laravel patterns. Dependency chain is clear. |
| Pitfalls | HIGH | Codebase pitfalls directly observed in source code. Domain pitfalls (IRT, calibration theory) are well-established in psychometric literature. |

Overall confidence: HIGH

Gaps to Address

  • Dual scale severity unknown: The > 1.0 heuristic may work fine if most questions use the same scale, or it may silently corrupt 30%+ of difficulty values. Phase 1 must start with a data distribution audit before any validation can be trusted.
  • Brier Skill Score threshold: Research does not establish what constitutes a "good enough" BSS for K12 math difficulty calibration (BSS = 1 - BS/BS_ref, where BS_ref is the Brier score of a naive reference forecast such as always predicting the base rate). Industry practice varies by domain. Recommend running the backtest first, then setting the threshold based on the distribution of results.
  • Calibration algorithm parameter sensitivity: The existing algorithm has multiple tuning parameters (half_life_days, max_step, shrinkage constants). The backtest should test at least 2-3 parameter settings to understand sensitivity, but the research recommends against automatic tuning (AF-6).
  • Question pool coverage by knowledge point: Enabling difficulty distribution requires sufficient questions at each difficulty level per knowledge point. If certain knowledge points have sparse pools, the distribution strategy will fail or fall back heavily. This needs a coverage analysis before Phase 2 rollout.
  • Mastery-to-category mapping thresholds: The proposed thresholds (mastery 0.30 -> category 1, 0.50 -> category 2, etc.) are reasonable starting points but need empirical validation against historical score data. The "zone of proximal development" target (60-75% correct) should drive the actual thresholds.
  • Sample size requirements: The minimum number of attempts before a calibrated difficulty value should be used in production is set at 5 in the analyzer but 10 in the calibration constraints. The optimal threshold should be explored during backtesting.
  • Long-term calibration drift: The algorithm has a 45-day half-life for time decay, but the system lacks explicit drift detection. Formal drift detection (comparing rolling windows) is deferred to Phase 4.

Sources

Primary (HIGH confidence)

  • Direct codebase analysis: QuestionDifficultyCalibrationService (974 lines), QuestionDifficultyCalibrationAnalyzer (608 lines), QuestionDifficultyResolver (88 lines), DifficultyDistributionService (219 lines), IntelligentExamController (1267 lines), LearningAnalyticsService, MasteryCalculator, ExamAnswerAnalysisService
  • markrogoyski/math-php v2.13.0 on Packagist -- verified actively maintained, pure PHP, 3k+ GitHub stars
  • Brier score decomposition and proper scoring rules -- standard psychometric methodology
  • IRT theory and CAT algorithms -- well-established in psychometric literature (Lord & Novick, Wainer)

Secondary (MEDIUM confidence)

  • ALEKS Knowledge Space Theory -- proprietary, details inferred from published research and documentation
  • IXL Real-Time Diagnostic and Trouble Spots -- feature descriptions from product documentation
  • Khan Academy mastery system -- based on published descriptions and training data
  • Adaptive testing design patterns -- Wainer "Computerized Adaptive Testing: A Primer"

Tertiary (LOW confidence)

  • Student-facing confidence indicator UX patterns -- inferred from Khan Academy progress bars, not directly verified in this domain
  • Calibration drift detection thresholds (7-day vs 30-day comparison, >20% increase flag) -- reasonable starting points, but not empirically validated

Research completed: 2026-04-16
Ready for roadmap: yes