
docs: complete project research for difficulty calibration & intelligent exam

Synthesize STACK, FEATURES, ARCHITECTURE, and PITFALLS research into
unified SUMMARY.md with roadmap implications and phase suggestions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
yemeishu 4 weeks ago
parent
commit
554bacec14

+ 566 - 0
.planning/research/ARCHITECTURE.md

@@ -0,0 +1,566 @@
+# Architecture Patterns
+
+**Domain:** K12 math difficulty calibration and intelligent exam matching
+**Researched:** 2026-04-16
+**Confidence:** HIGH (based on direct codebase analysis); MEDIUM (general adaptive testing patterns from training data)
+
+## Current Architecture (As-Built)
+
+```
+                           ANSWER FLOW (already working)
+                           ============================
+
+  Student answers exam
+         |
+         v
+  ExamAnswerAnalysisService.analyzeExamAnswers()
+         |
+         +---> MasteryCalculator (knowledge point mastery)
+         +---> KnowledgeMasteryService (persist mastery)
+         +---> LocalAIAnalysisService (update mastery)
+         +---> MistakeBookService (add to mistake book)
+         +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
+                  |
+                  v
+         question_difficulty_calibrations table (upsert)
+
+
+                           ASSEMBLY FLOW (partially working)
+                           ================================
+
+  POST /api/intelligent-exam
+         |
+         v
+  IntelligentExamController.store()
+         |
+         v
+  AssembleExamTaskJob (queued)
+         |
+         v
+  LearningAnalyticsService.generateIntelligentExam()
+         |
+         +---> selectQuestions() -- uses raw questions.difficulty
+         +---> applyTypeAwareDifficultyDistribution()
+         |         |
+         |         v
+         |    DifficultyDistributionService
+         |    (only if enable_difficulty_distribution=true,
+         |     which defaults to FALSE)
+         |
+         v
+  QuestionDifficultyResolver.applyCalibratedDifficulty()
+  (exists but NOT called in the main assembly path)
+```
+
+### Identified Gaps in Current Architecture
+
+| Gap | Location | Impact |
+|-----|----------|--------|
+| Calibration values not used in assembly | `LearningAnalyticsService` selects questions using raw `questions.difficulty` | Assembled exams use uncalibrated difficulty |
+| `enable_difficulty_distribution` defaults false | `LearningAnalyticsService` line 1554 | Distribution strategy never activates unless caller explicitly enables |
+| No auto difficulty_category recommendation | No service maps mastery to category | Teachers must manually pick tier; no student-level adaptation |
+| No backtesting validation | `QuestionDifficultyCalibrationAnalyzer` reports but does not validate | Algorithm accuracy unknown before production use |
+| Dual difficulty scale (0-1 vs 0-5) | `normalizeDifficultyValue()` divides by 5 if > 1.0 | Inconsistent source data enters calibration |
+
+## Recommended Target Architecture
+
+### Component Diagram
+
+```
++===================================================================================+
+|                              TARGET ARCHITECTURE                                   |
++===================================================================================+
+
+ LAYER 1: VALIDATION (must complete before anything else)
+ +--------------------------------------------------------+
+ |                                                        |
+ |  CalibrationBacktestService                            |
+ |    |-- backtestAgainstHistory(cutoffDate)              |
+ |    |-- computeBrierScores(questionIds)                 |
+ |    |-- computePearsonCorrelation()                     |
+ |    |-- produceValidationReport()                       |
+ |    +-- PASS/FAIL gate: algo accuracy threshold         |
+ |                                                        |
+ |  Data: reads paper_questions + questions (historical)  |
+ |  Writes: backtest_results table (or JSON export)       |
+ +--------------------------------------------------------+
+           |
+           | PASS gate opens production use
+           v
+
+ LAYER 2: CALIBRATION FEEDBACK LOOP (enhance existing)
+ +--------------------------------------------------------+
+ |                                                        |
+ |  ExamAnswerAnalysisService                            |
+ |    |-- (existing) analyzeExamAnswers()                 |
+ |    +-- (existing) recalibrateQuestionDifficulty()      |
+ |                                                        |
+ |  QuestionDifficultyCalibrationService                  |
+ |    |-- updateOnlineFromPaper()   [existing, per-paper] |
+ |    |-- recalibrateQuestionIds()  [existing, batch]     |
+ |    +-- getHealthScaleForType()   [existing, monitoring]|
+ |                                                        |
+ |  NEW: CalibrationVerificationGate                      |
+ |    |-- validateCalibratedRange(questionIds)             |
+ |    |-- flagOutliers(threshold)                          |
+ |    +-- quarantineBadCalibrations()                      |
+ |                                                        |
+ |  Data: answer --> calibrate --> verify --> use         |
+ +--------------------------------------------------------+
+           |
+           v
+
+ LAYER 3: ASSEMBLY INTEGRATION (connect calibration to exam)
+ +--------------------------------------------------------+
+ |                                                        |
+ |  DifficultyNormalizationService  [NEW]                 |
+ |    |-- normalize(questionId) -> float [0,1]            |
+ |    |-- batchNormalize(questionIds) -> map              |
+ |    +-- resolves 0-1 vs 0-5 ambiguity at read time     |
+ |                                                        |
+ |  QuestionDifficultyResolver  [existing, expand usage]  |
+ |    |-- applyCalibratedDifficulty(questions) -> arr     |
+ |    +-- MUST be called in assembly path                 |
+ |                                                        |
+ |  LearningAnalyticsService                              |
+ |    |-- generateIntelligentExam()                       |
+ |    |   +-- CALL DifficultyNormalizationService first    |
+ |    |   +-- CALL QuestionDifficultyResolver second       |
+ |    |   +-- SET enable_difficulty_distribution = true    |
+ |    +-- remove hard-coded default false                 |
+ |                                                        |
+ |  DifficultyDistributionService  [existing]             |
+ |    |-- calculateDistribution(category, total)          |
+ |    +-- groupQuestionsByDifficultyRange()               |
+ +--------------------------------------------------------+
+           |
+           v
+
+ LAYER 4: ADAPTIVE MATCHING (mastery-based difficulty selection)
+ +--------------------------------------------------------+
+ |                                                        |
+ |  DifficultyCategoryRecommender  [NEW]                  |
+ |    |-- recommendForStudent(studentId, kpCodes) -> cat  |
+ |    |-- recommendForKnowledgePoint(studentId, kp) -> cat|
+ |    +-- uses MasteryCalculator + calibration data       |
+ |                                                        |
+ |  MasteryCalculator  [existing]                         |
+ |    |-- calculateMasteryLevel(studentId, kpCode)        |
+ |    +-- returns mastery [0,1] + confidence + trend      |
+ |                                                        |
+ |  Mapping logic:                                        |
+ |    mastery [0.0, 0.30) -> category 0 (zero-foundation) |
+ |    mastery [0.30, 0.50) -> category 1 (foundation)     |
+ |    mastery [0.50, 0.70) -> category 2 (intermediate)   |
+ |    mastery [0.70, 0.85) -> category 3 (advanced)       |
+ |    mastery [0.85, 1.00] -> category 4 (competition)   |
+ +--------------------------------------------------------+
+           |
+           v
+
+ LAYER 5: HEALTH MONITORING (continuous)
+ +--------------------------------------------------------+
+ |                                                        |
+ |  CalibrationHealthMonitor  [NEW]                       |
+ |    |-- detectDrift(windowDays) -> drift report         |
+ |    |-- accuracyTrend(days) -> accuracy over time       |
+ |    |-- calibrationCoverage() -> % questions calibrated |
+ |    +-- scheduled artisan command (daily/weekly)        |
+ |                                                        |
+ |  Existing health mechanisms:                           |
+ |    |-- getHealthScaleForType() in CalibrationService   |
+ |    +-- recent_events in algorithm_meta (per-question)  |
+ |                                                        |
+ |  NEW: calibration_health_snapshots table               |
+ |    |-- date, total_calibrated, avg_brier,              |
+ |    |   coverage_pct, drift_flag, action                |
+ +--------------------------------------------------------+
+```
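+
+As a minimal sketch of the Layer 4 threshold mapping above (the `DifficultyCategoryRecommender` does not exist yet, so the function name is illustrative):
+
+```php
+// Hypothetical mapping of the Layer 4 threshold table; boundaries are
+// half-open, so a mastery of exactly 0.30 lands in category 1.
+function masteryToCategory(float $mastery): int
+{
+    return match (true) {
+        $mastery < 0.30 => 0, // zero-foundation
+        $mastery < 0.50 => 1, // foundation
+        $mastery < 0.70 => 2, // intermediate
+        $mastery < 0.85 => 3, // advanced
+        default         => 4, // competition
+    };
+}
+```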
+
+### Component Boundaries
+
+| Component | Responsibility | Communicates With | New/Existing |
+|-----------|---------------|-------------------|--------------|
+| `CalibrationBacktestService` | Validate algorithm accuracy against historical data | Reads `paper_questions`, `questions`, `papers`. Writes report output. | NEW |
+| `QuestionDifficultyCalibrationService` | Core calibration algorithm (stratified_residual_eb_v2) | Called by `ExamAnswerAnalysisService`, `CalibrationBacktestService` | EXISTING |
+| `CalibrationVerificationGate` | Post-calibration sanity checks (range, outlier detection) | Reads `question_difficulty_calibrations`. Flags problematic entries. | NEW |
+| `DifficultyNormalizationService` | Unify 0-1 / 0-5 scale at read boundary | Called by `LearningAnalyticsService` during question loading | NEW |
+| `QuestionDifficultyResolver` | Apply calibrated difficulty to question arrays, calibrated-first | Called in assembly path by `LearningAnalyticsService` | EXISTING (needs wiring) |
+| `DifficultyDistributionService` | Calculate difficulty buckets per category | Called by `LearningAnalyticsService` when distribution enabled | EXISTING |
+| `DifficultyCategoryRecommender` | Map student mastery to recommended difficulty category | Reads `MasteryCalculator`. Used by `IntelligentExamController` | NEW |
+| `MasteryCalculator` | Calculate per-knowledge-point mastery levels | Existing, unchanged | EXISTING |
+| `CalibrationHealthMonitor` | Detect calibration drift, coverage gaps, accuracy degradation | Reads `question_difficulty_calibrations`. Writes health snapshots. | NEW |
+| `LearningAnalyticsService` | Orchestrate question selection and difficulty distribution | Must call normalization + resolver + distribution | EXISTING (needs modification) |
+
+### Data Flow
+
+```
+QUESTION DIFFICULTY LIFECYCLE
+=============================
+
+  questions.difficulty (original, immutable)
+         |
+         v
+  DifficultyNormalizationService.normalize()
+         |  (resolves 0-1 vs 0-5, stores original_difficulty)
+         v
+  question_difficulty_calibrations.original_difficulty
+         |
+         +---[calibration loop]---> calibrated_difficulty
+         |                              |
+         v                              v
+  QuestionDifficultyResolver.applyCalibratedDifficulty()
+         |
+         |  Returns: calibrated if exists, else original (normalized)
+         v
+  DifficultyDistributionService.groupQuestionsByDifficultyRange()
+         |
+         v
+  Selected questions for exam assembly
+
+
+ANSWER-TO-CALIBRATION FEEDBACK LOOP
+====================================
+
+  Student submits exam answers
+         |
+         v
+  ExamAnswerAnalysisService.analyzeExamAnswers()
+         |
+         +---> MasteryCalculator (update knowledge mastery)
+         +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
+                  |
+                  +-- per-question: compute residual, apply shrinkage, clamp
+                  +-- upsert to question_difficulty_calibrations
+                  +-- append to recent_events in algorithm_meta
+                  |
+                  v
+         CalibrationVerificationGate (NEW)
+                  |
+                  +-- check calibrated_difficulty in [0.01, 0.99]
+                  +-- flag if delta > 0.30 from original
+                  +-- quarantine if Brier score deteriorating
+                  |
+                  v
+         Health monitor caches invalidated
+         (getHealthScaleForType will recompute next call)
+
+
+BACKTESTING VALIDATION FLOW
+============================
+
+  CalibrationBacktestService.backtestAgainstHistory(cutoffDate)
+         |
+         v
+  1. Load all questions with >= N attempts before cutoffDate
+  2. Split: training set (before cutoff) vs test set (after cutoff)
+  3. Run calibration on training data only
+  4. For each question in test set:
+     - predicted = calibrated_difficulty from training
+     - actual = observed error rate in test period
+     - Brier score = (predicted - actual)^2
+  5. Aggregate metrics:
+     - Mean Brier score (lower = better, < 0.15 is acceptable)
+     - Pearson correlation (predicted vs actual, > 0.4 is acceptable)
+     - Calibration coverage (% questions with enough data)
+     - MAE (mean absolute error, < 0.15 is acceptable)
+  6. PASS gate:
+     - Pearson > 0.3 AND Mean Brier < 0.20
+     - If FAIL: algorithm needs tuning, do NOT enable in production
+         |
+         v
+  Report: JSON/CSV output + PASS/FAIL verdict
+
+
+MASTERY-TO-DIFFICULTY MATCHING FLOW
+====================================
+
+  Exam request (student_id + kp_codes)
+         |
+         v
+  DifficultyCategoryRecommender.recommendForStudent()
+         |
+         +-- for each kp_code:
+         |    MasteryCalculator.calculateMasteryLevel(studentId, kp)
+         |    -> mastery [0,1], confidence, trend
+         |
+         +-- aggregate mastery across kp_codes (weighted average)
+         +-- map to category:
+         |    mastery -> category via threshold table
+         |    adjust for trend: trending up -> +0.5 category push
+         |    floor at 0, cap at 4
+         |
+         +-- return: recommended category + confidence + reasoning
+         |
+         v
+  IntelligentExamController uses recommended category
+         |
+         v
+  LearningAnalyticsService with enable_difficulty_distribution=true
+```
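+
+A minimal sketch of the step-5 aggregation and step-6 gate from the backtesting flow, assuming predicted/actual pairs have already been extracted from the temporal split (pure PHP, no framework dependencies):
+
+```php
+// Aggregate backtest metrics over rows of ['predicted' => p, 'actual' => a],
+// where p is the calibrated difficulty and a is the observed error rate.
+function aggregateBacktestMetrics(array $rows): array
+{
+    $n = count($rows);
+    if ($n < 2) {
+        throw new InvalidArgumentException('Need at least two questions to correlate.');
+    }
+
+    $brier = 0.0; $mae = 0.0;
+    $sumP = 0.0; $sumA = 0.0; $sumPP = 0.0; $sumAA = 0.0; $sumPA = 0.0;
+    foreach ($rows as $r) {
+        [$p, $a] = [$r['predicted'], $r['actual']];
+        $brier += ($p - $a) ** 2;
+        $mae   += abs($p - $a);
+        $sumP  += $p;      $sumA  += $a;
+        $sumPP += $p * $p; $sumAA += $a * $a; $sumPA += $p * $a;
+    }
+    $brier /= $n;
+    $mae   /= $n;
+
+    // Pearson correlation between predicted and actual values
+    $cov = $sumPA / $n - ($sumP / $n) * ($sumA / $n);
+    $sd  = sqrt(max($sumPP / $n - ($sumP / $n) ** 2, 1e-12))
+         * sqrt(max($sumAA / $n - ($sumA / $n) ** 2, 1e-12));
+    $pearson = $cov / $sd;
+
+    return [
+        'mean_brier' => $brier,
+        'mae'        => $mae,
+        'pearson'    => $pearson,
+        'pass'       => $pearson > 0.3 && $brier < 0.20, // gate from step 6
+    ];
+}
+```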
+
+## Patterns to Follow
+
+### Pattern 1: Gate-Based Progressive Activation
+
+**What:** A calibration value must pass validation before it can influence production behavior. Once validated, components progressively unlock.
+**When:** Any system where unvalidated statistical estimates would harm user experience.
+**Why this matters here:** The project explicitly requires "validation before production use." The current code already runs calibration, but does not connect it to assembly -- which is correct for now; the backtest gate formalizes the transition from disconnected to active.
+
+```
+Gate states:
+  LOCKED   -- calibration runs, values stored, NOT used in assembly
+  TESTED   -- backtest passed, enable for shadow mode (log but don't act)
+  ACTIVE   -- fully enabled in production assembly path
+```
+
+**Example implementation:**
+
+```php
+// config/calibration.php (or a database-backed setting)
+return ['gate' => 'locked'];  // locked | tested | active
+
+// In LearningAnalyticsService, during assembly:
+if (config('calibration.gate') === 'active') {
+    $questions = $resolver->applyCalibratedDifficulty($questions);
+}
+```
+
+### Pattern 2: Difficulty Source Priority Chain
+
+**What:** When multiple difficulty values exist for a question, follow a deterministic priority chain rather than ad-hoc logic.
+**When:** Any lookup where calibrated, original, and estimated values coexist.
+**Why:** The current `QuestionDifficultyResolver` already implements this pattern correctly (calibrated > original). It just needs to be consistently called.
+
+```php
+// Priority chain (already implemented in QuestionDifficultyResolver):
+// 1. calibrated_difficulty (from question_difficulty_calibrations)
+// 2. normalized questions.difficulty (0-1 scale, divide-by-5 if needed)
+// 3. fallback 0.5 (moderate default)
+```
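+
+As a standalone sketch of the chain (illustrative only -- `QuestionDifficultyResolver` already implements it, and its real signature may differ):
+
+```php
+function resolveDifficulty(?float $calibrated, ?float $original): float
+{
+    if ($calibrated !== null) {
+        return $calibrated;          // 1. calibrated overlay wins
+    }
+    if ($original !== null) {
+        return $original > 1.0       // 2. normalize legacy 0-5 values
+            ? $original / 5.0
+            : $original;
+    }
+    return 0.5;                      // 3. moderate default
+}
+```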
+
+### Pattern 3: Shadow Mode Before Activation
+
+**What:** Before enabling calibrated difficulty in actual exam assembly, run both paths in parallel and compare results without affecting output.
+**When:** Connecting a validated but previously-disconnected statistical system to production.
+**Why:** Even with backtest validation, real-time behavior may differ from historical backtest. Shadow mode catches integration bugs.
+
+```php
+// In LearningAnalyticsService assembly:
+$rawQuestions = $selectedQuestions; // current behavior
+$calibratedQuestions = $resolver->applyCalibratedDifficulty($rawQuestions);
+
+// Log comparison without using calibrated values yet
+Log::info('Shadow mode difficulty comparison', [
+    'raw_avg' => collect($rawQuestions)->avg('difficulty'),
+    'calibrated_avg' => collect($calibratedQuestions)->avg('difficulty'),
+    'diff_count' => count(array_filter($calibratedQuestions, fn($q) =>
+        ($q['difficulty_source'] ?? '') === 'calibrated'
+    )),
+]);
+
+// Use raw (unchanged behavior) until gate opens
+$selectedQuestions = $rawQuestions;
+```
+
+### Pattern 4: Stratified Baseline with Residual Adjustment
+
+**What:** Already implemented in the codebase. The calibration algorithm computes expected error rates per (question_type, difficulty_category) stratum, then adjusts based on the residual (observed - expected).
+**When:** This is the core calibration algorithm. No changes needed to the algorithm itself per project scope.
+
+The existing algorithm is well-structured:
+- Global baselines: `buildGlobalBaselines()` computes per-stratum error rates
+- Online update: `estimateOnlineBySingleOutcome()` processes one answer event
+- Batch update: `estimateByStratifiedResidual()` processes historical data
+- Health scaling: `getHealthScaleForType()` auto-reduces step size when degrading
+
+### Pattern 5: Time-Decay Weighted Statistics
+
+**What:** Weight recent observations more heavily than old ones using exponential decay. Already implemented with 45-day half-life.
+**When:** Any aggregation of student performance or calibration data.
+**Why:** K12 students improve; old responses are less predictive. The existing 45-day half-life is reasonable.
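+
+The decay weight itself is a one-liner; the 45-day default mirrors the documented half-life:
+
+```php
+// weight = 0.5^(age / half_life); decayWeight(45) == 0.5, decayWeight(90) == 0.25
+function decayWeight(float $ageDays, float $halfLifeDays = 45.0): float
+{
+    return pow(0.5, $ageDays / $halfLifeDays);
+}
+```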
+
+## Anti-Patterns to Avoid
+
+### Anti-Pattern 1: Backfilling questions.difficulty
+
+**What:** Writing calibrated values back to `questions.difficulty`.
+**Why bad:** Destroys the original reference value, makes debugging impossible, violates project constraint.
+**Instead:** Keep the dual-table design. `questions.difficulty` is append-only immutable. `question_difficulty_calibrations` is the mutable overlay.
+
+### Anti-Pattern 2: Global Difficulty Category Override
+
+**What:** Computing one difficulty_category for a student across all knowledge points and applying it everywhere.
+**Why bad:** A student may be advanced in algebra but beginner in geometry. Global category creates mismatched exams.
+**Instead:** Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation. Aggregate only when exam spans multiple knowledge points.
+
+### Anti-Pattern 3: Calibration Without Verification
+
+**What:** Wiring calibration directly into the assembly path without the backtest validation step.
+**Why bad:** If the algorithm has systematic bias (e.g., always overestimates difficulty for certain question types), it makes exams worse, not better.
+**Instead:** Backtest first. The backtest is a prerequisite gate, not an optional report.
+
+### Anti-Pattern 4: Dual-Scale Leakage
+
+**What:** Mixing 0-5 scale difficulty values with 0-1 scale values in the same computation.
+**Why bad:** A 0-5 value of 0.4 (easy) gets treated as 0.4 on 0-1 scale (hard), producing inverted difficulty estimates.
+**Instead:** Normalize at the read boundary. The existing `normalizeDifficultyValue()` in `QuestionDifficultyCalibrationService` handles this for calibration input, but `LearningAnalyticsService` does not normalize when loading questions for assembly. This must be fixed.
+
+### Anti-Pattern 5: Calibration Feedback Loop Without Rate Limiting
+
+**What:** Allowing calibration to update on every single answer without any dampening.
+**Why bad:** A single anomalous cohort (e.g., a class that all guesses randomly) can corrupt calibration values.
+**Instead:** The existing algorithm handles this well with: shrinkage (prior pulls toward original), step limits, minimum sample thresholds, and health scaling. Do not remove these safeguards.
+
+## Scalability Considerations
+
+| Concern | At 100 questions | At 10K questions | At 100K questions |
+|---------|------------------|-------------------|-------------------|
+| Calibration table size | Negligible | ~10K rows, fast with index on `question_bank_id` | ~100K rows; add composite index on `(calibrated_difficulty, updated_at)` |
+| Backtest computation | < 1 second | 5-30 seconds depending on attempt count | Minutes; run as queued job, cache results |
+| Per-answer calibration | < 10ms (single upsert) | < 10ms | < 10ms (indexed lookup + single upsert) |
+| Health monitoring | Negligible (scans recent rows) | 1-5 seconds (parsing algorithm_meta JSON) | 5-30 seconds; extract health metrics to dedicated columns |
+| Mastery-to-category recommendation | < 50ms (1 mastery lookup) | < 50ms | < 100ms (batch mastery lookup for multiple kp_codes) |
+| `applyCalibratedDifficulty` batch | < 5ms | < 20ms (WHERE IN) | < 100ms; add chunking for > 1000 IDs |
+
+## Suggested Build Order
+
+```
+Phase 1: VALIDATION (must be first -- blocks everything else)
+  1.1  CalibrationBacktestService
+       - Reads historical data
+       - Computes Pearson, Brier, MAE
+       - Produces PASS/FAIL report
+  DEPENDS ON: existing QuestionDifficultyCalibrationService, existing data
+  BLOCKS: Phase 3, 4 (Phase 2 can proceed in parallel)
+
+Phase 2: DIFFICULTY STANDARDIZATION (no behavioral change)
+  2.1  DifficultyNormalizationService
+       - Extract and centralize normalizeDifficultyValue() logic
+       - Apply at question-loading boundary in LearningAnalyticsService
+  DEPENDS ON: nothing new
+  BLOCKS: Phase 3 (need consistent scale before using calibration)
+
+Phase 3: ASSEMBLY INTEGRATION (wires calibration into production)
+  3.1  Wire QuestionDifficultyResolver into LearningAnalyticsService
+       - Call applyCalibratedDifficulty() in the assembly path
+       - Enable difficulty_distribution by default
+       - Add shadow mode logging first, then activate
+  3.2  CalibrationVerificationGate
+       - Post-upsert sanity checks
+       - Outlier quarantine
+  DEPENDS ON: Phase 1 (PASS gate), Phase 2 (consistent scale)
+  BLOCKS: Phase 4
+
+Phase 4: ADAPTIVE MATCHING (new feature)
+  4.1  DifficultyCategoryRecommender
+       - Map mastery -> category
+       - Per-kp and aggregate recommendations
+  4.2  Wire into IntelligentExamController
+       - Auto-fill difficulty_category when not specified
+  DEPENDS ON: Phase 3 (calibrated assembly working)
+
+Phase 5: HEALTH MONITORING (ongoing)
+  5.1  CalibrationHealthMonitor
+       - Scheduled artisan command
+       - Drift detection, coverage tracking
+       - calibration_health_snapshots table
+  5.2  Alert logic
+       - Flag when Brier degrades, coverage drops, drift detected
+  DEPENDS ON: Phase 3 (need production calibration data flowing)
+```
+
+### Dependency Graph
+
+```
+Phase 1 (Validation) ---------+
+                              |
+Phase 2 (Standardization) ----+---> Phase 3 (Assembly Integration)
+                                         |
+                                         v
+                                  Phase 4 (Adaptive Matching)
+                                         |
+                                         v
+                                  Phase 5 (Health Monitoring)
+```
+
+Phase 2 can run in parallel with Phase 1 since it does not depend on validation results. It only centralizes existing normalization logic. Phase 3 requires both Phase 1 PASS and Phase 2 completion. Phases 4 and 5 are sequential after Phase 3.
+
+## Data Model Additions
+
+### New Table: `calibration_health_snapshots`
+
+```sql
+CREATE TABLE calibration_health_snapshots (
+    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
+    snapshot_date DATE NOT NULL,
+    total_questions INT UNSIGNED DEFAULT 0,
+    calibrated_count INT UNSIGNED DEFAULT 0,
+    coverage_pct DECIMAL(5,2) DEFAULT 0,
+    avg_brier_score DECIMAL(8,6) DEFAULT NULL,
+    avg_logloss DECIMAL(8,6) DEFAULT NULL,
+    pearson_correlation DECIMAL(8,4) DEFAULT NULL,
+    mean_abs_residual DECIMAL(8,4) DEFAULT NULL,
+    health_scale_avg DECIMAL(5,3) DEFAULT NULL,
+    drift_flag TINYINT(1) DEFAULT 0,
+    drift_details JSON DEFAULT NULL,
+    action VARCHAR(32) DEFAULT 'none',
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    UNIQUE KEY idx_snapshot_date (snapshot_date)
+);
+```
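+
+The daily aggregation writing into this table could be driven by the scheduler; the command name below is hypothetical:
+
+```php
+// app/Console/Kernel.php
+use Illuminate\Console\Scheduling\Schedule;
+
+protected function schedule(Schedule $schedule): void
+{
+    $schedule->command('calibration:health-snapshot')->dailyAt('02:30');
+}
+```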
+
+### New Table: `backtest_results`
+
+```sql
+CREATE TABLE backtest_results (
+    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
+    run_id VARCHAR(64) NOT NULL,
+    cutoff_date DATE NOT NULL,
+    question_bank_id BIGINT UNSIGNED NOT NULL,
+    training_attempts INT UNSIGNED DEFAULT 0,
+    test_attempts INT UNSIGNED DEFAULT 0,
+    predicted_difficulty DECIMAL(6,4) DEFAULT NULL,
+    observed_error_rate DECIMAL(6,4) DEFAULT NULL,
+    brier_score DECIMAL(8,6) DEFAULT NULL,
+    absolute_error DECIMAL(6,4) DEFAULT NULL,
+    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
+    INDEX idx_run_id (run_id),
+    INDEX idx_cutoff (cutoff_date)
+);
+```
+
+No changes needed to existing tables. The `question_difficulty_calibrations` table schema is sufficient.
+
+## Key Design Decisions
+
+### Decision 1: Calibration is an Overlay, Not a Replacement
+
+The calibrated difficulty is an overlay on top of the original difficulty. The `QuestionDifficultyResolver` already implements this correctly: calibrated value takes priority, original value is the fallback. This must remain the design. Never write calibrated values back to `questions.difficulty`.
+
+### Decision 2: Gate-Based Activation, Not Feature Flags
+
+Use a deterministic gate (backtest PASS/FAIL) rather than a manual feature flag to enable calibration in production. The gate should be an artisan command that sets a config value or database flag after validation passes. This prevents human error from enabling an unvalidated algorithm.
+
+### Decision 3: Per-Knowledge-Point Difficulty Recommendation
+
+When recommending difficulty_category for a student, compute per-knowledge-point mastery and map each to a category. If an exam covers multiple knowledge points, use the weighted average of their recommended categories, weighted by the student's weakness level (weaker knowledge points get more weight to avoid overwhelming the student).
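+
+A sketch of that aggregation, assuming weight = (1 - mastery) as the weakness measure (the exact weighting is a design choice, not existing code):
+
+```php
+// Weight weaker knowledge points more heavily, then map the aggregate
+// mastery through the Layer 4 threshold table.
+function recommendCategory(array $kpMasteries): int
+{
+    $num = 0.0;
+    $den = 0.0;
+    foreach ($kpMasteries as $mastery) {
+        $weight = (1.0 - $mastery) + 0.1; // +0.1 keeps strong kps from vanishing
+        $num += $weight * $mastery;
+        $den += $weight;
+    }
+    $aggregate = $den > 0 ? $num / $den : 0.5;
+
+    return match (true) {
+        $aggregate < 0.30 => 0,
+        $aggregate < 0.50 => 1,
+        $aggregate < 0.70 => 2,
+        $aggregate < 0.85 => 3,
+        default           => 4,
+    };
+}
+```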
+
+### Decision 4: Health Monitoring is Separate from Calibration
+
+The existing `getHealthScaleForType()` already provides inline health adjustment. The new `CalibrationHealthMonitor` serves a different purpose: longitudinal tracking and alerting. It should NOT modify calibration behavior directly; instead, it produces reports that humans review to decide if algorithm parameters need adjustment.
+
+### Decision 5: Backtesting Uses Temporal Split, Not Random Split
+
+When validating the calibration algorithm, split data by time (cutoff date) rather than randomly. This is critical because:
+1. The algorithm includes time decay, so temporal ordering matters
+2. Random splits would leak future information into training
+3. Real deployment processes data chronologically
+
+## Sources
+
+- Direct codebase analysis of all referenced PHP service files (HIGH confidence)
+- Existing `question_difficulty_calibrations` migration schema (HIGH confidence)
+- Adaptive testing and IRT architecture patterns from training data (MEDIUM confidence -- standard patterns in psychometrics literature)
+- Brier score and calibration validation approaches from training data (MEDIUM confidence -- well-established statistical methodology)

+ 127 - 0
.planning/research/FEATURES.md

@@ -0,0 +1,127 @@
+# Feature Landscape
+
+**Domain:** K12 math difficulty calibration and intelligent exam matching
+**Researched:** 2026-04-16
+**Confidence:** HIGH (codebase analysis); MEDIUM (platform feature comparisons based on web research)
+
+## Table Stakes
+
+Features the system must have. Without these, difficulty calibration is meaningless and exam assembly produces poor matches.
+
+| # | Feature | Why Expected | Complexity | Notes |
+|---|---------|--------------|------------|-------|
+| TS-1 | **Historical data backtesting with temporal split** | Every validated adaptive system (ALEKS, Khan Academy, IXL) validates its parameter estimation against held-out data before production use. Without this, the entire calibration pipeline is an unverified hypothesis. | Medium | `QuestionDifficultyCalibrationAnalyzer` already computes per-question stats and Pearson correlation. Needs: temporal train/test split, aggregated Brier score, MAE, PASS/FAIL gate. The data (`paper_questions`, `questions`, `papers`) already exists. |
+| TS-2 | **Difficulty scale normalization (0-1 vs 0-5 unification)** | The codebase has `normalizeDifficultyValue()` in `QuestionDifficultyCalibrationService` for calibration input, but `LearningAnalyticsService` loads raw `questions.difficulty` without normalization during assembly. This is a correctness bug, not a nice-to-have. A 0-5 value of 0.4 treated as 0-1 (40% difficulty) inverts the intended difficulty. | Low | Extract and centralize the existing normalization logic. Apply at the question-loading boundary in `LearningAnalyticsService`. One new service class, no algorithm changes. |
+| TS-3 | **Calibrated difficulty used in exam assembly** | The calibration algorithm (`stratified_residual_eb_v2`) runs on every grading event (`updateOnlineFromPaper`), writes to `question_difficulty_calibrations`, and `QuestionDifficultyResolver.applyCalibratedDifficulty()` exists -- but it is not called in the main assembly path (`LearningAnalyticsService.generateIntelligentExam`). The calibration loop is complete but disconnected. | Low | Wire the existing `QuestionDifficultyResolver` call into `LearningAnalyticsService.selectQuestions()`. The resolver already handles the fallback chain (calibrated > normalized original). Requires TS-1 PASS gate to be open first. |
+| TS-4 | **Difficulty distribution enabled by default** | `enable_difficulty_distribution` defaults to `false` in `LearningAnalyticsService` line 1554. Only `ExamTypeStrategy` sets it to `true`. The API main path never activates distribution. Without distribution, all questions are selected by raw difficulty matching without the tiered low/medium/high bucket strategy. | Low | Change default to `true` after TS-1 and TS-2 are complete. The `DifficultyDistributionService` logic is fully implemented and tested. |
+| TS-5 | **Calibration coverage metrics** | Teachers and admins need to know what percentage of questions have calibrated values. ALEKS reports item parameter coverage; IXL shows diagnostic completeness. If only 20% of questions have calibration data, the system should not claim to use calibrated difficulty. | Low | Count `question_difficulty_calibrations` rows vs total active questions. Add to the existing `AnalyzeQuestionDifficultyCalibrationCommand` output. No new infrastructure needed. |
+| TS-6 | **Per-question calibration confidence indicator** | IRT systems report standard error of estimation per item. The existing `algorithm_meta` stores `recent_events` with Brier/log-loss per event, and the analyzer computes `calibration_effective_attempts` (time-decay weighted sample size). These need to surface as a single confidence metric per question. | Low | Derive from existing data: `effective_attempts >= 10` = HIGH confidence, `5-9` = MEDIUM, `< 5` = LOW. Already computed in the analyzer; just needs a reusable service method and API exposure. A sketch follows this table. |
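+
+The TS-6 tiering is trivial to express; the thresholds come from the table row above and the function name is hypothetical:
+
+```php
+function calibrationConfidence(float $effectiveAttempts): string
+{
+    if ($effectiveAttempts >= 10) { return 'HIGH'; }
+    if ($effectiveAttempts >= 5)  { return 'MEDIUM'; }
+    return 'LOW';
+}
+```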
+
+## Differentiators
+
+Features that set this platform apart from basic question banks. These create the "answer -> calibrate -> precise exam -> re-answer" closed loop.
+
+| # | Feature | Value Proposition | Complexity | Notes |
+|---|---------|-------------------|------------|-------|
+| D-1 | **Mastery-based difficulty category auto-recommendation** | Currently `difficulty_category` (0-4) is passed from the external API call with no validation against student level. ALEKS achieves 90%+ success on "ready-to-learn" concepts by matching difficulty to learner state. Auto-recommending category from mastery eliminates the teacher having to guess which tier to assign. | Medium | New `DifficultyCategoryRecommender` service. Maps mastery (from `MasteryCalculator`, already per-knowledge-point) to category via thresholds. Must handle multi-kp exams: weight weaker knowledge points more heavily. Depends on TS-3 (calibrated difficulty in assembly). |
+| D-2 | **Shadow mode comparison logging** | Before fully activating calibrated difficulty, run both paths (raw vs calibrated) in parallel and log comparisons. This is standard in production ML systems. Reveals real-world impact of calibration without risking exam quality. | Low | Log raw vs calibrated average difficulty, delta distribution, coverage percentage. No behavioral change. Can activate immediately after TS-1 passes. |
+| D-3 | **Calibration health monitoring dashboard** | The existing `getHealthScaleForType()` monitors Brier/log-loss delta and auto-scales step size (0.45-1.0x), but this data is only in logs. A dashboard showing: overall health scale, per-type Brier trend, coverage percentage, drift alerts, top outlier questions. IXL's "Trouble Spots" report is analogous for student-level analytics. | Medium | New `calibration_health_snapshots` table (daily aggregation). Scheduled artisan command to compute daily. Admin panel to visualize. The computation logic already exists in `getHealthScaleForType()`. |
+| D-4 | **Calibration drift detection and alerting** | Over time, question difficulty can shift (curriculum changes, student population changes). The existing algorithm has time decay (45-day half-life) but no explicit drift detection. Detect when calibrated values systematically diverge from recent observations. | Medium | Compare rolling 7-day vs 30-day average Brier score. If 7-day is significantly worse (>20% increase), flag as drift. The data is already in `algorithm_meta.recent_events`. Needs: scheduled check, notification mechanism. A sketch of the drift rule follows this table. |
+| D-5 | **Outlier quarantine with human review** | When a question's calibrated difficulty differs from original by more than a threshold (e.g., delta > 0.30), quarantine it for human review rather than silently using the extreme value. This catches edge cases: wrong answer keys, ambiguous questions, data entry errors. | Low | `CalibrationVerificationGate` checks delta after each calibration update. Flagged questions still get calibrated values but are marked for review. Admin UI shows quarantine list. |
+| D-6 | **Per-knowledge-point calibration accuracy breakdown** | The existing analyzer computes Pearson correlation globally. Breaking this down by knowledge point reveals where calibration works well and where it fails. Some knowledge points may have too few questions or too little variance for calibration to be meaningful. | Medium | Extend `QuestionDifficultyCalibrationAnalyzer` to group by `kp_code`. Requires joining `questions` -> `knowledge_points` which already exists. Surface in CLI report and API. |
+| D-7 | **Student-facing difficulty confidence indicator** | Show students a visual indicator of how well-matched the exam is to their level. Analogous to Khan Academy's mastery progress bars. "This exam is calibrated for your current level" vs "This exam covers new difficulty territory." | Low | Simple badge based on: was calibrated difficulty used? (difficulty_source='calibrated') AND coverage > 60%? Client-side rendering, backend just provides metadata. |
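+
+The D-4 drift rule reduces to a relative comparison; the 20% trigger is the suggestion from the table, and extracting the windowed Brier averages from `algorithm_meta.recent_events` is left out:
+
+```php
+// Flag drift when the rolling 7-day Brier is >20% worse than the 30-day Brier.
+function driftDetected(float $brier7d, float $brier30d): bool
+{
+    if ($brier30d <= 0.0) {
+        return false; // no usable baseline yet
+    }
+    return ($brier7d - $brier30d) / $brier30d > 0.20;
+}
+```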
+
+## Anti-Features
+
+Features to explicitly NOT build. These would harm the system or waste effort.
+
+| # | Anti-Feature | Why Avoid | What to Do Instead |
+|---|-------------|-----------|-------------------|
+| AF-1 | **Write calibrated values back to `questions.difficulty`** | Destroys the immutable original reference value. Makes debugging impossible. Violates the explicit project constraint. Creates data integrity risk -- if calibration goes wrong, the original is gone. | Keep dual-table design. `questions.difficulty` is append-only. `question_difficulty_calibrations` is the overlay. `QuestionDifficultyResolver` handles priority. |
+| AF-2 | **Full IRT 3PL model with discrimination and guessing parameters** | The system has a working algorithm (`stratified_residual_eb_v2`) that was designed for this specific use case. Switching to 3PL would require: (a) reimplementing the entire calibration engine, (b) much more data per question to estimate 3 parameters, (c) guessing parameter is meaningless for K12 math (students rarely guess systematically). | Validate the existing algorithm first (TS-1). Only consider algorithm changes if backtesting reveals systematic failure. The existing stratified baseline + residual + Bayesian shrinkage is well-suited. |
+| AF-3 | **Real-time adaptive testing (CAT) within a single exam** | CAT requires: calibrated item pool (we are building this), real-time ability estimation during exam, item selection algorithm, exposure control. This is a fundamentally different product (computer-based testing) from the current paper/worksheet assembly model. Massive scope expansion. | Focus on accurate pre-exam difficulty matching. The exam is assembled once, not adapted mid-session. The "adaptive" element is between exams (mastery changes -> different difficulty category next time). |
+| AF-4 | **Global difficulty category per student** | A student may be category 2 (intermediate) in algebra but category 0 (zero-foundation) in geometry. Assigning one global category produces mismatched exams -- too hard in geometry, too easy in algebra. | Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation (D-1). Aggregate with weakness weighting for multi-kp exams. |
+| AF-5 | **Calibration without minimum sample size enforcement** | Calibrating a question based on 1-2 answers produces wildly unreliable estimates. The algorithm already has `SHRINKAGE_M0_MIN = 8.0` as a prior strength, but the assembly path should not use calibrated values for questions below a sample threshold. | Enforce minimum effective attempts (e.g., >= 5) before using calibrated value in assembly. Below threshold, fall back to normalized original. The `QuestionDifficultyResolver` already has the data to implement this. |
+| AF-6 | **Automatic algorithm parameter tuning** | Automatically adjusting `alpha`, `max_step`, `half_life_days` based on backtest results. Sounds appealing but risks overfitting to historical data and removing human oversight from a critical algorithmic decision. | Provide backtest reports with parameter sensitivity analysis (run backtest at 3-5 parameter settings). Let humans make the final tuning decision. |
+| AF-7 | **Student-level difficulty prediction (this student will get this question wrong)** | The calibration system predicts population-level difficulty (what fraction of students will answer incorrectly). Individual prediction requires a student ability model (like IRT theta), which is a separate system. Conflating the two produces unreliable results. | Use mastery (from `MasteryCalculator`) for individual-level assessment. Use calibration for question-level difficulty estimation. Combine them at exam assembly time, not at prediction time. |
+
+## Feature Dependencies
+
+```
+TS-1 (Backtesting)
+  |
+  +---> TS-3 (Calibrated difficulty in assembly) [requires PASS gate]
+  |       |
+  |       +---> TS-4 (Distribution enabled by default) [safe after calibrated assembly]
+  |       |
+  |       +---> D-1 (Auto difficulty category) [needs calibrated assembly working]
+  |       |
+  |       +---> D-7 (Student confidence indicator) [needs calibrated assembly]
+  |
+  +---> D-2 (Shadow mode) [can run in parallel with TS-3]
+
+TS-2 (Scale normalization)
+  |
+  +---> TS-3 (Calibrated difficulty in assembly) [needs consistent scale]
+
+TS-5 (Coverage metrics)  [independent]
+TS-6 (Confidence indicator) [independent, uses existing data]
+
+D-3 (Health dashboard) ---> D-4 (Drift detection) [dashboard provides baseline]
+D-5 (Outlier quarantine) [independent, can build anytime]
+D-6 (Per-kp breakdown) [independent, extends existing analyzer]
+```
+
+Critical path: **TS-1 -> TS-2 -> TS-3 -> TS-4 -> D-1**
+
+This is the minimum path to the closed loop described in the project vision ("answer -> calibrate -> precise exam -> re-answer").
+
+## MVP Recommendation
+
+**Phase 1 -- Validation and Foundation (must ship first)**
+1. TS-1: Historical data backtesting -- proves the algorithm works
+2. TS-2: Difficulty scale normalization -- fixes the correctness bug
+3. TS-5: Calibration coverage metrics -- shows what percentage is calibrated
+4. D-2: Shadow mode logging -- reveals real-world calibration impact
+
+**Phase 2 -- Production Integration (ship after Phase 1 PASS)**
+5. TS-3: Wire calibrated difficulty into assembly
+6. TS-4: Enable difficulty distribution by default
+7. TS-6: Per-question confidence indicator
+8. AF-5 enforcement: Minimum sample size before using calibrated value
+
+**Phase 3 -- Adaptive Intelligence (ship after Phase 2 stable)**
+9. D-1: Mastery-based difficulty category auto-recommendation
+10. D-5: Outlier quarantine with human review
+11. D-7: Student-facing confidence indicator
+
+**Defer:**
+- D-3 (Health dashboard): Valuable but not on the critical path. Can be built incrementally. The existing `getHealthScaleForType()` provides the raw data; a dashboard is a presentation layer.
+- D-4 (Drift detection): Requires accumulated health snapshots. Needs D-3 running for at least 30 days first.
+- D-6 (Per-kp breakdown): Nice-to-have analytics. Not blocking any other feature.
+- AF-6 (Auto parameter tuning): Explicitly out of scope. Human decision.
+
+## Competitive Feature Comparison
+
+How this system's features compare to established platforms:
+
+| Feature | ALEKS | IXL | Khan Academy | This System |
+|---------|-------|-----|-------------|-------------|
+| Difficulty calibration method | Knowledge Space Theory (probabilistic knowledge states) | Proprietary adaptive algorithm | IRT-based mastery | Stratified residual empirical Bayes |
+| Validation approach | Billions of data points, research papers | Real-time diagnostic validation | A/B testing, mastery prediction accuracy | **Backtesting (to build)** |
+| Difficulty granularity | Per-topic per-student | Per-skill per-student | Per-exercise per-student | Per-question (calibrated), per-kp per-student (mastery) |
+| Adaptive exam assembly | Full CAT (selects next item based on current ability estimate) | Adaptive practice (not exam) | Mastery-based progression | Static assembly with difficulty distribution (pre-exam) |
+| Health monitoring | Implicit (built into KST algorithm) | Real-Time Diagnostic, Trouble Spots | Internal accuracy metrics | **Health scale (exists), Dashboard (to build)** |
+| Student-facing feedback | "Ready to learn" / "Not ready" indicators | Diagnostic strand analysis, progress reports | Mastery progress bars, streak counters | **Calibration badge (to build)** |
+| Admin/teacher analytics | Learning progress, knowledge pie chart | Live Classroom, Trouble Spots, reports | Coach dashboard, assignment analytics | **CLI report (exists), Dashboard (to build)** |
+
+Key insight: This system's unique advantage is the dual-difficulty model (original + calibrated overlay) with an explicit validation gate. No other platform exposes calibration confidence or separates original vs calibrated difficulty. This transparency is a differentiator for institutional customers who need to audit and trust the algorithm.
+
+## Sources
+
+- Direct codebase analysis: `QuestionDifficultyCalibrationService` (974 lines), `QuestionDifficultyCalibrationAnalyzer`, `QuestionDifficultyResolver`, `DifficultyDistributionService`, `MasteryCalculator`, `IntelligentExamController`, `ExamAnswerAnalysisService`, `LearningAnalyticsService` (HIGH confidence)
+- ALEKS: Knowledge Space Theory, billions of data points, 90%+ success rate for ready-to-learn concepts (web research, MEDIUM confidence)
+- IXL: Real-Time Diagnostic, Diagnostic Strand Analysis, Trouble Spots, Live Classroom reports (web research, MEDIUM confidence)
+- Khan Academy: Mastery learning system, progress tracking, internal accuracy validation (training data, MEDIUM confidence)
+- IRT validation methodology: Brier score, log-loss, Pearson correlation for item parameter validation (training data + Wikipedia, HIGH confidence -- standard psychometrics)
+- ARCHITECTURE.md research file: Component boundaries, data flows, build order (HIGH confidence -- derived from codebase)

+ 387 - 0
.planning/research/PITFALLS.md

@@ -0,0 +1,387 @@
+# Domain Pitfalls: K12 Math Difficulty Calibration & Intelligent Exam
+
+**Domain:** Educational assessment -- difficulty calibration validation, pipeline wiring, adaptive exam assembly
+**Researched:** 2026-04-16
+**Confidence:** HIGH (codebase analysis + established IRT/calibration domain knowledge)
+
+---
+
+## Critical Pitfalls
+
+Mistakes that cause rewrites, invalid results, or systemic failure in production.
+
+### Pitfall 1: Validating Calibration Against the Same Data That Produced It (Circular Validation)
+
+**What goes wrong:** Running `recalibrateQuestionIds()` on historical data, then computing Brier score / error rate against those same paper_questions rows, produces artificially good metrics. The algorithm was trained on this data; of course it fits.
+
+**Why it happens:** The existing `AnalyzeQuestionDifficultyCalibrationCommand` reports `calibration_gap = bank_difficulty_normalized - empirical_error_rate`. This measures agreement between original difficulty and observed rates, NOT whether the calibrated difficulty predicts unseen data. The existing report gives a false sense of validation.
+
+**Consequences:** Calibrated difficulty goes into production, but on new student data the predictions are no better (or worse) than the original difficulty. The entire calibration effort is wasted, and worse, it introduces noise into the exam assembly pipeline.
+
+**Prevention:**
+- Split historical data temporally: calibrate on data before a cutoff date, validate Brier score / log-loss on data after the cutoff (see the sketch after this list).
+- Use walk-forward validation: for each month, calibrate using all prior data, predict the next month's outcomes.
+- The metric that matters is "does calibrated difficulty predict out-of-sample error rate better than original difficulty?" -- not in-sample fit.
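+
+A minimal sketch of the temporal split from the first prevention point (the attempt-row shape and field names are assumptions):
+
+```php
+// Partition attempts by date; calibrate on $train, score Brier/log-loss on $test.
+function temporalSplit(array $attempts, string $cutoffDate): array
+{
+    $train = [];
+    $test  = [];
+    foreach ($attempts as $attempt) {
+        if ($attempt['answered_at'] < $cutoffDate) { // 'Y-m-d' strings compare correctly
+            $train[] = $attempt;
+        } else {
+            $test[] = $attempt;
+        }
+    }
+    return [$train, $test];
+}
+```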
+
+**Warning signs:**
+- Validation Brier score is suspiciously good (below 0.15 for a heterogeneous K12 population).
+- No mention of train/test split in the validation plan.
+- Validation is done by running the existing report command with different flags on the same dataset.
+
+**Phase:** Phase 1 (Calibration Validation) -- this is the single most important thing to get right.
+
+---
+
+### Pitfall 2: Dual Difficulty Standard (0-1 vs 0-5) Silent Corruption
+
+**What goes wrong:** The codebase has two difficulty scales. `questions.difficulty` stores raw values that may be 0-1 OR 0-5 depending on when and how the question was created. The `normalizeDifficultyValue()` method in `QuestionDifficultyCalibrationService` tries to detect this (line 948: `if ($raw > 1.0) { $raw = $raw / 5.0; }`), but this heuristic is fragile and wrong in specific cases:
+
+- A difficulty of exactly 1.0 is ambiguous: is it the maximum of the 0-1 scale or a low value (0.2 after normalization) on the 0-5 scale? The code treats it as 0-1, which may be wrong.
+- A difficulty of 0.6 on the 0-5 scale would NOT trigger normalization (0.6 < 1.0), but the true normalized value should be 0.12.
+- The 0-5 values are not always integers -- some may be 0.6, 2.5, 3.7, etc. -- so magnitude alone cannot reliably distinguish the two scales.
+
+**Why it happens:** Legacy data entry used different conventions at different times. No migration normalized the raw values when the system was built.
+
+**Consequences:**
+- Questions with 0-5 raw difficulty that happen to be < 1.0 get treated as if they are on the 0-1 scale: a question entered as 0.6 on the 0-5 scale (true normalized value 0.12) is consumed as 0.6, five times harder than intended.
+- Calibration baselines computed from mixed-scale data are meaningless because the "original difficulty" column contains inconsistent scales.
+- DifficultyDistributionService bucket boundaries (0.25, 0.5, 0.75) misclassify questions.
+
+**Prevention:**
+- Before ANY validation or calibration, audit the `questions.difficulty` distribution. If there is a bimodal distribution (cluster near 0-1 AND cluster near 0-5), the dual-standard problem exists (see the histogram sketch after this list).
+- Add a `difficulty_scale` column or flag to `questions` table indicating which scale was used, OR do a one-time migration to normalize all values to 0-1.
+- Add a guard in `normalizeDifficultyValue()`: if the question was created before a known cutoff date, assume 0-5 scale.
+- Log when normalization triggers so you can count how many questions hit the ambiguous zone.
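+
+A hypothetical audit query for the first prevention point, assuming Laravel's query builder:
+
+```php
+use Illuminate\Support\Facades\DB;
+
+// Bucket raw difficulties into 0.5-wide bins; two separated clusters
+// (mass in [0, 1] AND mass in [1, 5]) confirm the dual-standard problem.
+$histogram = DB::table('questions')
+    ->selectRaw('FLOOR(difficulty * 2) / 2 AS bucket, COUNT(*) AS n')
+    ->groupBy('bucket')
+    ->orderBy('bucket')
+    ->pluck('n', 'bucket');
+```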
+
+**Warning signs:**
+- Distribution of `questions.difficulty` shows values like 2.0, 3.5, 4.0 (clearly 0-5 scale) mixed with 0.1, 0.5, 0.8.
+- Calibrated difficulty delta is systematically positive or negative for a batch of questions (suggesting original difficulty was systematically wrong).
+- `DifficultyDistributionService` bucket counts look wrong: almost all questions in one bucket.
+
+**Phase:** Phase 1 (must resolve BEFORE validation can be trusted).
+
+---
+
+### Pitfall 3: Low Sample Size Questions Dominating Calibration Results
+
+**What goes wrong:** The algorithm has a minimum weighted_attempts threshold of 8 for batch mode (line 658) and uses Bayesian shrinkage, but the online mode has NO minimum -- it always updates. Questions with 1-2 responses get calibrated difficulty that is essentially random noise, but the shrinkage prior (original difficulty) may itself be wrong (see Pitfall 2). The system then treats this calibrated value as authoritative.
+
+**Why it happens:** Each new grading event triggers `updateOnlineFromPaper()`, which updates calibrated difficulty for every question on that paper regardless of sample size. The adaptive step limiting (`maxStep`) mitigates large jumps, but cumulative small biases from low-sample updates still accumulate.
+
+**Consequences:**
+- A question answered correctly once by a strong student gets a calibrated difficulty lower than it should be.
+- When this question is then used in exam assembly, it may be classified in the wrong difficulty bucket.
+- If the student later answers it incorrectly, the difficulty swings back. Oscillation without convergence.
+
+**Prevention:**
+- Add a `sample_confidence` field to the calibration table: `sqrt(weighted_attempts / 80)`. Use this downstream.
+- In `QuestionDifficultyResolver::applyCalibratedDifficulty()`, only apply calibrated difficulty when `weighted_attempts >= min_threshold` (suggest 15-20 for production use). Fall back to original difficulty otherwise (sketched after this list).
+- Track the percentage of questions in the pool that have sufficient calibration data. If < 30%, the calibration system cannot be trusted for exam assembly.
+- Report sample size distribution alongside calibration metrics.
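+
+A sketch of the resolver-side guard from the second prevention point (the row shape and the threshold of 15 are assumptions):
+
+```php
+// Return the calibrated value only when it rests on enough data;
+// null tells the caller to fall back to the normalized original difficulty.
+function usableCalibratedDifficulty(?array $calibration, float $minAttempts = 15.0): ?float
+{
+    if ($calibration === null) {
+        return null;
+    }
+    if (($calibration['weighted_attempts'] ?? 0.0) < $minAttempts) {
+        return null;
+    }
+    return $calibration['calibrated_difficulty'];
+}
+```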
+
+**Warning signs:**
+- Median `weighted_attempts` across calibrated questions is below 10.
+- Large fraction (>40%) of questions in the calibration table have `attempts < 5`.
+- Difficulty delta distribution has heavy tails (many questions moving by > 0.2).
+
+**Phase:** Phase 1 (validation must include sample size audit), Phase 2 (resolver must enforce minimum sample threshold).
+
+---
+
+### Pitfall 4: Stratified Baseline Computed From the Same Population Being Calibrated
+
+**What goes wrong:** `buildGlobalBaselines()` computes expected error rates by question_type x difficulty_category from ALL paper_questions data. Then `estimateByStratifiedResidual()` computes residual = observed_error_rate - baseline_error_rate for individual questions. But the baseline was computed FROM those same questions' responses. This creates a regression-to-the-mean artifact:
+
+- Questions whose observed error rate contributed to a high baseline will, by construction, tend to have smaller residuals than they "should."
+- If a question_type has few questions, one question's outliers heavily influence the baseline for that type, biasing residuals for all questions of that type.
+
+**Why it happens:** Proper IRT uses student ability estimates as a conditioning variable (given student ability of theta, what is the probability of error?). This system uses difficulty_category as a proxy for student ability, but difficulty_category is assigned to the PAPER, not the student, and is itself potentially miscalibrated.
+
+**Consequences:**
+- Residuals are biased toward zero for questions in strata with few observations.
+- The calibration algorithm systematically under-corrects for the most common question types and over-corrects for rare types.
+- In extreme cases, a question that every student gets wrong but that is in a "hard" difficulty_category gets a residual near zero (because the baseline already expects high error rate), so its difficulty is not adjusted upward.
+
+**Prevention:**
+- Compute baselines using leave-one-out: for each question, exclude that question's responses from the baseline computation.
+- Alternatively, validate whether the difficulty_category of papers is itself well-calibrated before relying on it for stratification.
+- Report the number of observations per stratum. If any stratum has < 100 observations, flag it as unreliable.
+- Consider whether using student mastery level (already computed by MasteryCalculator) would be a better stratification variable than paper difficulty_category.
+
+**Warning signs:**
+- Baseline error rates for some strata are based on fewer than 50 observations.
+- Calibration delta is near zero for most questions (suggesting baselines are absorbing all the signal).
+- Residual distribution is symmetric and centered at zero even for questions where subject-matter experts know the difficulty is wrong.
+
+**Phase:** Phase 1 (audit baseline computation), Phase 2 (consider alternative stratification).
+
+---
+
+### Pitfall 5: Wiring Calibrated Difficulty Into Exam Assembly Without A/B Testing
+
+**What goes wrong:** After validation, flipping `enable_difficulty_distribution` to true and having `QuestionDifficultyResolver` override all difficulty values in one deployment. If calibration is systematically biased in any direction, ALL exams immediately get worse.
+
+**Why it happens:** The existing code path in `LearningAnalyticsService` has `enable_difficulty_distribution` as a boolean flag checked per-request. The natural "fix" is to set it to true globally. But the code has multiple assembly paths (diagnostic, practice, mistake, textbook, knowledge_points), each with different fallback behavior.
+
+**Consequences:**
+- If calibrated difficulty is systematically 0.05 lower than true difficulty, all exams become slightly too hard. Students struggle, completion rates drop, and the feedback loop (calibration absorbs these results) reinforces the bias.
+- Some assembly paths may not use `QuestionDifficultyResolver` at all, creating inconsistency between exam types.
+
+**Prevention:**
+- Roll out calibrated difficulty to ONE assembly type first (e.g., practice) with a feature flag.
+- Run A/B comparison: for identical student/context parameters, generate exams with original vs calibrated difficulty and compare outcome distributions.
+- Add logging: for every exam assembled, log which difficulty source was used and the difficulty distribution of selected questions.
+- Monitor exam outcome metrics after rollout: average score, completion rate, time-per-question. If any metric degrades by > 5%, auto-revert.
+
+**Warning signs:**
+- No per-assembly-type rollout plan.
+- No logging of `difficulty_source` in assembled exam metadata.
+- `enable_difficulty_distribution` is a single boolean controlling all paths.
+
+**Phase:** Phase 2 (pipeline wiring), requires feature flags and monitoring.
+
+---
+
+### Pitfall 6: Boundary Effects in Difficulty Distribution Buckets
+
+**What goes wrong:** `DifficultyDistributionService::classifyQuestionByDifficulty()` uses strict boundary comparisons, and the same bucket covers different ranges per category:
+- Category 2: `difficulty >= 0.25 && difficulty <= 0.5` goes to `primary_medium`
+- Category 1: `difficulty >= 0 && difficulty <= 0.25` goes to `primary_medium`
+
+A question at exactly 0.25 is `primary_medium` in both, but for category 1 a question at 0.25001 falls out of `primary_medium` into the next bucket. Tiny calibration changes (0.001) can therefore shift questions between buckets, making the assembled exam's difficulty profile dramatically different.
+
+**Why it happens:** Boundary values are hard-coded without hysteresis or soft boundaries. The calibrated difficulty is stored with 4 decimal places, making boundary crossings likely.
+
+**Consequences:**
+- A 0.001 change in calibrated difficulty can move a question from "primary" to "other" bucket.
+- Exams assembled right after a calibration run may have very different difficulty profiles than exams assembled right before, even for the same student.
+- The `getSupplementOrder()` fallback logic kicks in differently depending on boundary crossings, introducing non-deterministic exam composition.
+
+**Prevention:**
+- Add margin/overlap to bucket boundaries. A question at 0.24-0.26 should be eligible for BOTH adjacent buckets, with a probability proportional to distance from boundary center (see the sketch after this list).
+- Alternatively, when selecting questions for a bucket, include questions within a buffer zone (e.g., +/- 0.03 from the boundary) and randomly select from the expanded pool.
+- Log bucket population counts before and after calibration to detect boundary-driven shifts.
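+
+A sketch of the buffer-zone idea (the helper and the bucket ranges in the example are illustrative):
+
+```php
+// A question within $buffer of a bucket edge is eligible for BOTH adjacent
+// buckets, so a 0.001 calibration change cannot flip the exam profile.
+function eligibleBuckets(float $difficulty, array $buckets, float $buffer = 0.03): array
+{
+    $eligible = [];
+    foreach ($buckets as $name => [$lo, $hi]) {
+        if ($difficulty >= $lo - $buffer && $difficulty <= $hi + $buffer) {
+            $eligible[] = $name;
+        }
+    }
+    return $eligible;
+}
+
+// eligibleBuckets(0.251, ['primary_medium' => [0.0, 0.25], 'primary_high' => [0.25, 0.5]])
+// => ['primary_medium', 'primary_high'] -- selection then draws from the expanded pool.
+```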
+
+**Warning signs:**
+- Before/after calibration, the same question pool produces exams with noticeably different difficulty distributions.
+- Many questions have difficulty values clustering at bucket boundaries (0.25, 0.5, 0.75).
+- `groupQuestionsByDifficultyRange()` returns very uneven bucket sizes.
+
+**Phase:** Phase 2 (when wiring distribution service), Phase 3 (when tuning difficulty_category recommendation).
+
+---
+
+### Pitfall 7: Time Decay Creating Recency Bias in Calibration
+
+**What goes wrong:** The 45-day half-life decay means responses older than ~6 months (four half-lives) retain only about 6% weight. For K12 math, this creates a systematic bias: questions that were easier in earlier grades (when students were learning the concept) appear harder than they really are for the current cohort, because the only remaining data is from students who struggled recently.
+
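+Assuming the decay is a standard exponential half-life (the exact form inside the service may differ), the weight of a response aged Δt days under half-life h is:
+
+```latex
+w(\Delta t) = 2^{-\Delta t / h}
+% h = 45  days: w(180) = 2^{-4}   \approx 0.06 -- a semester-old response is nearly gone
+% h = 150 days: w(180) = 2^{-1.2} \approx 0.44 -- a full semester stays relevant
+```
+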
+**Why it happens:** The algorithm is designed for "dynamic" difficulty that responds to recent trends. But K12 math content has strong grade-level alignment -- a question appropriate for Grade 7 will ALWAYS be answered by Grade 7 students. The time decay doesn't differentiate between "the question got harder" and "we're seeing different students."
+
+**Consequences:**
+- At the start of a new semester, calibration data is sparse (few recent responses), so calibrated difficulty reverts toward original (potentially wrong) values.
+- Questions used primarily for review (appearing in diagnostic/review exams) have time-decayed away most of their easy responses, making them appear harder than they are.
+- Seasonal patterns (exam periods vs. vacation) create oscillation in calibrated difficulty.
+
+**Prevention:**
+- Consider whether time decay is appropriate at all for K12 content. The underlying difficulty of a math question does not change over time in the way that, say, a sports prediction model's inputs would.
+- If keeping time decay, extend the half-life to 120-180 days to span a full semester.
+- Add a minimum data window: only apply decay-adjusted calibration if there are at least N responses within the decay window. Otherwise, use the full-history estimate.
+- Monitor calibrated difficulty drift over time. If difficulty trends correlate with calendar time rather than with changes in student population, the decay is causing harm.
+
+**Warning signs:**
+- Calibrated difficulty for stable questions drifts systematically over the school year.
+- Questions not used in the last 2 months have calibrated difficulty reverting toward 0.5 (the prior).
+- Health scale is frequently below 0.8, indicating the algorithm itself detects instability.
+
+**Phase:** Phase 1 (validate with and without time decay to see which produces better out-of-sample predictions).
+
+---
+
+### Pitfall 8: Health Monitor Degeneracy -- Bad Predictions Cause Self-Reinforcing Caution
+
+**What goes wrong:** The `getHealthScaleForType()` method reduces the step size when recent Brier scores are worsening. But if the INITIAL calibration was wrong (bad original difficulty), then:
+1. The first predictions have high Brier scores (wrong difficulty -> wrong error rate prediction).
+2. Health monitor sees worsening and reduces `health_scale` (multiplies step by 0.78 or 0.82).
+3. With smaller steps, calibration converges slower toward the true difficulty.
+4. Brier scores remain high because convergence is slow.
+5. Health monitor further reduces step size.
+
+This creates a death spiral where the system becomes too cautious to ever self-correct.
+
+**Why it happens:** The health monitor compares `brier_after` vs `brier_before` per event. If the calibration is already wrong, both "before" and "after" are bad, but "after" can be slightly worse due to noise. The cumulative delta being positive triggers the 0.78 multiplier.
+
+**Consequences:**
+- Questions with badly wrong original difficulty get stuck near the wrong value because the health monitor prevents aggressive correction.
+- The system appears "stable" (small deltas) but is actually stuck at wrong values.
+- The health_scale cache (5-minute TTL) means a few bad events can suppress correction for an extended period.
+
+**Prevention:**
+- Add a minimum health_scale floor higher than 0.45 (current floor). Suggest 0.6 minimum (the floor-and-reset guard is sketched after this list).
+- Only activate health monitoring after a minimum number of events (current threshold is 80, which is reasonable, but consider requiring per-question rather than per-type).
+- Include a "reset" mechanism: if health_scale stays below 0.7 for more than 14 days, force a full recalibration from scratch (ignoring previous calibrated values).
+- Track the distribution of health_scale values. If most types are below 0.7, the system is in a degenerate state.
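+
+A sketch of the floor-and-reset guard (the cache key and `RecalibrateQuestionTypeJob` are hypothetical):
+
+```php
+use Illuminate\Support\Facades\Cache;
+
+$scale = max(0.6, $rawHealthScale); // raised floor; current code floors at 0.45
+
+$key = "calibration:degenerate_since:{$questionType}";
+if ($scale < 0.7) {
+    $since = Cache::get($key) ?? now();
+    Cache::put($key, $since);
+    if (now()->diffInDays($since) > 14) {
+        // Stuck in the cautious regime: recalibrate from scratch,
+        // ignoring previously calibrated values.
+        dispatch(new RecalibrateQuestionTypeJob($questionType));
+        Cache::forget($key);
+    }
+} else {
+    Cache::forget($key);
+}
+```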
+
+**Warning signs:**
+- Health scale for most question types is at or near the 0.45 floor.
+- Calibrated difficulty delta (calibrated - original) distribution is narrow, suggesting the algorithm is barely adjusting anything.
+- The calibration was "validated" but exam quality metrics haven't improved.
+
+**Phase:** Phase 1 (audit health monitor behavior during validation), Phase 2 (tune parameters before production).
+
+---
+
+## Moderate Pitfalls
+
+### Pitfall 9: Online vs Batch Mode Inconsistency
+
+**What goes wrong:** The batch mode (`estimateByStratifiedResidual`) and online mode (`estimateOnlineBySingleOutcome`) use different logic:
+- Batch mode: step limit of 0.0 when weighted_attempts < 8 (no adjustment at all).
+- Online mode: always adjusts, with `maxStep = 0.30 * (0.35 + 0.65 * confidence) * healthScale`. At weighted_attempts = 1, confidence = 0.0125, giving maxStep ~ 0.11 -- a non-trivial adjustment from a single data point.
+
+A question that gets batch-recalibrated with 7 responses gets NO adjustment. The same question that gets 7 online updates gets 7 incremental adjustments. The final calibrated difficulty can differ significantly.
+
+**Prevention:** Align the minimum sample thresholds between modes. Either both modes should require 8+ weighted attempts before any adjustment, or both should allow incremental updates. Document which mode is the source of truth.
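+
+A minimal shared-threshold sketch (the class name is hypothetical) so both modes gate on the same evidence floor:
+
+```php
+final class CalibrationThresholds
+{
+    public const MIN_WEIGHTED_ATTEMPTS = 8.0;
+
+    public static function canAdjust(float $weightedAttempts): bool
+    {
+        return $weightedAttempts >= self::MIN_WEIGHTED_ATTEMPTS;
+    }
+}
+
+// Both estimateByStratifiedResidual() and estimateOnlineBySingleOutcome()
+// would call CalibrationThresholds::canAdjust() before applying any step.
+```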
+
+**Phase:** Phase 1 (resolve during validation -- both modes should produce similar results on the same data).
+
+---
+
+### Pitfall 10: Question Pool Exhaustion When Difficulty Distribution Is Enabled
+
+**What goes wrong:** `DifficultyDistributionService` defines narrow ranges for each category. Category 0 requires 90% of questions with difficulty 0-0.1. If the question pool for a given knowledge point + question_type has very few questions in that range, the assembly either:
+1. Falls back to `getSupplementOrder()` and fills with "other" bucket questions (defeating the difficulty targeting), or
+2. Fails to assemble enough questions and returns a partial exam.
+
+**Prevention:** Before enabling difficulty distribution, analyze the question pool by knowledge point to verify sufficient coverage at each difficulty level. If coverage is sparse, either:
+- Relax the distribution percentages for sparse knowledge points.
+- Expand bucket boundaries when the pool is small.
+- Log "pool exhaustion" events and track them as a KPI.
+
+**Phase:** Phase 2 (before enabling `enable_difficulty_distribution` by default).
+
+---
+
+### Pitfall 11: Ignoring Question Type Heterogeneity in Difficulty Perception
+
+**What goes wrong:** Choice questions (multiple choice) have a baseline ~25% correct-by-guessing rate. Fill-in-the-blank and open-ended questions have no guessing bonus. The calibration algorithm treats "is_correct" the same across all types. A choice question with calibrated difficulty 0.5 does NOT have the same "true difficulty" as a fill-in question with calibrated difficulty 0.5.
+
+**Prevention:**
+- Stratify by question_type (which the system already does for baselines) but also consider applying a guessing correction to choice questions' error rates before calibration (the standard identity is shown after this list).
+- When matching difficulty to student level, account for question type: a student needs higher mastery to answer a difficulty-0.5 open-ended question than a difficulty-0.5 multiple-choice question.
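+
+The standard correction for guessing, assuming blind guessing at rate g (about 0.25 for four-option choice questions):
+
+```latex
+P(\text{correct}) = g + (1 - g)\,P(\text{know})
+\;\Rightarrow\;
+e_{\text{true}} = \frac{e_{\text{obs}}}{1 - g}
+% g = 0.25: an observed error rate of 0.45 corrects to 0.45 / 0.75 = 0.60
+```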
+
+**Phase:** Phase 3 (when building the mastery-to-difficulty_category recommendation).
+
+---
+
+### Pitfall 12: Mastery-to-Difficulty Category Mapping Without Ground Truth
+
+**What goes wrong:** The project plans to map student mastery (0-1 continuous value from MasteryCalculator) to difficulty_category (0-4 discrete levels). Without empirical validation, the mapping will be based on intuition. Common mistakes:
+- Mapping mastery 0.8 to difficulty_category 1 (too easy for the student).
+- Using a linear mapping when the relationship between mastery and appropriate difficulty is non-linear (zone of proximal development suggests students learn best at difficulty slightly above their current mastery).
+
+**Prevention:**
+- Use historical data to find the optimal mapping: for each (mastery_level, difficulty_category) pair, what was the average score? The sweet spot is where the student gets 60-75% correct (zone of proximal development). A sketch of this search follows the list.
+- Validate the mapping by checking if exams assembled using the mapping produce the target score range (60-75%).
+- Allow the mapping to vary by grade level -- what is "appropriate challenge" differs for Grade 3 vs Grade 10.
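+
+A sketch of the empirical mapping over historical outcomes (the row shape is hypothetical; in practice it comes from joining papers, scores, and mastery snapshots):
+
+```php
+// $history rows: ['mastery' => 0-1 float, 'category' => 0-4 int, 'score' => 0-1 float].
+// For each mastery quintile, pick the category whose historical mean score
+// lands closest to the zone-of-proximal-development target (midpoint of 60-75%).
+function recommendCategoryPerQuintile(array $history, float $target = 0.675): array
+{
+    $cells = []; // [quintile][category] => [scoreSum, count]
+    foreach ($history as $row) {
+        $q = min(4, (int) floor($row['mastery'] * 5));
+        $cells[$q][$row['category']][0] = ($cells[$q][$row['category']][0] ?? 0.0) + $row['score'];
+        $cells[$q][$row['category']][1] = ($cells[$q][$row['category']][1] ?? 0) + 1;
+    }
+
+    $mapping = [];
+    foreach ($cells as $q => $byCategory) {
+        $best = null; // [category, distance from target]
+        foreach ($byCategory as $category => [$sum, $count]) {
+            $gap = abs($sum / $count - $target);
+            if ($best === null || $gap < $best[1]) {
+                $best = [$category, $gap];
+            }
+        }
+        $mapping[$q] = $best[0];
+    }
+    ksort($mapping);
+    return $mapping; // e.g. [0 => 0, 1 => 1, 2 => 1, 3 => 2, 4 => 3]
+}
+```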
+
+**Phase:** Phase 3 (mastery-to-difficulty recommendation).
+
+---
+
+### Pitfall 13: Calibration Feedback Loop Divergence
+
+**What goes wrong:** Once calibrated difficulty drives exam assembly, the calibration also absorbs the outcomes of those exams. If the calibration overestimates difficulty (marks questions as harder than they are), the system assigns them to higher difficulty_category exams where students are stronger. Stronger students answer correctly, causing the calibration to further lower the difficulty. The system oscillates.
+
+**Prevention:**
+- Include student ability (mastery level) as a covariate in the calibration algorithm. A question answered correctly by a high-mastery student provides different information than one answered correctly by a low-mastery student.
+- Track "calibration drift" over time: compare the distribution of calibrated difficulties at time T to time T+30d. If the distribution is shifting systematically, the feedback loop may be diverging.
+- Add a "ground truth" anchor: periodically have subject-matter experts rate a sample of questions. Compare expert ratings to calibrated values. If they diverge, increase shrinkage toward the expert prior.
+
+**Phase:** Phase 2 (monitor after wiring), Phase 3 (add student ability as covariate).
+
+---
+
+## Minor Pitfalls
+
+### Pitfall 14: Algorithm Meta JSON Bloat
+
+**What goes wrong:** The `algorithm_meta` JSON column stores `recent_events` (up to 30 events per question). With thousands of calibrated questions, this column grows rapidly. Each online update reads and re-writes the full JSON. Over months, this table becomes the largest in the database by storage, and queries slow down.
+
+**Prevention:** Move `recent_events` to a separate table (one row per event) or cap the JSON size more aggressively. Consider dropping event-level detail after 30 days and keeping only aggregated metrics.
+
+**Phase:** Phase 2 (before heavy production use of online mode).
+
+---
+
+### Pitfall 15: Race Condition in Online Calibration Updates
+
+**What goes wrong:** Two simultaneous grading events for the same question (e.g., two students submit papers at the same time) both read the same `prev_difficulty` from the calibration table, compute their updates independently, and the second write overwrites the first. The net effect is that one update is lost.
+
+The current `upsert` on `question_bank_id` is atomic at the row level, but the read-compute-write cycle in `updateOnlineFromPaper()` is NOT atomic. Between reading `existing` (line 116) and writing `upserts` (line 212), another process can update the same row.
+
+**Prevention:** Use `SELECT ... FOR UPDATE` or database-level locks when reading existing calibration data for questions that are about to be updated. Alternatively, use an incremental approach: `UPDATE ... SET weighted_attempts = weighted_attempts + X, weighted_wrong = weighted_wrong + Y` instead of computing the new values in PHP.
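+
+Both fixes sketched in Laravel terms (the delta variables are hypothetical):
+
+```php
+use Illuminate\Support\Facades\DB;
+
+// (a) Row lock around the read-compute-write cycle:
+DB::transaction(function () use ($questionBankId) {
+    $existing = DB::table('question_difficulty_calibrations')
+        ->where('question_bank_id', $questionBankId)
+        ->lockForUpdate()
+        ->first();
+    // ...compute new calibrated values from $existing, then update the same row...
+});
+
+// (b) Push the arithmetic into SQL so concurrent updates compose:
+DB::table('question_difficulty_calibrations')
+    ->where('question_bank_id', $questionBankId)
+    ->update([
+        'weighted_attempts' => DB::raw('weighted_attempts + ' . (float) $attemptsDelta),
+        'weighted_wrong'    => DB::raw('weighted_wrong + ' . (float) $wrongDelta),
+    ]);
+```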
+
+**Phase:** Phase 2 (before production deployment of online mode at scale).
+
+---
+
+### Pitfall 16: Default Difficulty (0.5) Contaminating Calibration
+
+**What goes wrong:** Questions without a set difficulty default to 0.5 in `hydrateQuestions()` (line 723: `'difficulty' => isset($question['difficulty']) ? (float) $question['difficulty'] : 0.5`). When these questions enter calibration, `original_difficulty` becomes 0.5. With Bayesian shrinkage toward 0.5 (the Beta(2,2) prior mode), these questions' calibrated difficulty will be pulled toward 0.5 regardless of actual difficulty.
+
+**Prevention:** Distinguish between "explicitly set to 0.5" and "unset, defaulted to 0.5." Only apply shrinkage toward the prior for questions where the original difficulty was explicitly set. For unset questions, use the empirical error rate directly (with wider confidence intervals).
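+
+A sketch of provenance-preserving hydration (the `difficulty_is_default` field is hypothetical):
+
+```php
+// In hydrateQuestions(): keep "unset" distinguishable instead of silently
+// defaulting to 0.5, so shrinkage toward the prior can be skipped for it.
+$hydrated = [
+    // ...other fields...
+    'difficulty'            => isset($question['difficulty']) ? (float) $question['difficulty'] : null,
+    'difficulty_is_default' => !isset($question['difficulty']),
+];
+```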
+
+**Phase:** Phase 1 (data audit should count questions with default difficulty).
+
+---
+
+## Phase-Specific Warnings
+
+| Phase Topic | Likely Pitfall | Mitigation | Severity |
+|-------------|---------------|------------|----------|
+| Calibration validation | Circular validation (Pitfall 1) | Temporal train/test split | CRITICAL |
+| Data audit | Dual difficulty standard (Pitfall 2) | One-time normalization + flag column | CRITICAL |
+| Data audit | Low sample size (Pitfall 3) | Report sample distribution; set minimum threshold | CRITICAL |
+| Algorithm audit | Baseline self-reference (Pitfall 4) | Leave-one-out baselines; audit stratum sizes | HIGH |
+| Algorithm audit | Time decay appropriateness (Pitfall 7) | Compare with/without decay in validation | HIGH |
+| Algorithm audit | Health monitor degeneracy (Pitfall 8) | Raise floor to 0.6; add reset mechanism | HIGH |
+| Pipeline wiring | No A/B testing (Pitfall 5) | Feature flag per assembly type; monitor metrics | CRITICAL |
+| Pipeline wiring | Boundary effects (Pitfall 6) | Soft boundaries or buffer zones | HIGH |
+| Pipeline wiring | Mode inconsistency (Pitfall 9) | Align thresholds between batch and online | MEDIUM |
+| Pipeline wiring | Pool exhaustion (Pitfall 10) | Pre-analyze coverage; log exhaustion events | MEDIUM |
+| Pipeline wiring | Race condition (Pitfall 15) | Row-level locks or incremental updates | MEDIUM |
+| Difficulty recommendation | Type heterogeneity (Pitfall 11) | Guessing correction for choice questions | MEDIUM |
+| Difficulty recommendation | Mastery mapping without ground truth (Pitfall 12) | Use historical data to find optimal mapping | HIGH |
+| Ongoing operations | Feedback loop divergence (Pitfall 13) | Track drift; periodic expert anchoring | HIGH |
+| Ongoing operations | JSON bloat (Pitfall 14) | Separate events table or aggressive capping | LOW |
+| Data quality | Default 0.5 contamination (Pitfall 16) | Distinguish set vs unset difficulty | LOW |
+
+## Validation Checklist Before Each Phase
+
+### Before Phase 1 (Calibration Validation):
+- [ ] Audit `questions.difficulty` distribution for dual-standard evidence
+- [ ] Count questions with default difficulty (0.5) vs explicitly set
+- [ ] Check sample size distribution: what % of questions have >= 10 responses?
+- [ ] Define temporal split point for validation
+- [ ] Decide: validate with or without time decay? (Test both)
+
+### Before Phase 2 (Pipeline Wiring):
+- [ ] Verify calibration improved out-of-sample prediction (Phase 1 output)
+- [ ] Set minimum weighted_attempts threshold for QuestionDifficultyResolver
+- [ ] Implement per-assembly-type feature flag
+- [ ] Add difficulty_source logging to all assembly paths
+- [ ] Analyze question pool coverage by difficulty bucket per knowledge point
+
+### Before Phase 3 (Mastery-to-Difficulty Recommendation):
+- [ ] Collect empirical data: for each (mastery_quintile, difficulty_category), what is average score?
+- [ ] Identify zone of proximal development: which difficulty_category produces 60-75% correct for each mastery level?
+- [ ] Check for question type interaction: does the optimal mapping differ for choice vs fill vs open-ended?
+
+## Sources
+
+- Codebase analysis: `QuestionDifficultyCalibrationService.php`, `QuestionDifficultyResolver.php`, `DifficultyDistributionService.php`, `IntelligentExamController.php`, `LearningAnalyticsService.php`
+- IRT/correspondence theory: Lord & Novick "Statistical Theories of Mental Test Scores" (foundational work on item calibration and Bayesian estimation)
+- Adaptive testing design: Wainer "Computerized Adaptive Testing: A Primer" (pitfalls of item pool coverage and difficulty targeting)
+- Zone of proximal development in adaptive systems: Vygotsky-based calibration targets in K12 systems are widely discussed in educational measurement literature
+- Bayesian shrinkage in item calibration: Mislevy & Bock (IRT parameter estimation with informative priors)
+- Confidence level: HIGH for codebase-specific pitfalls (directly observed in code). HIGH for domain pitfalls (IRT/calibration theory is well-established).

+ 236 - 0
.planning/research/STACK.md

@@ -0,0 +1,236 @@
+# Technology Stack
+
+**Project:** Math CMS — Difficulty Calibration & Intelligent Exam
+**Researched:** 2026-04-16
+
+## Recommended Stack
+
+### Core Framework (existing — no changes)
+
+| Technology | Version | Purpose | Why |
+|------------|---------|---------|-----|
+| Laravel Framework | ^12.0 | Backend application framework | Already in production, entire codebase built on it. No reason to change. |
+| PHP | ^8.2 | Runtime | Already in production. Supports enums, readonly properties, named arguments, fibers — all useful for the statistical code. |
+| MySQL | existing | Primary database | Stores `questions`, `papers`, `paper_questions`, `question_difficulty_calibrations`, `mistake_records`. All calibration data lives here. |
+| Redis + Predis | existing | Cache + queue | Used for baseline caching (`$this->baselineCache`), async job queue for `AssembleExamTaskJob`. |
+
+### Statistical Validation — Primary Addition
+
+| Technology | Version | Purpose | Why |
+|------------|---------|---------|-----|
+| **markrogoyski/math-php** | ^2.13 | Pure-PHP math/statistics library | **The only viable PHP-native option.** Provides correlation, significance testing, probability distributions (Beta, Normal, Binomial, Student's t), descriptive statistics, and ANOVA. No external dependencies. Required for Brier score decomposition, confidence intervals, and backtest validation. |
+
+**Confidence: HIGH** — Verified on Packagist (v2.13.0, actively maintained, 3k+ GitHub stars, pure PHP, no C extensions).
+
+#### What math-php provides that the project needs
+
+1. **Correlation & Significance Testing**
+   - `Correlation::pearson()` — validate calibrated difficulty vs empirical error rate (the analyzer already has a custom Pearson implementation; math-php's version is battle-tested and produces p-values)
+   - `Significance::rCritical()` / `Significance::tpValue()` — determine whether observed correlations are statistically significant, not just numerically large
+   - `Correlation::spearman()` — rank-based correlation, more robust for ordinal difficulty categories (0-4) than Pearson
+
+2. **Descriptive Statistics**
+   - `Descriptive::standardDeviation()`, `Descriptive::mean()`, `Descriptive::median()`, `Descriptive::interquartileRange()` — for bin analysis and distribution sanity checks
+   - `Descriptive::coefficientOfVariation()` — compare calibration stability across question types
+
+3. **Probability Distributions**
+   - `BetaDistribution` — directly supports the Beta(2,2) prior used in `stratified_residual_eb_v2`; enables computing credible intervals around calibrated difficulty
+   - `NormalDistribution` — for confidence interval construction around error rates
+   - `BinomialDistribution` — for computing exact probability of observed correct/wrong counts given hypothesized difficulty
+
+4. **ANOVA**
+   - `ANOVA::oneWay()` — test whether calibration deltas differ significantly across question types or difficulty categories
+
+5. **Regression**
+   - `LinearRegression::create()` — for fitting calibration quality over time (is the system getting better?)
+
+### Testing Infrastructure (existing — extend)
+
+| Technology | Version | Purpose | Why |
+|------------|---------|---------|-----|
+| PHPUnit | ^11.5.3 | Test framework | Already in `require-dev`. Extend with calibration validation tests. |
+| Mockery | ^1.6 | Test mocking | Already in `require-dev`. For mocking DB queries in unit tests. |
+
+### Backtesting Infrastructure (new — build internally)
+
+| Technology | Version | Purpose | Why |
+|------------|---------|---------|-----|
+| Custom Artisan Command | new | `questions:difficulty-backtest` | Historical replay of calibration algorithm. Split answer data chronologically, run algorithm on first N%, measure prediction accuracy on remaining data. No external library — this is domain-specific and should leverage existing `QuestionDifficultyCalibrationAnalyzer`. |
+| Custom PHPUnit Test Suite | new | `tests/Feature/DifficultyCalibrationBacktestTest.php` | Automated regression tests for calibration accuracy. Run in CI to catch algorithm regressions. |
+
+## Alternatives Considered
+
+### Statistical Libraries
+
+| Category | Recommended | Alternative | Why Not |
+|----------|-------------|-------------|---------|
+| PHP math library | math-php ^2.13 | Write custom implementations | The codebase already has custom `pearsonCorrelation()` in the analyzer. math-php is more reliable, handles edge cases (zero-division, small samples), and provides p-values. Rewriting statistical functions is error-prone and wastes time. |
+| PHP math library | math-php ^2.13 | `phpscience/statistics` | Stale, fewer features, lower community adoption. math-php has 10x the feature coverage. |
+| PHP math library | math-php ^2.13 | Call Python/R via shell | Adds runtime dependency on external language, introduces serialization overhead, deployment complexity. The math needed here is not computationally heavy — pure PHP is sufficient. |
+| PHP math library | math-php ^2.13 | Call external stats API | Unnecessary network dependency for what are fundamentally simple statistical computations. Adds latency and failure modes. |
+
+### IRT / CAT Libraries
+
+| Category | Decision | Alternative | Why Not |
+|----------|----------|-------------|---------|
+| Full IRT framework | **Do not adopt** | `irt` R package, `py-irt` Python | The project uses a custom `stratified_residual_eb_v2` algorithm, not standard IRT. Adopting a full IRT framework would mean rewriting the calibration pipeline. Current algorithm works — validate it, don't replace it. |
+| CAT engine | **Do not adopt now** | `catsim` Python, `mirtCAT` R | The exam assembly pipeline (`IntelligentExamController`) already has a working strategy-based approach. CAT requires a fundamentally different architecture (sequential item selection with real-time ability estimation). This is a future consideration, not current scope. |
+| Adaptive testing | **Defer to Phase 3+** | Standard CAT algorithms | The current priority is validating calibration and wiring it into the existing assembly pipeline. Adaptive testing is the natural evolution but depends on validated calibration first. |
+
+### Backtesting Approaches
+
+| Category | Recommended | Alternative | Why Not |
+|----------|-------------|-------------|---------|
+| Chronological split | Build in-house | k-fold cross-validation | Time-series data (student answers) has temporal dependencies. Random shuffling leaks future information. Chronological split respects the time-dependent nature of calibration updates. |
+| Walk-forward validation | Build in-house | Single train/test split | Walk-forward better simulates the real online update pattern (`updateOnlineFromPaper`). The algorithm updates incrementally per grading event -- a single split misses this dynamic. |
+| Brier Score as primary metric | Use existing + extend | MSE / RMSE | Brier score is already implemented in the calibration service (`buildUpdateEvent`). It is a proper scoring rule for probabilistic predictions (difficulty = predicted error probability). MSE treats difficulty as a point estimate, losing the probabilistic interpretation. |
+| Brier Score decomposition | Use math-php | Manual calculation | Decomposition into Uncertainty + Reliability + Resolution requires non-trivial binning and statistics. math-php provides the building blocks; build the decomposition logic on top. |
+
+## Installation
+
+```bash
+# Core statistical library — only new dependency
+composer require markrogoyski/math-php:^2.13
+
+# Dev dependencies (already installed, no changes needed)
+# phpunit/phpunit ^11.5.3
+# mockery/mockery ^1.6
+```
+
+## How Each Research Question Is Addressed
+
+### Q1: Statistical Validation Frameworks for IRT and Difficulty Calibration
+
+**Recommendation: Use math-php for validation metrics, not a full IRT framework.**
+
+The existing `stratified_residual_eb_v2` algorithm is a proprietary hybrid (not standard 1PL/2PL/3PL IRT). Validation should focus on:
+
+1. **Calibration Accuracy Metrics** (all computable with math-php + existing code):
+   - **Brier Score** — already implemented in `buildUpdateEvent()`. Extend with decomposition: Brier = Uncertainty - Resolution + Reliability. Lower reliability = better calibrated. Higher resolution = more discriminating.
+   - **Pearson correlation** (calibrated difficulty vs empirical error rate) — already in analyzer. Add p-value via math-php `Significance::tpValue()`.
+   - **Spearman rank correlation** — add via math-php. More appropriate for the 5-level difficulty categories.
+   - **Calibration-in-the-large** — compare mean predicted difficulty to mean observed error rate. Simple but catches systematic bias.
+   - **Calibration curves** — bin predictions into deciles, plot predicted vs observed error rate. Visual diagnostic built into the backtest command.
+
+2. **Confidence Intervals** (new, using math-php):
+   - Beta posterior credible interval around each calibrated difficulty value — `BetaDistribution` with parameters `(weighted_wrong + 2, weighted_attempts - weighted_wrong + 2)`
+   - If the credible interval is wide (low sample), the calibration should carry a confidence flag that the exam assembly pipeline can use to fall back to original difficulty
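+
+A sketch of the credible-interval check, assuming math-php's continuous Beta distribution exposes an inverse CDF (verify the exact method name against the installed version):
+
+```php
+use MathPHP\Probability\Distribution\Continuous\Beta;
+
+// Posterior for difficulty-as-error-probability under the Beta(2,2) prior.
+$alpha = $weightedWrong + 2;
+$beta  = ($weightedAttempts - $weightedWrong) + 2;
+
+$posterior = new Beta($alpha, $beta);
+$lower = $posterior->inverse(0.05); // 90% credible interval
+$upper = $posterior->inverse(0.95);
+
+// Wide interval => low-confidence calibration; assembly should fall back to
+// original difficulty. The 0.25 width threshold is illustrative.
+$useCalibrated = ($upper - $lower) <= 0.25;
+```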
+
+3. **Health Monitoring** (already partially implemented):
+   - `getHealthScaleForType()` monitors recent Brier score and log-loss trends
+   - Extend with formal statistical process control: track Brier score over rolling windows, flag when it exceeds 2 standard deviations from historical mean
+
+**Confidence: HIGH** — These are standard psychometric validation techniques. math-php covers the computational needs. The existing codebase already implements the data pipeline.
+
+### Q2: Backtesting Approaches
+
+**Recommendation: Chronological walk-forward backtesting using existing data tables.**
+
+The system has historical answer data in `paper_questions` (with `is_correct`, `graded_at`) and `papers` (with `difficulty_category`). This is sufficient for backtesting.
+
+**Approach — Walk-Forward Validation:**
+
+1. **Data preparation**: Query all graded `paper_questions` ordered by `graded_at` ascending
+2. **Temporal split**: Use first 70% chronologically as calibration training data, last 30% as holdout
+3. **Replay**: Run `estimateByStratifiedResidual()` on training data to produce calibrated difficulties
+4. **Evaluate**: For each holdout answer, compute Brier score using the calibrated difficulty as predicted error probability
+5. **Baseline comparison**: Also compute Brier score using original `questions.difficulty` on the same holdout set
+6. **Report**: Brier Skill Score = `(Brier_original - Brier_calibrated) / Brier_original`. Positive = calibration improved predictions. Negative = calibration made things worse.
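+
+The skill-score computation itself needs no library; a plain-PHP sketch over a prepared holdout set:
+
+```php
+// $holdout rows: ['calibrated' => float, 'original' => float, 'wrong' => 0|1],
+// where the difficulty value is used directly as the predicted error probability.
+function brier(array $rows, string $predictionKey): float
+{
+    $sum = 0.0;
+    foreach ($rows as $r) {
+        $sum += ($r[$predictionKey] - $r['wrong']) ** 2;
+    }
+    return $sum / count($rows);
+}
+
+$brierCalibrated = brier($holdout, 'calibrated');
+$brierOriginal   = brier($holdout, 'original');
+
+// Positive = calibration improved predictions; negative = it made them worse.
+$brierSkillScore = ($brierOriginal - $brierCalibrated) / $brierOriginal;
+```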
+
+**Implementation plan:**
+- New Artisan command `questions:difficulty-backtest` with options for `--train-ratio`, `--min-attempts`, `--since`
+- Reuses existing `QuestionDifficultyCalibrationAnalyzer` for per-question aggregation
+- Adds math-php `Descriptive` statistics for computing Brier components, confidence intervals
+- Outputs comparison table + JSON report
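+
+A command skeleton under those assumptions (internals elided; option names mirror the plan above):
+
+```php
+use Illuminate\Console\Command;
+
+// The analyzer's import path follows the existing codebase.
+class DifficultyBacktestCommand extends Command
+{
+    protected $signature = 'questions:difficulty-backtest
+                            {--train-ratio=0.7 : Chronological share used as training data}
+                            {--min-attempts=10 : Minimum graded attempts per question}
+                            {--since= : Only consider answers graded after this date}';
+
+    protected $description = 'Walk-forward backtest of difficulty calibration';
+
+    public function handle(QuestionDifficultyCalibrationAnalyzer $analyzer): int
+    {
+        // 1) Load graded paper_questions ordered by graded_at ascending.
+        // 2) Split chronologically, replay calibration on the training slice.
+        // 3) Score the holdout (Brier, skill score) and emit table + JSON report.
+        return self::SUCCESS;
+    }
+}
+```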
+
+**Key validation criteria:**
+- Brier Skill Score > 0.05 (calibration provides meaningful improvement)
+- Pearson correlation of calibrated difficulty vs empirical error rate > 0.3 (moderate positive relationship)
+- No systematic bias: calibration-in-the-large difference < 0.05
+- Calibration improves for questions with >= 10 attempts (minimum sample for meaningful update)
+
+**Confidence: HIGH** — The data exists in the database. The algorithm is already implemented. This is purely a validation harness.
+
+### Q3: Adaptive Testing Algorithms for Future Phases
+
+**Recommendation: Do not implement CAT now. Understand the landscape for Phase 3+ planning.**
+
+The current exam assembly pipeline uses a **strategy-based approach**: `IntelligentExamController` creates an `AssembleExamTaskJob` which selects questions based on knowledge points, chapters, difficulty categories, and distribution rules. This is a **fixed-form assembly** approach — all questions are determined before the student starts.
+
+**Standard CAT algorithms for reference (future use only):**
+
+1. **Item Selection**: Maximum Fisher Information — select the item that provides the most information about the student's ability at the current theta estimate. Requires IRT item parameters (a, b, c for 3PL model). The current `stratified_residual_eb_v2` difficulty could serve as the b-parameter.
+
+2. **Ability Estimation**:
+   - MLE (Maximum Likelihood Estimation) — no prior, can diverge with all-correct/all-wrong patterns
+   - EAP (Expected A Posteriori) — Bayesian, uses prior distribution. More stable. The existing Beta(2,2) prior in the calibration algorithm is conceptually similar.
+   - MAP (Maximum A Posteriori) — similar to EAP but uses mode instead of mean
+
+3. **Termination Criteria**:
+   - Standard Error threshold (stop when ability estimate precision is sufficient)
+   - Fixed-length (current approach — stop after N questions)
+   - SPRT (Sequential Probability Ratio Test) — stop when enough evidence accumulated to classify into mastery category
+
+4. **Exposure Control**:
+   - Randomesque — randomly select from K best items instead of the single best
+   - Sympson-Hetter — probabilistically control whether a selected item is actually administered
+   - Shadow testing — ensure remaining pool can still produce a valid exam
+
+**Why not CAT now**: CAT requires (1) validated item parameters, (2) real-time ability estimation, (3) sequential item selection. The current system has none of these. Validating calibration first (Phase 1) gives us (1). Then connecting calibration to exam assembly (Phase 2) improves fixed-form quality. CAT (Phase 3+) would require architectural changes to the exam flow.
+
+**Confidence: MEDIUM** — CAT theory is well-established in psychometric literature. Application to this specific K12 math context would need adaptation. The assessment is based on training data + Wikipedia verification of CAT components.
+
+### Q4: PHP/Laravel Libraries for Statistical Analysis
+
+**Recommendation: math-php ^2.13 as the sole addition.**
+
+| Need | Solution | Source |
+|------|----------|--------|
+| Pearson/Spearman correlation | `math-php Correlation` | Built-in |
+| P-values / significance | `math-php Significance` | Built-in |
+| Beta distribution (for credible intervals) | `math-php BetaDistribution` | Built-in |
+| Normal distribution (for confidence intervals) | `math-php NormalDistribution` | Built-in |
+| Binomial distribution (for exact probability tests) | `math-php BinomialDistribution` | Built-in |
+| Descriptive statistics | `math-php Descriptive` | Built-in |
+| ANOVA (category comparison) | `math-php ANOVA` | Built-in |
+| Linear regression (trend analysis) | `math-php LinearRegression` | Built-in |
+| Brier score computation | Custom code (trivial: `mean((predicted - observed)^2)`) | ~5 lines |
+| Brier score decomposition | Custom code using math-php `Descriptive` | ~30 lines |
+| Walk-forward backtest engine | Custom Artisan command | ~200 lines |
+| Calibration curve generation | Custom code in backtest command | ~50 lines |
+
+**No other PHP libraries needed.** The statistical requirements of this project are well within math-php's capabilities. Adding more libraries increases dependency surface area without meaningful benefit.
+
+## What NOT to Use
+
+| Technology | Why Not |
+|------------|---------|
+| Full IRT packages (R `ltm`, Python `py-irt`) | Would require rewriting the custom calibration algorithm. The algorithm works — validate it, don't replace it. |
+| Machine learning frameworks (TensorFlow PHP, PHP-ML) | Overkill for what is fundamentally a statistical estimation problem. PHP-ML has limited statistical tools and poorer documentation than math-php. |
+| External statistical services / APIs | Adds network dependency, latency, and deployment complexity for computations that take microseconds locally. |
+| Custom C extensions for math | math-php is pure PHP with no dependencies. Performance is adequate for the data volumes in K12 education (thousands to tens of thousands of answer records, not millions). |
+| `catsim` or any CAT library | Wrong phase. The system needs validated calibration before adaptive testing makes sense. |
+| `phpscience/statistics` | Stale, fewer features, less maintained than math-php. |
+| Any Python/R bridge | Adds deployment complexity. The math is simple enough for PHP. |
+
+## Version Pinning Strategy
+
+```
+composer require markrogoyski/math-php:^2.13
+```
+
+Use caret (`^`) version constraint. math-php follows semantic versioning — `^2.13` allows `2.13.x` and `2.14+` but not `3.0`. The library has been stable across 2.x with no breaking changes in minor versions.
+
+## Migration Path
+
+1. **Phase 1 (Validation)**: Install math-php. Build backtest command. Validate calibration accuracy. No changes to production code path.
+2. **Phase 2 (Integration)**: Wire validated calibration into exam assembly via existing `QuestionDifficultyResolver`. Still using math-php only for monitoring/reporting.
+3. **Phase 3 (Intelligence)**: Add mastery-based difficulty recommendation. Potentially introduce simplified adaptive strategies. math-php supports the statistical needs throughout.
+
+## Sources
+
+- markrogoyski/math-php v2.13.0 on Packagist: https://packagist.org/packages/markrogoyski/math-php — **HIGH confidence**, verified directly
+- IRT theory: Wikipedia Item Response Theory article — **HIGH confidence**, well-established psychometric theory
+- CAT algorithms: Wikipedia Computerized Adaptive Testing article — **HIGH confidence**, standard reference
+- Brier Score: Wikipedia Brier Score article — **HIGH confidence**, proper scoring rule definition and decomposition
+- Existing codebase analysis: `QuestionDifficultyCalibrationService.php` (973 lines), `QuestionDifficultyCalibrationAnalyzer.php` (608 lines), `IntelligentExamController.php` (1267 lines), `DifficultyDistributionService.php` (219 lines), `QuestionDifficultyResolver.php` (88 lines) — **HIGH confidence**, direct code inspection

+ 172 - 0
.planning/research/SUMMARY.md

@@ -0,0 +1,172 @@
+# Project Research Summary
+
+**Project:** Math CMS -- Difficulty Calibration & Intelligent Exam
+**Domain:** K12 educational assessment -- difficulty calibration validation, pipeline wiring, adaptive exam assembly
+**Researched:** 2026-04-16
+**Confidence:** HIGH
+
+## Executive Summary
+
+This project is a difficulty calibration refinement for an existing K12 math learning platform built on Laravel. The system already has a working calibration algorithm (`stratified_residual_eb_v2`), a calibration data pipeline, and an exam assembly engine -- but the calibration loop is disconnected from production. The calibrated difficulty values exist in a separate table and are never used when assembling exams. This means the entire calibration effort runs in isolation, producing values that have zero impact on the exams students receive.
+
+The recommended approach follows a strict gate-based progression: (1) validate the calibration algorithm against held-out historical data using temporal walk-forward backtesting, (2) fix the dual difficulty scale bug (0-1 vs 0-5 values mixed in the same column), (3) wire validated calibration into the exam assembly pipeline with shadow mode comparison before full activation, and (4) build mastery-based difficulty category recommendation on top of the validated foundation. The only new dependency is `markrogoyski/math-php ^2.13` for statistical validation metrics (Pearson/Spearman correlation, Brier score decomposition, confidence intervals). Everything else is built on existing code and infrastructure.
+
+The key risks are: circular validation (testing on training data), the dual difficulty scale silently corrupting calibration baselines, and low-sample-size questions dominating results. The research identifies 16 pitfalls total, with 8 rated critical or high severity. The most important mitigation is the temporal train/test split for validation -- without this, the entire effort is built on unverified metrics. A secondary risk is the health monitor degeneracy spiral, where bad initial calibration causes the system to become too cautious to self-correct.
+
+## Key Findings
+
+### Recommended Stack
+
+The project uses the existing Laravel/PHP/MySQL/Redis stack. The sole addition is `markrogoyski/math-php ^2.13`, a pure-PHP statistics library providing correlation functions, significance testing, probability distributions, and descriptive statistics needed for backtest validation. No external services, no Python bridges, no IRT frameworks.
+
+**Core technologies:**
+- **Laravel ^12.0 / PHP ^8.2**: Existing backend -- no framework changes needed
+- **MySQL**: All calibration data lives in `questions`, `question_difficulty_calibrations`, `paper_questions`, `papers` tables
+- **math-php ^2.13** (NEW): Statistical validation -- Pearson/Spearman correlation, Brier score decomposition, Beta distribution for credible intervals, ANOVA for cross-type comparison
+- **PHPUnit ^11.5.3**: Automated regression tests for calibration accuracy in CI
+
+### Expected Features
+
+**Must have (table stakes):**
+- **TS-1: Historical data backtesting with temporal split** -- validates the algorithm before production use; without this, everything else is unverified
+- **TS-2: Difficulty scale normalization** -- fixes a correctness bug where 0-5 scale values get treated as 0-1 scale values, inverting difficulty for affected questions
+- **TS-3: Calibrated difficulty wired into exam assembly** -- `QuestionDifficultyResolver` exists but is never called in the main assembly path
+- **TS-4: Difficulty distribution enabled by default** -- `DifficultyDistributionService` is fully implemented but `enable_difficulty_distribution` defaults to `false`
+- **TS-5: Calibration coverage metrics** -- shows what percentage of questions have calibrated values
+- **TS-6: Per-question calibration confidence indicator** -- surfaces sample-size-based confidence levels
+
+**Should have (differentiators):**
+- **D-1: Mastery-based difficulty category auto-recommendation** -- maps per-knowledge-point mastery to difficulty category, eliminating manual tier selection
+- **D-2: Shadow mode comparison logging** -- runs both raw and calibrated paths in parallel without affecting output
+- **D-5: Outlier quarantine with human review** -- flags questions whose calibration delta exceeds 0.30 from original
+
+**Defer (v2+):**
+- D-3 (Health dashboard): Needs 30+ days of accumulated snapshot data
+- D-4 (Drift detection): Requires health dashboard running first
+- D-6 (Per-knowledge-point breakdown): Nice-to-have analytics
+- Full CAT (computerized adaptive testing): Fundamentally different architecture, not this scope
+
+### Architecture Approach
+
+The target architecture adds five new service classes as layers on top of the existing system, following a strict dependency chain. The calibration algorithm itself is NOT modified -- the project validates it and connects it, not rewrites it.
+
+**Major components:**
+1. **CalibrationBacktestService** -- temporal walk-forward validation producing PASS/FAIL gate; blocks all downstream work
+2. **DifficultyNormalizationService** -- centralizes the existing `normalizeDifficultyValue()` logic at the question-loading boundary
+3. **QuestionDifficultyResolver** (existing, needs wiring) -- applies calibrated-over-original priority chain in the assembly path
+4. **CalibrationVerificationGate** -- post-calibration sanity checks: range validation, outlier flagging, quarantine
+5. **DifficultyCategoryRecommender** -- maps MasteryCalculator output (per-knowledge-point mastery) to difficulty category via empirical thresholds
+
+### Critical Pitfalls
+
+1. **Circular validation (Pitfall 1)** -- Testing calibration on the same data that produced it gives falsely good metrics. Prevention: strict temporal train/test split, walk-forward validation.
+2. **Dual difficulty scale corruption (Pitfall 2)** -- The `> 1.0` heuristic for distinguishing 0-1 from 0-5 scale fails for values in the 0.0-1.0 overlap zone. Prevention: audit distribution, add `difficulty_scale` flag column, or one-time migration.
+3. **Low sample size dominance (Pitfall 3)** -- Online calibration mode has no minimum sample threshold; a single answer can shift difficulty by 0.11. Prevention: enforce minimum weighted_attempts threshold (15-20) in the resolver before using calibrated values.
+4. **Health monitor degeneracy (Pitfall 8)** -- Bad initial calibration triggers health monitor caution, which slows correction, which keeps predictions bad, which triggers more caution. Prevention: raise health_scale floor to 0.6, add reset mechanism after 14 days below 0.7.
+5. **Stratified baseline self-reference (Pitfall 4)** -- Baselines computed from the same responses being calibrated create regression-to-the-mean artifacts. Prevention: leave-one-out baseline computation, audit stratum sizes.
+
+## Implications for Roadmap
+
+### Phase 1: Validation and Data Audit
+
+**Rationale:** Everything downstream depends on knowing whether the calibration algorithm actually works. This phase must complete before any production wiring. Scale normalization must happen first because inconsistent input data produces meaningless validation results.
+
+**Delivers:** PASS/FAIL verdict on calibration accuracy, data quality audit revealing scale and coverage issues, consistent 0-1 difficulty scale across all questions
+
+**Addresses:** TS-1 (backtesting), TS-2 (scale normalization), TS-5 (coverage metrics), TS-6 (confidence indicators), D-2 (shadow mode logging -- can activate immediately after PASS)
+
+**Avoids:** Pitfall 1 (circular validation), Pitfall 2 (dual scale), Pitfall 3 (low sample size -- audit reveals extent), Pitfall 7 (time decay -- test with and without), Pitfall 8 (health monitor degeneracy -- audit behavior during validation), Pitfall 16 (default 0.5 contamination -- count during audit)
+
+### Phase 2: Assembly Integration
+
+**Rationale:** Requires Phase 1 PASS gate open. Wires the validated calibration into the production exam assembly path using the gate-based activation pattern. Shadow mode runs first, then per-assembly-type rollout with feature flags.
+
+**Delivers:** Calibrated difficulty used in all exam assembly paths, difficulty distribution enabled by default, confidence-based fallback for low-sample questions
+
+**Addresses:** TS-3 (calibrated difficulty in assembly), TS-4 (distribution enabled), AF-5 enforcement (minimum sample size)
+
+**Avoids:** Pitfall 5 (no A/B testing -- shadow mode + per-type flags), Pitfall 6 (boundary effects -- soft boundaries), Pitfall 9 (mode inconsistency -- align thresholds), Pitfall 10 (pool exhaustion -- pre-analyze coverage), Pitfall 15 (race condition -- row-level locks)
+
+**Uses:** math-php for ongoing monitoring metrics, CalibrationVerificationGate for outlier quarantine
+
+### Phase 3: Adaptive Matching
+
+**Rationale:** Requires calibrated assembly working reliably in production (Phase 2 stable). Builds the intelligence layer that maps student mastery to recommended difficulty, closing the "answer -> calibrate -> precise exam -> re-answer" loop described in the project vision.
+
+**Delivers:** Per-knowledge-point difficulty category recommendation, automatic exam tier selection, student-facing confidence indicator
+
+**Addresses:** D-1 (mastery-based auto-recommendation), D-5 (outlier quarantine), D-7 (student confidence indicator)
+
+**Avoids:** Pitfall 11 (question type heterogeneity -- guessing correction for choice questions), Pitfall 12 (mastery mapping without ground truth -- use historical data to find optimal mapping), Pitfall 13 (feedback loop divergence -- track drift, periodic expert anchoring)
+
+### Phase 4: Health Monitoring (Ongoing)
+
+**Rationale:** Needs 30+ days of production calibration data flowing through the wired pipeline. Builds longitudinal tracking and drift detection infrastructure.
+
+**Delivers:** Daily health snapshots, drift detection alerts, calibration coverage tracking over time
+
+**Addresses:** D-3 (health dashboard), D-4 (drift detection)
+
+**Avoids:** Pitfall 13 (feedback loop divergence -- provides early warning), Pitfall 14 (JSON bloat -- extract health metrics to dedicated columns)
+
+### Phase Ordering Rationale
+
+- Phase 1 must come first because connecting unvalidated calibration to production could make exams worse, not better. The PASS/FAIL gate is a hard prerequisite.
+- Scale normalization (TS-2) must precede or coincide with backtesting (TS-1) because inconsistent input data produces meaningless validation results.
+- Phase 2 (assembly wiring) is separate from Phase 1 (validation) because they have different risk profiles: validation is read-only analysis, wiring modifies production behavior.
+- Phase 3 (adaptive matching) requires Phase 2 because the category recommender needs calibrated difficulty actively working in the assembly pipeline to validate its recommendations.
+- Phase 4 (health monitoring) is last because it needs accumulated production data from the wired pipeline to be meaningful.
+- The critical path is: TS-1 -> TS-2 -> TS-3 -> TS-4 -> D-1, as identified in FEATURES.md.
+
+### Research Flags
+
+Phases likely needing deeper research during planning:
+- **Phase 1 (Validation):** The backtesting approach (walk-forward with Brier score) is well-documented, but the specific PASS/FAIL thresholds (Pearson > 0.3, Mean Brier < 0.20) need validation against actual data distributions. The dual scale audit may reveal a worse problem than expected, requiring a data migration strategy. The stratified baseline self-reference issue (Pitfall 4) may require leave-one-out computation, which has performance implications for large datasets.
+- **Phase 3 (Adaptive Matching):** The mastery-to-category threshold mapping needs empirical grounding using historical (mastery, difficulty_category, score) data. This is not a standard library problem -- it requires domain-specific analysis. Question type heterogeneity (choice vs fill-in vs open-ended) may require separate mappings.
+
+Phases with standard patterns (skip research-phase):
+- **Phase 2 (Assembly Integration):** Wiring existing services together, adding feature flags, shadow mode logging. All architectural patterns are well-defined in ARCHITECTURE.md.
+- **Phase 4 (Health Monitoring):** Standard scheduled job + dashboard pattern. The computation logic already exists in `getHealthScaleForType()`.
+
+## Confidence Assessment
+
+| Area | Confidence | Notes |
+|------|------------|-------|
+| Stack | HIGH | Direct codebase analysis + Packagist verification. Only one new dependency (math-php). Existing stack well-understood. |
+| Features | HIGH | Features derived from codebase gap analysis (what exists vs what is disconnected). Competitive comparison (ALEKS, IXL, Khan Academy) at MEDIUM confidence for specific feature claims. |
+| Architecture | HIGH | Target architecture derived from existing component structure. All new services follow established Laravel patterns. Dependency chain is clear. |
+| Pitfalls | HIGH | Codebase pitfalls directly observed in source code. Domain pitfalls (IRT, calibration theory) are well-established in psychometric literature. |
+
+**Overall confidence:** HIGH
+
+### Gaps to Address
+
+- **Dual scale severity unknown:** The `> 1.0` heuristic may work fine if most questions use the same scale, or it may silently corrupt 30%+ of difficulty values. Phase 1 must start with a data distribution audit before any validation can be trusted.
+- **Brier Skill Score threshold:** Research does not establish what constitutes a "good enough" BSS for K12 math difficulty calibration. Industry practice varies by domain. Recommend running the backtest first, then setting the threshold based on the distribution of results.
+- **Calibration algorithm parameter sensitivity:** The existing algorithm has multiple tuning parameters (half_life_days, max_step, shrinkage constants). The backtest should test at least 2-3 parameter settings to understand sensitivity, but the research recommends against automatic tuning (AF-6).
+- **Question pool coverage by knowledge point:** Enabling difficulty distribution requires sufficient questions at each difficulty level per knowledge point. If certain knowledge points have sparse pools, the distribution strategy will fail or fall back heavily. This needs a coverage analysis before Phase 2 rollout.
+- **Mastery-to-category mapping thresholds:** The proposed thresholds (mastery 0.30 -> category 1, 0.50 -> category 2, etc.) are reasonable starting points but need empirical validation against historical score data. The "zone of proximal development" target (60-75% correct) should drive the actual thresholds.
+- **Sample size requirements:** The minimum number of attempts before a calibrated difficulty value should be used in production is set at 5 in the analyzer but 10 in the calibration constraints. The optimal threshold should be explored during backtesting.
+- **Long-term calibration drift:** The algorithm has a 45-day half-life for time decay, but the system lacks explicit drift detection. Formal drift detection (comparing rolling windows) is deferred to Phase 4.
+
+## Sources
+
+### Primary (HIGH confidence)
+- Direct codebase analysis: `QuestionDifficultyCalibrationService` (974 lines), `QuestionDifficultyCalibrationAnalyzer` (608 lines), `QuestionDifficultyResolver` (88 lines), `DifficultyDistributionService` (219 lines), `IntelligentExamController` (1267 lines), `LearningAnalyticsService`, `MasteryCalculator`, `ExamAnswerAnalysisService`
+- markrogoyski/math-php v2.13.0 on Packagist -- verified actively maintained, pure PHP, 3k+ GitHub stars
+- Brier score decomposition and proper scoring rules -- standard psychometric methodology
+- IRT theory and CAT algorithms -- well-established in psychometric literature (Lord & Novick, Wainer)
+
+### Secondary (MEDIUM confidence)
+- ALEKS Knowledge Space Theory -- proprietary, details inferred from published research and documentation
+- IXL Real-Time Diagnostic and Trouble Spots -- feature descriptions from product documentation
+- Khan Academy mastery system -- based on published descriptions and training data
+- Adaptive testing design patterns -- Wainer "Computerized Adaptive Testing: A Primer"
+
+### Tertiary (LOW confidence)
+- Student-facing confidence indicator UX patterns -- inferred from Khan Academy progress bars, not directly verified in this domain
+- Calibration drift detection thresholds (7-day vs 30-day comparison, >20% increase flag) -- reasonable but empirically derived
+
+---
+*Research completed: 2026-04-16*
+*Ready for roadmap: yes*