# Architecture Patterns

**Domain:** K12 math difficulty calibration and intelligent exam matching
**Researched:** 2026-04-16
**Confidence:** HIGH (based on direct codebase analysis); MEDIUM (general adaptive testing patterns from training data)

## Current Architecture (As-Built)
```
ANSWER FLOW (already working)
=============================

Student answers exam
      |
      v
ExamAnswerAnalysisService.analyzeExamAnswers()
      |
      +---> MasteryCalculator (knowledge point mastery)
      +---> KnowledgeMasteryService (persist mastery)
      +---> LocalAIAnalysisService (update mastery)
      +---> MistakeBookService (add to mistake book)
      +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
                  |
                  v
            question_difficulty_calibrations table (upsert)


ASSEMBLY FLOW (partially working)
=================================

POST /api/intelligent-exam
      |
      v
IntelligentExamController.store()
      |
      v
AssembleExamTaskJob (queued)
      |
      v
LearningAnalyticsService.generateIntelligentExam()
      |
      +---> selectQuestions() -- uses raw questions.difficulty
      +---> applyTypeAwareDifficultyDistribution()
      |           |
      |           v
      |     DifficultyDistributionService
      |     (only if enable_difficulty_distribution=true,
      |      which defaults to FALSE)
      |
      v
QuestionDifficultyResolver.applyCalibratedDifficulty()
(exists but NOT called in the main assembly path)
```

### Identified Gaps in Current Architecture

| Gap | Location | Impact |
|-----|----------|--------|
| Calibration values not used in assembly | `LearningAnalyticsService` selects questions using raw `questions.difficulty` | Assembled exams use uncalibrated difficulty |
| `enable_difficulty_distribution` defaults to false | `LearningAnalyticsService` line 1554 | The distribution strategy never activates unless the caller explicitly enables it |
| No automatic difficulty_category recommendation | No service maps mastery to a category | Teachers must manually pick a tier; no student-level adaptation |
| No backtesting validation | `QuestionDifficultyCalibrationAnalyzer` reports but does not validate | Algorithm accuracy is unknown before production use |
| Dual difficulty scale (0-1 vs 0-5) | `normalizeDifficultyValue()` divides by 5 if the value is > 1.0 | Inconsistent source data enters calibration |
## Recommended Target Architecture

### Component Diagram
```
+===========================+
|   TARGET ARCHITECTURE     |
+===========================+

 LAYER 1: VALIDATION (must complete before anything else)
 +--------------------------------------------------------+
 |  CalibrationBacktestService
 |   |-- backtestAgainstHistory(cutoffDate)
 |   |-- computeBrierScores(questionIds)
 |   |-- computePearsonCorrelation()
 |   |-- produceValidationReport()
 |   +-- PASS/FAIL gate: algorithm accuracy threshold
 |
 |  Data: reads paper_questions + questions (historical)
 |  Writes: backtest_results table (or JSON export)
 +--------------------------------------------------------+
            |
            | PASS gate opens production use
            v

 LAYER 2: CALIBRATION FEEDBACK LOOP (enhance existing)
 +--------------------------------------------------------+
 |  ExamAnswerAnalysisService
 |   |-- (existing) analyzeExamAnswers()
 |   +-- (existing) recalibrateQuestionDifficulty()
 |
 |  QuestionDifficultyCalibrationService
 |   |-- updateOnlineFromPaper()   [existing, per-paper]
 |   |-- recalibrateQuestionIds()  [existing, batch]
 |   +-- getHealthScaleForType()   [existing, monitoring]
 |
 |  NEW: CalibrationVerificationGate
 |   |-- validateCalibratedRange(questionIds)
 |   |-- flagOutliers(threshold)
 |   +-- quarantineBadCalibrations()
 |
 |  Data: answer --> calibrate --> verify --> use
 +--------------------------------------------------------+
            |
            v

 LAYER 3: ASSEMBLY INTEGRATION (connect calibration to exam)
 +--------------------------------------------------------+
 |  DifficultyNormalizationService  [NEW]
 |   |-- normalize(questionId) -> float [0,1]
 |   |-- batchNormalize(questionIds) -> map
 |   +-- resolves 0-1 vs 0-5 ambiguity at read time
 |
 |  QuestionDifficultyResolver  [existing, expand usage]
 |   |-- applyCalibratedDifficulty(questions) -> array
 |   +-- MUST be called in the assembly path
 |
 |  LearningAnalyticsService
 |   |-- generateIntelligentExam()
 |   |    +-- CALL DifficultyNormalizationService first
 |   |    +-- CALL QuestionDifficultyResolver second
 |   |    +-- SET enable_difficulty_distribution = true
 |   +-- remove hard-coded default false
 |
 |  DifficultyDistributionService  [existing]
 |   |-- calculateDistribution(category, total)
 |   +-- groupQuestionsByDifficultyRange()
 +--------------------------------------------------------+
            |
            v

 LAYER 4: ADAPTIVE MATCHING (mastery-based difficulty selection)
 +--------------------------------------------------------+
 |  DifficultyCategoryRecommender  [NEW]
 |   |-- recommendForStudent(studentId, kpCodes) -> category
 |   |-- recommendForKnowledgePoint(studentId, kp) -> category
 |   +-- uses MasteryCalculator + calibration data
 |
 |  MasteryCalculator  [existing]
 |   |-- calculateMasteryLevel(studentId, kpCode)
 |   +-- returns mastery [0,1] + confidence + trend
 |
 |  Mapping logic:
 |    mastery [0.00, 0.30) -> category 0 (zero-foundation)
 |    mastery [0.30, 0.50) -> category 1 (foundation)
 |    mastery [0.50, 0.70) -> category 2 (intermediate)
 |    mastery [0.70, 0.85) -> category 3 (advanced)
 |    mastery [0.85, 1.00] -> category 4 (competition)
 +--------------------------------------------------------+
            |
            v

 LAYER 5: HEALTH MONITORING (continuous)
 +--------------------------------------------------------+
 |  CalibrationHealthMonitor  [NEW]
 |   |-- detectDrift(windowDays) -> drift report
 |   |-- accuracyTrend(days) -> accuracy over time
 |   |-- calibrationCoverage() -> % of questions calibrated
 |   +-- scheduled artisan command (daily/weekly)
 |
 |  Existing health mechanisms:
 |   |-- getHealthScaleForType() in CalibrationService
 |   +-- recent_events in algorithm_meta (per-question)
 |
 |  NEW: calibration_health_snapshots table
 |   +-- date, total_calibrated, avg_brier,
 |       coverage_pct, drift_flag, action
 +--------------------------------------------------------+
```

### Component Boundaries

| Component | Responsibility | Communicates With | New/Existing |
|-----------|---------------|-------------------|--------------|
| `CalibrationBacktestService` | Validate algorithm accuracy against historical data | Reads `paper_questions`, `questions`, `papers`. Writes report output. | NEW |
| `QuestionDifficultyCalibrationService` | Core calibration algorithm (stratified_residual_eb_v2) | Called by `ExamAnswerAnalysisService`, `CalibrationBacktestService` | EXISTING |
| `CalibrationVerificationGate` | Post-calibration sanity checks (range, outlier detection) | Reads `question_difficulty_calibrations`. Flags problematic entries. | NEW |
| `DifficultyNormalizationService` | Unify the 0-1 / 0-5 scales at the read boundary | Called by `LearningAnalyticsService` during question loading | NEW |
| `QuestionDifficultyResolver` | Apply calibrated difficulty to question arrays, calibrated-first | Called in the assembly path by `LearningAnalyticsService` | EXISTING (needs wiring) |
| `DifficultyDistributionService` | Calculate difficulty buckets per category | Called by `LearningAnalyticsService` when distribution is enabled | EXISTING |
| `DifficultyCategoryRecommender` | Map student mastery to a recommended difficulty category | Reads `MasteryCalculator`. Used by `IntelligentExamController` | NEW |
| `MasteryCalculator` | Calculate per-knowledge-point mastery levels | Existing, unchanged | EXISTING |
| `CalibrationHealthMonitor` | Detect calibration drift, coverage gaps, accuracy degradation | Reads `question_difficulty_calibrations`. Writes health snapshots. | NEW |
| `LearningAnalyticsService` | Orchestrate question selection and difficulty distribution | Must call normalization + resolver + distribution | EXISTING (needs modification) |
### Data Flow

```
QUESTION DIFFICULTY LIFECYCLE
=============================

 questions.difficulty (original, immutable)
      |
      v
 DifficultyNormalizationService.normalize()
      |  (resolves 0-1 vs 0-5, stores original_difficulty)
      v
 question_difficulty_calibrations.original_difficulty
      |
      +---[calibration loop]---> calibrated_difficulty
      |                               |
      v                               v
 QuestionDifficultyResolver.applyCalibratedDifficulty()
      |
      |  Returns: calibrated if it exists, else original (normalized)
      v
 DifficultyDistributionService.groupQuestionsByDifficultyRange()
      |
      v
 Selected questions for exam assembly


ANSWER-TO-CALIBRATION FEEDBACK LOOP
===================================

 Student submits exam answers
      |
      v
 ExamAnswerAnalysisService.analyzeExamAnswers()
      |
      +---> MasteryCalculator (update knowledge mastery)
      +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
              |
              +-- per-question: compute residual, apply shrinkage, clamp
              +-- upsert to question_difficulty_calibrations
              +-- append to recent_events in algorithm_meta
      |
      v
 CalibrationVerificationGate (NEW)
      |
      +-- check calibrated_difficulty in [0.01, 0.99]
      +-- flag if delta > 0.30 from original
      +-- quarantine if the Brier score is deteriorating
      |
      v
 Health monitor caches invalidated
 (getHealthScaleForType will recompute on the next call)


BACKTESTING VALIDATION FLOW
===========================

 CalibrationBacktestService.backtestAgainstHistory(cutoffDate)
      |
      v
 1. Load all questions with >= N attempts before cutoffDate
 2. Split: training set (before cutoff) vs test set (after cutoff)
 3. Run calibration on the training data only
 4. For each question in the test set:
      - predicted = calibrated_difficulty from training
      - actual    = observed error rate in the test period
      - Brier score = (predicted - actual)^2
 5. Aggregate metrics:
      - Mean Brier score (lower = better; < 0.15 is acceptable)
      - Pearson correlation (predicted vs actual; > 0.4 is acceptable)
      - Calibration coverage (% of questions with enough data)
      - MAE (mean absolute error; < 0.15 is acceptable)
 6. PASS gate (a deliberately looser floor than the per-metric targets above):
      - Pearson > 0.3 AND Mean Brier < 0.20
      - If FAIL: the algorithm needs tuning; do NOT enable it in production
      |
      v
 Report: JSON/CSV output + PASS/FAIL verdict


MASTERY-TO-DIFFICULTY MATCHING FLOW
===================================

 Exam request (student_id + kp_codes)
      |
      v
 DifficultyCategoryRecommender.recommendForStudent()
      |
      +-- for each kp_code:
      |     MasteryCalculator.calculateMasteryLevel(studentId, kp)
      |       -> mastery [0,1], confidence, trend
      |
      +-- aggregate mastery across kp_codes (weighted average)
      +-- map to category:
      |     mastery -> category via the threshold table
      |     adjust for trend: trending up -> +0.5 category push
      |     floor at 0, cap at 4
      |
      +-- return: recommended category + confidence + reasoning
      |
      v
 IntelligentExamController uses the recommended category
      |
      v
 LearningAnalyticsService with enable_difficulty_distribution=true
```
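
The threshold mapping and trend adjustment described in the matching flow are concrete enough to pin down as code. A minimal sketch, assuming the threshold table above and a plain function rather than the eventual `DifficultyCategoryRecommender` API; the function name, the `trend` encoding, and rounding the half-step push are illustrative assumptions:

```php
<?php

// Map a mastery score in [0,1] to a difficulty category 0-4 using the
// thresholds from the table above, with an optional trend adjustment.
function mapMasteryToCategory(float $mastery, string $trend = 'stable'): int
{
    $thresholds = [0.30, 0.50, 0.70, 0.85]; // category boundaries
    $category = 0;
    foreach ($thresholds as $t) {
        if ($mastery >= $t) {
            $category++;
        }
    }
    // "Trending up -> +0.5 category push": one interpretation is to add the
    // half step, round, then clamp to the valid 0-4 range.
    $adjusted = $category + ($trend === 'up' ? 0.5 : 0.0);
    return max(0, min(4, (int) round($adjusted)));
}
```

With this rounding choice, an improving student at mastery 0.60 lands in category 3 instead of 2; whether the push should round up or merely break ties is a product decision, not something the flow above settles.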

## Patterns to Follow

### Pattern 1: Gate-Based Progressive Activation

**What:** A calibration value must pass validation before it can influence production behavior. Once validated, components progressively unlock.
**When:** Any system where unvalidated statistical estimates would harm the user experience.
**Why this matters here:** The project explicitly requires "validation before production use." The current code already runs calibration, but it is not connected to assembly. That disconnection is correct; the backtest gate formalizes the transition.

```
Gate states:
  LOCKED -- calibration runs, values stored, NOT used in assembly
  TESTED -- backtest passed, enabled for shadow mode (log but don't act)
  ACTIVE -- fully enabled in the production assembly path
```

**Example implementation:**

```php
// In a config file or database table
'calibration_gate' => 'locked', // locked | tested | active

// In LearningAnalyticsService, during assembly:
if (config('calibration_gate') === 'active') {
    $questions = $resolver->applyCalibratedDifficulty($questions);
}
```

### Pattern 2: Difficulty Source Priority Chain

**What:** When multiple difficulty values exist for a question, follow a deterministic priority chain rather than ad-hoc logic.
**When:** Any lookup where calibrated, original, and estimated values coexist.
**Why:** The current `QuestionDifficultyResolver` already implements this pattern correctly (calibrated > original). It just needs to be called consistently.

```php
// Priority chain (already implemented in QuestionDifficultyResolver):
// 1. calibrated_difficulty (from question_difficulty_calibrations)
// 2. normalized questions.difficulty (0-1 scale, divide by 5 if needed)
// 3. fallback 0.5 (moderate default)
```
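
The comment-only chain above can be made concrete. A minimal sketch, assuming the calibrated and raw `questions.difficulty` values are passed in as plain arrays keyed by question id; the real resolver reads from `question_difficulty_calibrations`, and the function and parameter names here are illustrative:

```php
<?php

// Resolve one question's difficulty via the three-step priority chain.
function resolveDifficulty(int $questionId, array $calibrated, array $raw): float
{
    // 1. A calibrated value wins whenever it exists.
    if (isset($calibrated[$questionId])) {
        return $calibrated[$questionId];
    }
    // 2. Fall back to the original difficulty, normalized to [0,1].
    if (isset($raw[$questionId])) {
        $value = (float) $raw[$questionId];
        return $value > 1.0 ? $value / 5.0 : $value;
    }
    // 3. Moderate default when nothing is known about the question.
    return 0.5;
}
```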

### Pattern 3: Shadow Mode Before Activation

**What:** Before enabling calibrated difficulty in actual exam assembly, run both paths in parallel and compare the results without affecting output.
**When:** Connecting a validated but previously disconnected statistical system to production.
**Why:** Even with backtest validation, real-time behavior may differ from the historical backtest. Shadow mode catches integration bugs.

```php
// In LearningAnalyticsService assembly:
$rawQuestions = $selectedQuestions; // current behavior
$calibratedQuestions = $resolver->applyCalibratedDifficulty($rawQuestions);

// Log the comparison without using the calibrated values yet
Log::info('Shadow mode difficulty comparison', [
    'raw_avg' => collect($rawQuestions)->avg('difficulty'),
    'calibrated_avg' => collect($calibratedQuestions)->avg('difficulty'),
    'diff_count' => count(array_filter($calibratedQuestions, fn ($q) =>
        ($q['difficulty_source'] ?? '') === 'calibrated'
    )),
]);

// Use raw (unchanged behavior) until the gate opens
$selectedQuestions = $rawQuestions;
```

### Pattern 4: Stratified Baseline with Residual Adjustment

**What:** Already implemented in the codebase. The calibration algorithm computes expected error rates per (question_type, difficulty_category) stratum, then adjusts based on the residual (observed - expected).
**When:** This is the core calibration algorithm. Per the project scope, no changes are needed to the algorithm itself.

The existing algorithm is well structured:
- Global baselines: `buildGlobalBaselines()` computes per-stratum error rates
- Online update: `estimateOnlineBySingleOutcome()` processes one answer event
- Batch update: `estimateByStratifiedResidual()` processes historical data
- Health scaling: `getHealthScaleForType()` auto-reduces the step size when accuracy degrades

### Pattern 5: Time-Decay Weighted Statistics

**What:** Weight recent observations more heavily than old ones using exponential decay. Already implemented with a 45-day half-life.
**When:** Any aggregation of student performance or calibration data.
**Why:** K12 students improve; old responses are less predictive. The existing 45-day half-life is reasonable.
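
The decay scheme is easy to state precisely. A sketch of a 45-day half-life weight and a decay-weighted error rate, assuming each answer event carries an age in days; the function names and event shape are illustrative, not the existing implementation:

```php
<?php

// Exponential decay weight: an observation loses half its weight every
// $halfLifeDays days.
function decayWeight(int $ageDays, float $halfLifeDays = 45.0): float
{
    return pow(0.5, $ageDays / $halfLifeDays);
}

// Decay-weighted error rate over a list of events, where each event is
// ['age_days' => int, 'wrong' => bool].
function weightedErrorRate(array $events): float
{
    $num = 0.0;
    $den = 0.0;
    foreach ($events as $e) {
        $w = decayWeight($e['age_days']);
        $num += $w * ($e['wrong'] ? 1.0 : 0.0);
        $den += $w;
    }
    return $den > 0 ? $num / $den : 0.0;
}
```

A wrong answer from today and a correct answer from 45 days ago thus produce an error rate of 2/3 rather than 1/2, reflecting the recency bias the pattern calls for.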

## Anti-Patterns to Avoid

### Anti-Pattern 1: Backfilling questions.difficulty

**What:** Writing calibrated values back to `questions.difficulty`.
**Why bad:** It destroys the original reference value, makes debugging impossible, and violates a project constraint.
**Instead:** Keep the dual-table design. `questions.difficulty` is immutable reference data; `question_difficulty_calibrations` is the mutable overlay.

### Anti-Pattern 2: Global Difficulty Category Override

**What:** Computing one difficulty_category for a student across all knowledge points and applying it everywhere.
**Why bad:** A student may be advanced in algebra but a beginner in geometry. A global category creates mismatched exams.
**Instead:** Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation. Aggregate only when an exam spans multiple knowledge points.

### Anti-Pattern 3: Calibration Without Verification

**What:** Wiring calibration directly into the assembly path without the backtest validation step.
**Why bad:** If the algorithm has a systematic bias (e.g., it always overestimates difficulty for certain question types), it makes exams worse, not better.
**Instead:** Backtest first. The backtest is a prerequisite gate, not an optional report.

### Anti-Pattern 4: Dual-Scale Leakage

**What:** Mixing 0-5 scale difficulty values with 0-1 scale values in the same computation.
**Why bad:** A 0-5 value of 0.4 (very easy) is indistinguishable from a 0-1 value of 0.4 (moderate), so unnormalized inputs silently skew difficulty estimates.
**Instead:** Normalize at the read boundary. The existing `normalizeDifficultyValue()` in `QuestionDifficultyCalibrationService` handles this for calibration input, but `LearningAnalyticsService` does not normalize when loading questions for assembly. This must be fixed.
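
A minimal sketch of the read-boundary rule, mirroring the divide-by-5 heuristic of the existing `normalizeDifficultyValue()`; the function name and the clamp are illustrative. Note that values at or below 1.0 are inherently ambiguous between the two scales, which is exactly why normalization must happen once, at a single boundary, rather than being re-guessed in each consumer:

```php
<?php

// Normalize a difficulty value to the [0,1] scale at the read boundary.
function normalizeDifficulty(float $value): float
{
    // Values above 1.0 can only have come from the 0-5 scale.
    if ($value > 1.0) {
        $value /= 5.0;
    }
    // Clamp to guard against bad source data.
    return max(0.0, min(1.0, $value));
}
```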

### Anti-Pattern 5: Calibration Feedback Loop Without Rate Limiting

**What:** Allowing calibration to update on every single answer without any dampening.
**Why bad:** A single anomalous cohort (e.g., a class that all guesses randomly) can corrupt the calibration values.
**Instead:** The existing algorithm handles this well with shrinkage (the prior pulls toward the original value), step limits, minimum sample thresholds, and health scaling. Do not remove these safeguards.
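
The first two safeguards read naturally as an empirical-Bayes style update with a clipped step. This is an illustrative sketch only: the real algorithm (stratified_residual_eb_v2) differs in detail, and the function name, prior strength, and step limit here are assumptions:

```php
<?php

// Shrinkage update: small samples stay near the prior (the original
// normalized difficulty), and no single update moves more than $maxStep.
function shrunkUpdate(
    float $prior,                // original normalized difficulty, acts as the prior
    float $observedRate,         // observed error rate for the question
    int $sampleSize,             // attempts behind the observation
    float $priorStrength = 20.0, // pseudo-count pulling toward the prior (assumed)
    float $maxStep = 0.05        // per-update step limit (assumed)
): float {
    // Empirical-Bayes style pooling of prior and observation.
    $posterior = ($priorStrength * $prior + $sampleSize * $observedRate)
               / ($priorStrength + $sampleSize);
    // Rate limiting: clip the move, then clamp into the working range.
    $step = max(-$maxStep, min($maxStep, $posterior - $prior));
    return max(0.01, min(0.99, $prior + $step));
}
```

Even a cohort that gets a question 100% wrong can only drag the estimate up by the step limit per update, which is the dampening this anti-pattern demands.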

## Scalability Considerations

| Concern | At 100 questions | At 10K questions | At 100K questions |
|---------|------------------|------------------|-------------------|
| Calibration table size | Negligible | ~10K rows; fast with an index on `question_bank_id` | ~100K rows; add a composite index on `(calibrated_difficulty, updated_at)` |
| Backtest computation | < 1 second | 5-30 seconds depending on attempt count | Minutes; run as a queued job and cache the results |
| Per-answer calibration | < 10 ms (single upsert) | < 10 ms | < 10 ms (indexed lookup + single upsert) |
| Health monitoring | Negligible (scans recent rows) | 1-5 seconds (parsing algorithm_meta JSON) | 5-30 seconds; extract health metrics to dedicated columns |
| Mastery-to-category recommendation | < 50 ms (1 mastery lookup) | < 50 ms | < 100 ms (batch mastery lookup for multiple kp_codes) |
| `applyCalibratedDifficulty` batch | < 5 ms | < 20 ms (WHERE IN) | < 100 ms; add chunking for > 1000 IDs |
## Suggested Build Order

```
Phase 1: VALIDATION (must be first -- blocks production activation)
  1.1 CalibrationBacktestService
      - Reads historical data
      - Computes Pearson, Brier, MAE
      - Produces a PASS/FAIL report
  DEPENDS ON: existing QuestionDifficultyCalibrationService, existing data
  BLOCKS: Phases 3, 4

Phase 2: DIFFICULTY STANDARDIZATION (no behavioral change)
  2.1 DifficultyNormalizationService
      - Extract and centralize the normalizeDifficultyValue() logic
      - Apply at the question-loading boundary in LearningAnalyticsService
  DEPENDS ON: nothing new
  BLOCKS: Phase 3 (a consistent scale is needed before using calibration)

Phase 3: ASSEMBLY INTEGRATION (wires calibration into production)
  3.1 Wire QuestionDifficultyResolver into LearningAnalyticsService
      - Call applyCalibratedDifficulty() in the assembly path
      - Enable difficulty_distribution by default
      - Add shadow mode logging first, then activate
  3.2 CalibrationVerificationGate
      - Post-upsert sanity checks
      - Outlier quarantine
  DEPENDS ON: Phase 1 (PASS gate), Phase 2 (consistent scale)
  BLOCKS: Phase 4

Phase 4: ADAPTIVE MATCHING (new feature)
  4.1 DifficultyCategoryRecommender
      - Map mastery -> category
      - Per-kp and aggregate recommendations
  4.2 Wire into IntelligentExamController
      - Auto-fill difficulty_category when not specified
  DEPENDS ON: Phase 3 (calibrated assembly working)

Phase 5: HEALTH MONITORING (ongoing)
  5.1 CalibrationHealthMonitor
      - Scheduled artisan command
      - Drift detection, coverage tracking
      - calibration_health_snapshots table
  5.2 Alert logic
      - Flag when the Brier score degrades, coverage drops, or drift is detected
  DEPENDS ON: Phase 3 (production calibration data must be flowing)
```

### Dependency Graph

```
Phase 1 (Validation) ------------+
                                 |
                                 v
Phase 2 (Standardization) ---> Phase 3 (Assembly Integration)
                                 |
                                 v
                               Phase 4 (Adaptive Matching)
                                 |
                                 v
                               Phase 5 (Health Monitoring)
```

Phase 2 can run in parallel with Phase 1, since it does not depend on validation results; it only centralizes existing normalization logic. Phase 3 requires both a Phase 1 PASS and Phase 2 completion. Phases 4 and 5 follow sequentially after Phase 3.

## Data Model Additions

### New Table: `calibration_health_snapshots`

```sql
CREATE TABLE calibration_health_snapshots (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    snapshot_date DATE NOT NULL,
    total_questions INT UNSIGNED DEFAULT 0,
    calibrated_count INT UNSIGNED DEFAULT 0,
    coverage_pct DECIMAL(5,2) DEFAULT 0,
    avg_brier_score DECIMAL(8,6) DEFAULT NULL,
    avg_logloss DECIMAL(8,6) DEFAULT NULL,
    pearson_correlation DECIMAL(8,4) DEFAULT NULL,
    mean_abs_residual DECIMAL(8,4) DEFAULT NULL,
    health_scale_avg DECIMAL(5,3) DEFAULT NULL,
    drift_flag TINYINT(1) DEFAULT 0,
    drift_details JSON DEFAULT NULL,
    action VARCHAR(32) DEFAULT 'none',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY idx_snapshot_date (snapshot_date)
);
```
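
The derived columns above can be illustrated with the arithmetic that would fill them; the snapshot writer itself would be a scheduled artisan command. A sketch assuming a snapshot-over-snapshot Brier comparison; the function name and the 20% degradation threshold are assumptions, not from the source:

```php
<?php

// Compute the derived fields of one calibration_health_snapshots row.
function buildHealthSnapshot(
    int $totalQuestions,
    int $calibratedCount,
    ?float $prevBrier,  // avg_brier_score from the previous snapshot, if any
    ?float $currBrier   // avg_brier_score for the current window
): array {
    $coverage = $totalQuestions > 0
        ? round(100.0 * $calibratedCount / $totalQuestions, 2)
        : 0.0;
    // Flag drift when the Brier score worsens by more than 20% between
    // snapshots (illustrative threshold).
    $drift = $prevBrier !== null && $currBrier !== null
        && $currBrier > $prevBrier * 1.2;
    return [
        'coverage_pct' => $coverage,
        'drift_flag'   => $drift ? 1 : 0,
        'action'       => $drift ? 'review' : 'none',
    ];
}
```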

### New Table: `backtest_results`

```sql
CREATE TABLE backtest_results (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    run_id VARCHAR(64) NOT NULL,
    cutoff_date DATE NOT NULL,
    question_bank_id BIGINT UNSIGNED NOT NULL,
    training_attempts INT UNSIGNED DEFAULT 0,
    test_attempts INT UNSIGNED DEFAULT 0,
    predicted_difficulty DECIMAL(6,4) DEFAULT NULL,
    observed_error_rate DECIMAL(6,4) DEFAULT NULL,
    brier_score DECIMAL(8,6) DEFAULT NULL,
    absolute_error DECIMAL(6,4) DEFAULT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_run_id (run_id),
    INDEX idx_cutoff (cutoff_date)
);
```

No changes are needed to existing tables. The `question_difficulty_calibrations` table schema is sufficient.

## Key Design Decisions

### Decision 1: Calibration is an Overlay, Not a Replacement

The calibrated difficulty is an overlay on top of the original difficulty. `QuestionDifficultyResolver` already implements this correctly: the calibrated value takes priority and the original value is the fallback. This must remain the design. Never write calibrated values back to `questions.difficulty`.

### Decision 2: Gate-Based Activation, Not Feature Flags

Use a deterministic gate (backtest PASS/FAIL) rather than a manual feature flag to enable calibration in production. The gate should be an artisan command that sets a config value or database flag after validation passes. This prevents human error from enabling an unvalidated algorithm.

### Decision 3: Per-Knowledge-Point Difficulty Recommendation

When recommending a difficulty_category for a student, compute per-knowledge-point mastery and map each to a category. If an exam covers multiple knowledge points, use the weighted average of their recommended categories, weighted by the student's weakness level (weaker knowledge points get more weight, to avoid overwhelming the student).
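
The weakness-weighted aggregation rule can be sketched as follows, assuming each knowledge point has already been mapped to a category; the function name and the 0.05 weight floor (which keeps fully mastered points from vanishing entirely) are illustrative assumptions:

```php
<?php

// Combine per-knowledge-point recommendations into one exam-level category.
// Input: list of ['mastery' => float in [0,1], 'category' => int 0-4].
function aggregateRecommendedCategory(array $perKp): int
{
    $num = 0.0;
    $den = 0.0;
    foreach ($perKp as $kp) {
        // Weakness level as the weight: low mastery -> high weight.
        $weight = max(1.0 - $kp['mastery'], 0.05);
        $num += $weight * $kp['category'];
        $den += $weight;
    }
    $avg = $den > 0 ? $num / $den : 0.0;
    return max(0, min(4, (int) round($avg)));
}
```

A student at category 4 in a mastered topic but category 0 in a weak one is pulled down toward 0, which is the "avoid overwhelming the student" behavior the decision calls for.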

### Decision 4: Health Monitoring is Separate from Calibration

The existing `getHealthScaleForType()` already provides inline health adjustment. The new `CalibrationHealthMonitor` serves a different purpose: longitudinal tracking and alerting. It should NOT modify calibration behavior directly; instead, it produces reports that humans review to decide whether algorithm parameters need adjustment.

### Decision 5: Backtesting Uses a Temporal Split, Not a Random Split

When validating the calibration algorithm, split the data by time (a cutoff date) rather than randomly. This is critical because:
1. The algorithm includes time decay, so temporal ordering matters
2. A random split would leak future information into training
3. Real deployment processes data chronologically
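
Once the temporal split has produced (predicted, observed) pairs for the post-cutoff test period, the validation reduces to straightforward metric arithmetic. A sketch of the aggregate step; the function shape is illustrative, and only the two PASS-gate thresholds come from this document:

```php
<?php

// Aggregate backtest metrics over [['predicted' => float, 'observed' => float], ...].
function backtestMetrics(array $pairs): array
{
    $n = count($pairs);
    if ($n === 0) {
        return ['mean_brier' => null, 'pearson' => null, 'pass' => false];
    }
    $brierSum = 0.0;
    $p = [];
    $o = [];
    foreach ($pairs as $pair) {
        $brierSum += ($pair['predicted'] - $pair['observed']) ** 2;
        $p[] = $pair['predicted'];
        $o[] = $pair['observed'];
    }
    $meanP = array_sum($p) / $n;
    $meanO = array_sum($o) / $n;
    $cov = $varP = $varO = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $cov  += ($p[$i] - $meanP) * ($o[$i] - $meanO);
        $varP += ($p[$i] - $meanP) ** 2;
        $varO += ($o[$i] - $meanO) ** 2;
    }
    $pearson = ($varP > 0 && $varO > 0) ? $cov / sqrt($varP * $varO) : 0.0;
    $meanBrier = $brierSum / $n;
    return [
        'mean_brier' => $meanBrier,
        'pearson'    => $pearson,
        // Thresholds from the PASS gate in the backtesting flow.
        'pass'       => $pearson > 0.3 && $meanBrier < 0.20,
    ];
}
```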

## Sources

- Direct codebase analysis of all referenced PHP service files (HIGH confidence)
- The existing `question_difficulty_calibrations` migration schema (HIGH confidence)
- Adaptive testing and IRT architecture patterns from training data (MEDIUM confidence -- standard patterns in the psychometrics literature)
- Brier score and calibration validation approaches from training data (MEDIUM confidence -- well-established statistical methodology)