# Architecture Patterns

**Domain:** K12 math difficulty calibration and intelligent exam matching
**Researched:** 2026-04-16
**Confidence:** HIGH (based on direct codebase analysis); MEDIUM (general adaptive testing patterns from training data)

## Current Architecture (As-Built)

```
ANSWER FLOW (already working)
=============================
Student answers exam
        |
        v
ExamAnswerAnalysisService.analyzeExamAnswers()
        |
        +---> MasteryCalculator (knowledge point mastery)
        +---> KnowledgeMasteryService (persist mastery)
        +---> LocalAIAnalysisService (update mastery)
        +---> MistakeBookService (add to mistake book)
        +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
                |
                v
        question_difficulty_calibrations table (upsert)

ASSEMBLY FLOW (partially working)
=================================
POST /api/intelligent-exam
        |
        v
IntelligentExamController.store()
        |
        v
AssembleExamTaskJob (queued)
        |
        v
LearningAnalyticsService.generateIntelligentExam()
        |
        +---> selectQuestions() -- uses raw questions.difficulty
        +---> applyTypeAwareDifficultyDistribution()
        |       |
        |       v
        |   DifficultyDistributionService
        |   (only if enable_difficulty_distribution=true,
        |    which defaults to FALSE)
        |
        v
QuestionDifficultyResolver.applyCalibratedDifficulty()
(exists but is NOT called in the main assembly path)
```

### Identified Gaps in Current Architecture

| Gap | Location | Impact |
|-----|----------|--------|
| Calibration values not used in assembly | `LearningAnalyticsService` selects questions using raw `questions.difficulty` | Assembled exams use uncalibrated difficulty |
| `enable_difficulty_distribution` defaults false | `LearningAnalyticsService` line 1554 | Distribution strategy never activates unless the caller explicitly enables it |
| No auto difficulty_category recommendation | No service maps mastery to category | Teachers must manually pick a tier; no student-level adaptation |
| No backtesting validation | `QuestionDifficultyCalibrationAnalyzer` reports but does not validate | Algorithm accuracy unknown before production use |
| Dual difficulty scale (0-1 vs 0-5) | `normalizeDifficultyValue()` divides by 5 if > 1.0 | Inconsistent source data enters calibration |

## Recommended Target Architecture

### Component Diagram

```
+===================================================================================+
|                               TARGET ARCHITECTURE                                 |
+===================================================================================+

LAYER 1: VALIDATION (must complete before anything else)
+--------------------------------------------------------+
|                                                        |
|  CalibrationBacktestService                            |
|  |-- backtestAgainstHistory(cutoffDate)                |
|  |-- computeBrierScores(questionIds)                   |
|  |-- computePearsonCorrelation()                       |
|  |-- produceValidationReport()                         |
|  +-- PASS/FAIL gate: algo accuracy threshold           |
|                                                        |
|  Data: reads paper_questions + questions (historical)  |
|  Writes: backtest_results table (or JSON export)       |
+--------------------------------------------------------+
        |
        | PASS gate opens production use
        v
LAYER 2: CALIBRATION FEEDBACK LOOP (enhance existing)
+--------------------------------------------------------+
|                                                        |
|  ExamAnswerAnalysisService                             |
|  |-- (existing) analyzeExamAnswers()                   |
|  +-- (existing) recalibrateQuestionDifficulty()        |
|                                                        |
|  QuestionDifficultyCalibrationService                  |
|  |-- updateOnlineFromPaper()  [existing, per-paper]    |
|  |-- recalibrateQuestionIds() [existing, batch]        |
|  +-- getHealthScaleForType()  [existing, monitoring]   |
|                                                        |
|  NEW: CalibrationVerificationGate                      |
|  |-- validateCalibratedRange(questionIds)              |
|  |-- flagOutliers(threshold)                           |
|  +-- quarantineBadCalibrations()                       |
|                                                        |
|  Data: answer --> calibrate --> verify --> use         |
+--------------------------------------------------------+
        |
        v
LAYER 3: ASSEMBLY INTEGRATION (connect calibration to exam)
+--------------------------------------------------------+
|                                                        |
|  DifficultyNormalizationService [NEW]                  |
|  |-- normalize(questionId) -> float [0,1]              |
|  |-- batchNormalize(questionIds) -> map                |
|  +-- resolves 0-1 vs 0-5 ambiguity at read time        |
|                                                        |
|  QuestionDifficultyResolver [existing, expand usage]   |
|  |-- applyCalibratedDifficulty(questions) -> arr       |
|  +-- MUST be called in assembly path                   |
|                                                        |
|  LearningAnalyticsService                              |
|  |-- generateIntelligentExam()                         |
|  |   +-- CALL DifficultyNormalizationService first     |
|  |   +-- CALL QuestionDifficultyResolver second        |
|  |   +-- SET enable_difficulty_distribution = true     |
|  +-- remove hard-coded default false                   |
|                                                        |
|  DifficultyDistributionService [existing]              |
|  |-- calculateDistribution(category, total)            |
|  +-- groupQuestionsByDifficultyRange()                 |
+--------------------------------------------------------+
        |
        v
LAYER 4: ADAPTIVE MATCHING (mastery-based difficulty selection)
+--------------------------------------------------------+
|                                                        |
|  DifficultyCategoryRecommender [NEW]                   |
|  |-- recommendForStudent(studentId, kpCodes) -> cat    |
|  |-- recommendForKnowledgePoint(studentId, kp) -> cat  |
|  +-- uses MasteryCalculator + calibration data         |
|                                                        |
|  MasteryCalculator [existing]                          |
|  |-- calculateMasteryLevel(studentId, kpCode)          |
|  +-- returns mastery [0,1] + confidence + trend        |
|                                                        |
|  Mapping logic:                                        |
|  mastery [0.0, 0.30)  -> category 0 (zero-foundation)  |
|  mastery [0.30, 0.50) -> category 1 (foundation)       |
|  mastery [0.50, 0.70) -> category 2 (intermediate)     |
|  mastery [0.70, 0.85) -> category 3 (advanced)         |
|  mastery [0.85, 1.00) -> category 4 (competition)      |
+--------------------------------------------------------+
        |
        v
LAYER 5: HEALTH MONITORING (continuous)
+--------------------------------------------------------+
|                                                        |
|  CalibrationHealthMonitor [NEW]                        |
|  |-- detectDrift(windowDays) -> drift report           |
|  |-- accuracyTrend(days) -> accuracy over time         |
|  |-- calibrationCoverage() -> % questions calibrated   |
|  +-- scheduled artisan command (daily/weekly)          |
|                                                        |
|  Existing health mechanisms:                           |
|  |-- getHealthScaleForType() in CalibrationService     |
|  +-- recent_events in algorithm_meta (per-question)    |
|                                                        |
|  NEW: calibration_health_snapshots table               |
|  +-- date, total_calibrated, avg_brier,                |
|      coverage_pct, drift_flag, action                  |
+--------------------------------------------------------+
```

### Component Boundaries

| Component | Responsibility | Communicates With | New/Existing |
|-----------|---------------|-------------------|--------------|
| `CalibrationBacktestService` | Validate algorithm accuracy against historical data | Reads `paper_questions`, `questions`, `papers`. Writes report output. | NEW |
| `QuestionDifficultyCalibrationService` | Core calibration algorithm (stratified_residual_eb_v2) | Called by `ExamAnswerAnalysisService`, `CalibrationBacktestService` | EXISTING |
| `CalibrationVerificationGate` | Post-calibration sanity checks (range, outlier detection) | Reads `question_difficulty_calibrations`. Flags problematic entries. | NEW |
| `DifficultyNormalizationService` | Unify 0-1 / 0-5 scale at read boundary | Called by `LearningAnalyticsService` during question loading | NEW |
| `QuestionDifficultyResolver` | Apply calibrated difficulty to question arrays, calibrated-first | Called in assembly path by `LearningAnalyticsService` | EXISTING (needs wiring) |
| `DifficultyDistributionService` | Calculate difficulty buckets per category | Called by `LearningAnalyticsService` when distribution enabled | EXISTING |
| `DifficultyCategoryRecommender` | Map student mastery to recommended difficulty category | Reads `MasteryCalculator`. Used by `IntelligentExamController` | NEW |
| `MasteryCalculator` | Calculate per-knowledge-point mastery levels | Existing, unchanged | EXISTING |
| `CalibrationHealthMonitor` | Detect calibration drift, coverage gaps, accuracy degradation | Reads `question_difficulty_calibrations`. Writes health snapshots. | NEW |
| `LearningAnalyticsService` | Orchestrate question selection and difficulty distribution | Must call normalization + resolver + distribution | EXISTING (needs modification) |

### Data Flow

```
QUESTION DIFFICULTY LIFECYCLE
=============================
questions.difficulty (original, immutable)
        |
        v
DifficultyNormalizationService.normalize()
        | (resolves 0-1 vs 0-5, stores original_difficulty)
        v
question_difficulty_calibrations.original_difficulty
        |
        +---[calibration loop]---> calibrated_difficulty
        |                                  |
        v                                  v
QuestionDifficultyResolver.applyCalibratedDifficulty()
        |
        | Returns: calibrated if exists, else original (normalized)
        v
DifficultyDistributionService.groupQuestionsByDifficultyRange()
        |
        v
Selected questions for exam assembly

ANSWER-TO-CALIBRATION FEEDBACK LOOP
===================================
Student submits exam answers
        |
        v
ExamAnswerAnalysisService.analyzeExamAnswers()
        |
        +---> MasteryCalculator (update knowledge mastery)
        +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
                |
                +-- per-question: compute residual, apply shrinkage, clamp
                +-- upsert to question_difficulty_calibrations
                +-- append to recent_events in algorithm_meta
                |
                v
        CalibrationVerificationGate (NEW)
                |
                +-- check calibrated_difficulty in [0.01, 0.99]
                +-- flag if delta > 0.30 from original
                +-- quarantine if Brier score deteriorating
                |
                v
        Health monitor caches invalidated
        (getHealthScaleForType will recompute on next call)

BACKTESTING VALIDATION FLOW
===========================
CalibrationBacktestService.backtestAgainstHistory(cutoffDate)
        |
        v
1. Load all questions with >= N attempts before cutoffDate
2. Split: training set (before cutoff) vs test set (after cutoff)
3. Run calibration on training data only
4. For each question in test set:
   - predicted = calibrated_difficulty from training
   - actual = observed error rate in test period
   - Brier score = (predicted - actual)^2
5. Aggregate metrics:
   - Mean Brier score (lower = better, < 0.15 is acceptable)
   - Pearson correlation (predicted vs actual, > 0.4 is acceptable)
   - Calibration coverage (% questions with enough data)
   - MAE (mean absolute error, < 0.15 is acceptable)
6. PASS gate:
   - Pearson > 0.3 AND Mean Brier < 0.20
   - If FAIL: algorithm needs tuning, do NOT enable in production
        |
        v
Report: JSON/CSV output + PASS/FAIL verdict

MASTERY-TO-DIFFICULTY MATCHING FLOW
===================================
Exam request (student_id + kp_codes)
        |
        v
DifficultyCategoryRecommender.recommendForStudent()
        |
        +-- for each kp_code:
        |     MasteryCalculator.calculateMasteryLevel(studentId, kp)
        |       -> mastery [0,1], confidence, trend
        |
        +-- aggregate mastery across kp_codes (weighted average)
        +-- map to category:
        |     mastery -> category via threshold table
        |     adjust for trend: trending up -> +0.5 category push
        |     floor at 0, cap at 4
        |
        +-- return: recommended category + confidence + reasoning
        |
        v
IntelligentExamController uses recommended category
        |
        v
LearningAnalyticsService with enable_difficulty_distribution=true
```

## Patterns to Follow

### Pattern 1: Gate-Based Progressive Activation

**What:** A calibration value must pass validation before it can influence production behavior. Once validated, components progressively unlock.

**When:** Any system where unvalidated statistical estimates would harm user experience.

**Why this matters here:** The project explicitly requires "validation before production use." The current code already has calibration running, but it is not connected to assembly. This is correct; the backtest gate formalizes the transition.
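The gate's verdict reduces to a few aggregate statistics over (predicted, observed) pairs. A minimal Python sketch of the PASS/FAIL computation — `backtest_verdict` and `pearson` are illustrative names, not the actual service API, and the thresholds are the minimum gate values from the backtesting flow above:

```python
import math

# Minimum gate values from the backtesting flow (not the tighter "acceptable" targets).
PEARSON_MIN = 0.3
BRIER_MAX = 0.20

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def backtest_verdict(predicted, observed):
    """predicted: calibrated difficulties from the training window;
    observed: error rates measured in the test window."""
    n = len(predicted)
    brier = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n
    mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / n
    r = pearson(predicted, observed)
    return {"brier": brier, "mae": mae, "pearson": r,
            "pass": r > PEARSON_MIN and brier < BRIER_MAX}
```

The point is that the gate is deterministic: the same historical split always yields the same verdict, so "enable in production" is a reproducible decision rather than a judgment call.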
```
Gate states:
  LOCKED -- calibration runs, values stored, NOT used in assembly
  TESTED -- backtest passed, enable for shadow mode (log but don't act)
  ACTIVE -- fully enabled in production assembly path
```

**Example implementation:**

```php
// In a config or database table
'calibration_gate' => 'locked', // locked | tested | active

// In LearningAnalyticsService, during assembly:
if (config('calibration_gate') === 'active') {
    $questions = $resolver->applyCalibratedDifficulty($questions);
}
```

### Pattern 2: Difficulty Source Priority Chain

**What:** When multiple difficulty values exist for a question, follow a deterministic priority chain rather than ad-hoc logic.

**When:** Any lookup where calibrated, original, and estimated values coexist.

**Why:** The current `QuestionDifficultyResolver` already implements this pattern correctly (calibrated > original). It just needs to be consistently called.

```php
// Priority chain (already implemented in QuestionDifficultyResolver):
// 1. calibrated_difficulty (from question_difficulty_calibrations)
// 2. normalized questions.difficulty (0-1 scale, divide-by-5 if needed)
// 3. fallback 0.5 (moderate default)
```

### Pattern 3: Shadow Mode Before Activation

**What:** Before enabling calibrated difficulty in actual exam assembly, run both paths in parallel and compare results without affecting output.

**When:** Connecting a validated but previously-disconnected statistical system to production.

**Why:** Even with backtest validation, real-time behavior may differ from the historical backtest. Shadow mode catches integration bugs.
```php
// In LearningAnalyticsService assembly:
$rawQuestions = $selectedQuestions; // current behavior
$calibratedQuestions = $resolver->applyCalibratedDifficulty($rawQuestions);

// Log comparison without using calibrated values yet
Log::info('Shadow mode difficulty comparison', [
    'raw_avg' => collect($rawQuestions)->avg('difficulty'),
    'calibrated_avg' => collect($calibratedQuestions)->avg('difficulty'),
    'diff_count' => count(array_filter($calibratedQuestions, fn($q) =>
        ($q['difficulty_source'] ?? '') === 'calibrated'
    )),
]);

// Use raw (unchanged behavior) until the gate opens
$selectedQuestions = $rawQuestions;
```

### Pattern 4: Stratified Baseline with Residual Adjustment

**What:** Already implemented in the codebase. The calibration algorithm computes expected error rates per (question_type, difficulty_category) stratum, then adjusts based on the residual (observed - expected).

**When:** This is the core calibration algorithm. No changes are needed to the algorithm itself per the project scope.

The existing algorithm is well-structured:

- Global baselines: `buildGlobalBaselines()` computes per-stratum error rates
- Online update: `estimateOnlineBySingleOutcome()` processes one answer event
- Batch update: `estimateByStratifiedResidual()` processes historical data
- Health scaling: `getHealthScaleForType()` auto-reduces step size when accuracy degrades

### Pattern 5: Time-Decay Weighted Statistics

**What:** Weight recent observations more heavily than old ones using exponential decay. Already implemented with a 45-day half-life.

**When:** Any aggregation of student performance or calibration data.

**Why:** K12 students improve; old responses are less predictive. The existing 45-day half-life is reasonable.

## Anti-Patterns to Avoid

### Anti-Pattern 1: Backfilling questions.difficulty

**What:** Writing calibrated values back to `questions.difficulty`.

**Why bad:** Destroys the original reference value, makes debugging impossible, and violates a project constraint.
**Instead:** Keep the dual-table design. `questions.difficulty` is append-only and immutable; `question_difficulty_calibrations` is the mutable overlay.

### Anti-Pattern 2: Global Difficulty Category Override

**What:** Computing one difficulty_category for a student across all knowledge points and applying it everywhere.

**Why bad:** A student may be advanced in algebra but a beginner in geometry. A global category creates mismatched exams.

**Instead:** Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation. Aggregate only when an exam spans multiple knowledge points.

### Anti-Pattern 3: Calibration Without Verification

**What:** Wiring calibration directly into the assembly path without the backtest validation step.

**Why bad:** If the algorithm has systematic bias (e.g., always overestimates difficulty for certain question types), it makes exams worse, not better.

**Instead:** Backtest first. The backtest is a prerequisite gate, not an optional report.

### Anti-Pattern 4: Dual-Scale Leakage

**What:** Mixing 0-5 scale difficulty values with 0-1 scale values in the same computation.

**Why bad:** A 0-5 value of 0.4 (easy) gets treated as 0.4 on the 0-1 scale (hard), producing inverted difficulty estimates.

**Instead:** Normalize at the read boundary. The existing `normalizeDifficultyValue()` in `QuestionDifficultyCalibrationService` handles this for calibration input, but `LearningAnalyticsService` does not normalize when loading questions for assembly. This must be fixed.

### Anti-Pattern 5: Calibration Feedback Loop Without Rate Limiting

**What:** Allowing calibration to update on every single answer without any dampening.

**Why bad:** A single anomalous cohort (e.g., a class that all guesses randomly) can corrupt calibration values.

**Instead:** The existing algorithm handles this well with shrinkage (the prior pulls toward the original value), step limits, minimum sample thresholds, and health scaling. Do not remove these safeguards.
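To make the dampening safeguards concrete, here is a minimal sketch — not the actual `stratified_residual_eb_v2` implementation, just the idea of empirical-Bayes shrinkage plus a step clamp, with illustrative parameter names and values:

```python
def damped_update(original, observed_rate, n_samples,
                  prior_strength=20, max_step=0.05, min_samples=5):
    """Dampened calibration update: shrink the observation toward the
    original (prior) value, then clamp the per-update step.
    All parameter values here are illustrative, not the production tuning."""
    if n_samples < min_samples:
        return original  # minimum sample threshold: not enough evidence yet

    # Shrinkage: weighted average of prior and observation; small samples
    # barely move the estimate, so one anomalous cohort cannot corrupt it.
    shrunk = (prior_strength * original + n_samples * observed_rate) \
             / (prior_strength + n_samples)

    # Step limit: never move more than max_step in a single update.
    step = max(-max_step, min(max_step, shrunk - original))

    # Clamp to the [0.01, 0.99] range used by the verification gate.
    return min(0.99, max(0.01, original + step))
```

For example, fifty answers suggesting an error rate of 0.9 against an original difficulty of 0.5 move the estimate only to 0.55 — the step clamp absorbs the rest, and subsequent cohorts must confirm the shift before it fully lands.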
## Scalability Considerations

| Concern | At 100 questions | At 10K questions | At 100K questions |
|---------|------------------|------------------|-------------------|
| Calibration table size | Negligible | ~10K rows, fast with index on `question_bank_id` | ~100K rows; add composite index on `(calibrated_difficulty, updated_at)` |
| Backtest computation | < 1 second | 5-30 seconds depending on attempt count | Minutes; run as queued job, cache results |
| Per-answer calibration | < 10ms (single upsert) | < 10ms | < 10ms (indexed lookup + single upsert) |
| Health monitoring | Negligible (scans recent rows) | 1-5 seconds (parsing algorithm_meta JSON) | 5-30 seconds; extract health metrics to dedicated columns |
| Mastery-to-category recommendation | < 50ms (1 mastery lookup) | < 50ms | < 100ms (batch mastery lookup for multiple kp_codes) |
| `applyCalibratedDifficulty` batch | < 5ms | < 20ms (WHERE IN) | < 100ms; add chunking for > 1000 IDs |

## Suggested Build Order

```
Phase 1: VALIDATION (must be first -- blocks everything else)
  1.1 CalibrationBacktestService
      - Reads historical data
      - Computes Pearson, Brier, MAE
      - Produces PASS/FAIL report
  DEPENDS ON: existing QuestionDifficultyCalibrationService, existing data
  BLOCKS: Phase 2, 3, 4

Phase 2: DIFFICULTY STANDARDIZATION (no behavioral change)
  2.1 DifficultyNormalizationService
      - Extract and centralize normalizeDifficultyValue() logic
      - Apply at question-loading boundary in LearningAnalyticsService
  DEPENDS ON: nothing new
  BLOCKS: Phase 3 (need consistent scale before using calibration)

Phase 3: ASSEMBLY INTEGRATION (wires calibration into production)
  3.1 Wire QuestionDifficultyResolver into LearningAnalyticsService
      - Call applyCalibratedDifficulty() in the assembly path
      - Enable difficulty_distribution by default
      - Add shadow mode logging first, then activate
  3.2 CalibrationVerificationGate
      - Post-upsert sanity checks
      - Outlier quarantine
  DEPENDS ON: Phase 1 (PASS gate), Phase 2 (consistent scale)
  BLOCKS: Phase 4

Phase 4: ADAPTIVE MATCHING (new feature)
  4.1 DifficultyCategoryRecommender
      - Map mastery -> category
      - Per-kp and aggregate recommendations
  4.2 Wire into IntelligentExamController
      - Auto-fill difficulty_category when not specified
  DEPENDS ON: Phase 3 (calibrated assembly working)

Phase 5: HEALTH MONITORING (ongoing)
  5.1 CalibrationHealthMonitor
      - Scheduled artisan command
      - Drift detection, coverage tracking
      - calibration_health_snapshots table
  5.2 Alert logic
      - Flag when Brier degrades, coverage drops, or drift is detected
  DEPENDS ON: Phase 3 (need production calibration data flowing)
```

### Dependency Graph

```
Phase 1 (Validation) ---------+
                              |
                              v
Phase 2 (Standardization) --> Phase 3 (Assembly Integration)
                                      |
                                      v
                              Phase 4 (Adaptive Matching)
                                      |
                                      v
                              Phase 5 (Health Monitoring)
```

Phase 2 can run in parallel with Phase 1, since it does not depend on validation results; it only centralizes existing normalization logic. Phase 3 requires both a Phase 1 PASS and Phase 2 completion. Phases 4 and 5 are sequential after Phase 3.
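Phase 4.1's core logic is small: the threshold table from Layer 4 plus the trend push from the matching flow. A Python sketch — `recommend_category` is a hypothetical name, and the uniform weighting here is a simplification of the weakness-weighted average described later in the design decisions:

```python
# Category upper bounds (exclusive) from the Layer 4 mapping table.
# 1.01 lets mastery == 1.0 fall into category 4 (competition).
THRESHOLDS = [(0.30, 0), (0.50, 1), (0.70, 2), (0.85, 3), (1.01, 4)]

def recommend_category(masteries, trend="flat"):
    """masteries: dict of kp_code -> mastery score in [0, 1].
    Returns a difficulty category in 0-4."""
    def to_cat(m):
        return next(cat for upper, cat in THRESHOLDS if m < upper)

    # Aggregate across knowledge points (uniform weights for simplicity).
    avg = sum(to_cat(m) for m in masteries.values()) / len(masteries)

    if trend == "up":
        avg += 0.5  # "trending up -> +0.5 category push" from the matching flow

    # Round to the nearest category, floor at 0, cap at 4.
    return max(0, min(4, int(avg + 0.5)))
```

A student at mastery 0.4 on a single knowledge point gets category 1, or category 2 if trending upward; the cap keeps a high-mastery, upward-trending student at category 4.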
## Data Model Additions

### New Table: `calibration_health_snapshots`

```sql
CREATE TABLE calibration_health_snapshots (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    snapshot_date DATE NOT NULL,
    total_questions INT UNSIGNED DEFAULT 0,
    calibrated_count INT UNSIGNED DEFAULT 0,
    coverage_pct DECIMAL(5,2) DEFAULT 0,
    avg_brier_score DECIMAL(8,6) DEFAULT NULL,
    avg_logloss DECIMAL(8,6) DEFAULT NULL,
    pearson_correlation DECIMAL(8,4) DEFAULT NULL,
    mean_abs_residual DECIMAL(8,4) DEFAULT NULL,
    health_scale_avg DECIMAL(5,3) DEFAULT NULL,
    drift_flag TINYINT(1) DEFAULT 0,
    drift_details JSON DEFAULT NULL,
    action VARCHAR(32) DEFAULT 'none',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY idx_snapshot_date (snapshot_date)
);
```

### New Table: `backtest_results`

```sql
CREATE TABLE backtest_results (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    run_id VARCHAR(64) NOT NULL,
    cutoff_date DATE NOT NULL,
    question_bank_id BIGINT UNSIGNED NOT NULL,
    training_attempts INT UNSIGNED DEFAULT 0,
    test_attempts INT UNSIGNED DEFAULT 0,
    predicted_difficulty DECIMAL(6,4) DEFAULT NULL,
    observed_error_rate DECIMAL(6,4) DEFAULT NULL,
    brier_score DECIMAL(8,6) DEFAULT NULL,
    absolute_error DECIMAL(6,4) DEFAULT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_run_id (run_id),
    INDEX idx_cutoff (cutoff_date)
);
```

No changes are needed to existing tables. The `question_difficulty_calibrations` table schema is sufficient.

## Key Design Decisions

### Decision 1: Calibration is an Overlay, Not a Replacement

The calibrated difficulty is an overlay on top of the original difficulty. `QuestionDifficultyResolver` already implements this correctly: the calibrated value takes priority, and the original value is the fallback. This must remain the design. Never write calibrated values back to `questions.difficulty`.

### Decision 2: Gate-Based Activation, Not Feature Flags

Use a deterministic gate (backtest PASS/FAIL) rather than a manual feature flag to enable calibration in production.
The gate should be an artisan command that sets a config value or database flag after validation passes. This prevents human error from enabling an unvalidated algorithm.

### Decision 3: Per-Knowledge-Point Difficulty Recommendation

When recommending a difficulty_category for a student, compute per-knowledge-point mastery and map each to a category. If an exam covers multiple knowledge points, use the weighted average of their recommended categories, weighted by the student's weakness level (weaker knowledge points get more weight, to avoid overwhelming the student).

### Decision 4: Health Monitoring is Separate from Calibration

The existing `getHealthScaleForType()` already provides inline health adjustment. The new `CalibrationHealthMonitor` serves a different purpose: longitudinal tracking and alerting. It should NOT modify calibration behavior directly; instead, it produces reports that humans review to decide whether algorithm parameters need adjustment.

### Decision 5: Backtesting Uses a Temporal Split, Not a Random Split

When validating the calibration algorithm, split data by time (cutoff date) rather than randomly. This is critical because:

1. The algorithm includes time decay, so temporal ordering matters
2. Random splits would leak future information into training
3. Real deployment processes data chronologically

## Sources

- Direct codebase analysis of all referenced PHP service files (HIGH confidence)
- Existing `question_difficulty_calibrations` migration schema (HIGH confidence)
- Adaptive testing and IRT architecture patterns from training data (MEDIUM confidence -- standard patterns in psychometrics literature)
- Brier score and calibration validation approaches from training data (MEDIUM confidence -- well-established statistical methodology)