Domain: K12 math difficulty calibration and intelligent exam matching
Researched: 2026-04-16
Confidence: HIGH (based on direct codebase analysis); MEDIUM (general adaptive testing patterns from training data)
ANSWER FLOW (already working)
============================
Student answers exam
|
v
ExamAnswerAnalysisService.analyzeExamAnswers()
|
+---> MasteryCalculator (knowledge point mastery)
+---> KnowledgeMasteryService (persist mastery)
+---> LocalAIAnalysisService (update mastery)
+---> MistakeBookService (add to mistake book)
+---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
|
v
question_difficulty_calibrations table (upsert)
ASSEMBLY FLOW (partially working)
================================
POST /api/intelligent-exam
|
v
IntelligentExamController.store()
|
v
AssembleExamTaskJob (queued)
|
v
LearningAnalyticsService.generateIntelligentExam()
|
+---> selectQuestions() -- uses raw questions.difficulty
+---> applyTypeAwareDifficultyDistribution()
| |
| v
| DifficultyDistributionService
| (only if enable_difficulty_distribution=true,
| which defaults to FALSE)
|
v
QuestionDifficultyResolver.applyCalibratedDifficulty()
(exists but NOT called in the main assembly path)
| Gap | Location | Impact |
|---|---|---|
| Calibration values not used in assembly | LearningAnalyticsService selects questions using raw questions.difficulty | Assembled exams use uncalibrated difficulty |
| enable_difficulty_distribution defaults false | LearningAnalyticsService line 1554 | Distribution strategy never activates unless the caller explicitly enables it |
| No auto difficulty_category recommendation | No service maps mastery to category | Teachers must manually pick a tier; no student-level adaptation |
| No backtesting validation | QuestionDifficultyCalibrationAnalyzer reports but does not validate | Algorithm accuracy unknown before production use |
| Dual difficulty scale (0-1 vs 0-5) | normalizeDifficultyValue() divides by 5 if > 1.0 | Inconsistent source data enters calibration |
+===================================================================================+
| TARGET ARCHITECTURE |
+===================================================================================+
LAYER 1: VALIDATION (must complete before anything else)
+--------------------------------------------------------+
| |
| CalibrationBacktestService |
| |-- backtestAgainstHistory(cutoffDate) |
| |-- computeBrierScores(questionIds) |
| |-- computePearsonCorrelation() |
| |-- produceValidationReport() |
| +-- PASS/FAIL gate: algo accuracy threshold |
| |
| Data: reads paper_questions + questions (historical) |
| Writes: backtest_results table (or JSON export) |
+--------------------------------------------------------+
|
| PASS gate opens production use
v
LAYER 2: CALIBRATION FEEDBACK LOOP (enhance existing)
+--------------------------------------------------------+
| |
| ExamAnswerAnalysisService |
| |-- (existing) analyzeExamAnswers() |
| +-- (existing) recalibrateQuestionDifficulty() |
| |
| QuestionDifficultyCalibrationService |
| |-- updateOnlineFromPaper() [existing, per-paper] |
| |-- recalibrateQuestionIds() [existing, batch] |
| +-- getHealthScaleForType() [existing, monitoring]|
| |
| NEW: CalibrationVerificationGate |
| |-- validateCalibratedRange(questionIds) |
| |-- flagOutliers(threshold) |
| +-- quarantineBadCalibrations() |
| |
| Data: answer --> calibrate --> verify --> use |
+--------------------------------------------------------+
|
v
LAYER 3: ASSEMBLY INTEGRATION (connect calibration to exam)
+--------------------------------------------------------+
| |
| DifficultyNormalizationService [NEW] |
| |-- normalize(questionId) -> float [0,1] |
| |-- batchNormalize(questionIds) -> map |
| +-- resolves 0-1 vs 0-5 ambiguity at read time |
| |
| QuestionDifficultyResolver [existing, expand usage] |
| |-- applyCalibratedDifficulty(questions) -> arr |
| +-- MUST be called in assembly path |
| |
| LearningAnalyticsService |
| |-- generateIntelligentExam() |
| | +-- CALL DifficultyNormalizationService first |
| | +-- CALL QuestionDifficultyResolver second |
| | +-- SET enable_difficulty_distribution = true |
| +-- remove hard-coded default false |
| |
| DifficultyDistributionService [existing] |
| |-- calculateDistribution(category, total) |
| +-- groupQuestionsByDifficultyRange() |
+--------------------------------------------------------+
|
v
LAYER 4: ADAPTIVE MATCHING (mastery-based difficulty selection)
+--------------------------------------------------------+
| |
| DifficultyCategoryRecommender [NEW] |
| |-- recommendForStudent(studentId, kpCodes) -> cat |
| |-- recommendForKnowledgePoint(studentId, kp) -> cat|
| +-- uses MasteryCalculator + calibration data |
| |
| MasteryCalculator [existing] |
| |-- calculateMasteryLevel(studentId, kpCode) |
| +-- returns mastery [0,1] + confidence + trend |
| |
| Mapping logic: |
| mastery [0.0, 0.30) -> category 0 (zero-foundation) |
| mastery [0.30, 0.50) -> category 1 (foundation) |
| mastery [0.50, 0.70) -> category 2 (intermediate) |
| mastery [0.70, 0.85) -> category 3 (advanced) |
| mastery [0.85, 1.00) -> category 4 (competition) |
+--------------------------------------------------------+
|
v
LAYER 5: HEALTH MONITORING (continuous)
+--------------------------------------------------------+
| |
| CalibrationHealthMonitor [NEW] |
| |-- detectDrift(windowDays) -> drift report |
| |-- accuracyTrend(days) -> accuracy over time |
| |-- calibrationCoverage() -> % questions calibrated |
| +-- scheduled artisan command (daily/weekly) |
| |
| Existing health mechanisms: |
| |-- getHealthScaleForType() in CalibrationService |
| +-- recent_events in algorithm_meta (per-question) |
| |
| NEW: calibration_health_snapshots table |
| |-- date, total_calibrated, avg_brier, |
| | coverage_pct, drift_flag, action |
+--------------------------------------------------------+
| Component | Responsibility | Communicates With | New/Existing |
|---|---|---|---|
| CalibrationBacktestService | Validate algorithm accuracy against historical data | Reads paper_questions, questions, papers. Writes report output. | NEW |
| QuestionDifficultyCalibrationService | Core calibration algorithm (stratified_residual_eb_v2) | Called by ExamAnswerAnalysisService, CalibrationBacktestService | EXISTING |
| CalibrationVerificationGate | Post-calibration sanity checks (range, outlier detection) | Reads question_difficulty_calibrations. Flags problematic entries. | NEW |
| DifficultyNormalizationService | Unify 0-1 / 0-5 scale at read boundary | Called by LearningAnalyticsService during question loading | NEW |
| QuestionDifficultyResolver | Apply calibrated difficulty to question arrays, calibrated-first | Called in assembly path by LearningAnalyticsService | EXISTING (needs wiring) |
| DifficultyDistributionService | Calculate difficulty buckets per category | Called by LearningAnalyticsService when distribution enabled | EXISTING |
| DifficultyCategoryRecommender | Map student mastery to recommended difficulty category | Reads MasteryCalculator. Used by IntelligentExamController | NEW |
| MasteryCalculator | Calculate per-knowledge-point mastery levels | Existing, unchanged | EXISTING |
| CalibrationHealthMonitor | Detect calibration drift, coverage gaps, accuracy degradation | Reads question_difficulty_calibrations. Writes health snapshots. | NEW |
| LearningAnalyticsService | Orchestrate question selection and difficulty distribution | Must call normalization + resolver + distribution | EXISTING (needs modification) |
QUESTION DIFFICULTY LIFECYCLE
=============================
questions.difficulty (original, immutable)
|
v
DifficultyNormalizationService.normalize()
| (resolves 0-1 vs 0-5, stores original_difficulty)
v
question_difficulty_calibrations.original_difficulty
|
+---[calibration loop]---> calibrated_difficulty
| |
v v
QuestionDifficultyResolver.applyCalibratedDifficulty()
|
| Returns: calibrated if exists, else original (normalized)
v
DifficultyDistributionService.groupQuestionsByDifficultyRange()
|
v
Selected questions for exam assembly
ANSWER-TO-CALIBRATION FEEDBACK LOOP
====================================
Student submits exam answers
|
v
ExamAnswerAnalysisService.analyzeExamAnswers()
|
+---> MasteryCalculator (update knowledge mastery)
+---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
|
+-- per-question: compute residual, apply shrinkage, clamp
+-- upsert to question_difficulty_calibrations
+-- append to recent_events in algorithm_meta
|
v
CalibrationVerificationGate (NEW)
|
+-- check calibrated_difficulty in [0.01, 0.99]
+-- flag if delta > 0.30 from original
+-- quarantine if Brier score deteriorating
|
v
Health monitor caches invalidated
(getHealthScaleForType will recompute next call)
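A minimal sketch of the NEW CalibrationVerificationGate, assuming the check thresholds above ([0.01, 0.99] range, 0.30 delta) and an already-computed Brier trend; the method shape and names are illustrative, not final:

class CalibrationVerificationGate
{
    private const MIN_DIFFICULTY = 0.01;
    private const MAX_DIFFICULTY = 0.99;
    private const MAX_DELTA      = 0.30;

    // Returns 'ok', 'flagged', or 'quarantined' for one upserted calibration.
    public function verify(float $calibrated, float $original, ?float $brierTrend): string
    {
        // Hard range check: out-of-range values never reach assembly.
        if ($calibrated < self::MIN_DIFFICULTY || $calibrated > self::MAX_DIFFICULTY) {
            return 'quarantined';
        }
        // A worsening Brier trend (positive slope) means predictions are
        // degrading; quarantine until a human reviews.
        if ($brierTrend !== null && $brierTrend > 0) {
            return 'quarantined';
        }
        // A large jump from the original value is suspicious but not fatal.
        if (abs($calibrated - $original) > self::MAX_DELTA) {
            return 'flagged';
        }
        return 'ok';
    }
}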
BACKTESTING VALIDATION FLOW
============================
CalibrationBacktestService.backtestAgainstHistory(cutoffDate)
|
v
1. Load all questions with >= N attempts before cutoffDate
2. Split: training set (before cutoff) vs test set (after cutoff)
3. Run calibration on training data only
4. For each question in test set:
- predicted = calibrated_difficulty from training
- actual = observed error rate in test period
- Brier score = (predicted - actual)^2
5. Aggregate metrics:
- Mean Brier score (lower = better; < 0.15 is the target)
- Pearson correlation (predicted vs actual; > 0.4 is the target)
- Calibration coverage (% of questions with enough data)
- MAE (mean absolute error; < 0.15 is the target)
6. PASS gate (minimum bar, looser than the targets above):
- Pearson > 0.3 AND Mean Brier < 0.20
- If FAIL: the algorithm needs tuning; do NOT enable in production
|
v
Report: JSON/CSV output + PASS/FAIL verdict
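A sketch of the metric aggregation in step 5, under the assumption that predicted difficulties and observed test-period error rates arrive as parallel non-empty arrays (helper name hypothetical):

function backtestMetrics(array $predicted, array $actual): array
{
    $n = count($predicted);
    $brier = 0.0;
    $mae = 0.0;
    foreach ($predicted as $i => $p) {
        $brier += ($p - $actual[$i]) ** 2;  // per-question Brier score
        $mae   += abs($p - $actual[$i]);    // per-question absolute error
    }
    $brier /= $n;
    $mae   /= $n;

    // Pearson correlation between predicted difficulty and observed error rate.
    $meanP = array_sum($predicted) / $n;
    $meanA = array_sum($actual) / $n;
    $cov = $varP = $varA = 0.0;
    foreach ($predicted as $i => $p) {
        $cov  += ($p - $meanP) * ($actual[$i] - $meanA);
        $varP += ($p - $meanP) ** 2;
        $varA += ($actual[$i] - $meanA) ** 2;
    }
    $pearson = $cov / max(sqrt($varP * $varA), 1e-9);

    return [
        'mean_brier' => $brier,
        'mae'        => $mae,
        'pearson'    => $pearson,
        'pass'       => $pearson > 0.3 && $brier < 0.20, // minimum bar from step 6
    ];
}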
MASTERY-TO-DIFFICULTY MATCHING FLOW
====================================
Exam request (student_id + kp_codes)
|
v
DifficultyCategoryRecommender.recommendForStudent()
|
+-- for each kp_code:
| MasteryCalculator.calculateMasteryLevel(studentId, kp)
| -> mastery [0,1], confidence, trend
|
+-- aggregate mastery across kp_codes (weighted average)
+-- map to category:
| mastery -> category via threshold table
| adjust for trend: trending up -> +0.5 category push
| floor at 0, cap at 4
|
+-- return: recommended category + confidence + reasoning
|
v
IntelligentExamController uses recommended category
|
v
LearningAnalyticsService with enable_difficulty_distribution=true
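A sketch of the threshold mapping inside DifficultyCategoryRecommender, using the Layer 4 boundaries and the +0.5 trend push described above (function name hypothetical):

function mapMasteryToCategory(float $mastery, int $trend): int
{
    $category = match (true) {
        $mastery < 0.30 => 0,  // zero-foundation
        $mastery < 0.50 => 1,  // foundation
        $mastery < 0.70 => 2,  // intermediate
        $mastery < 0.85 => 3,  // advanced
        default         => 4,  // competition
    };
    // Trending up pushes half a category; round, then floor at 0 / cap at 4.
    $adjusted = $category + ($trend > 0 ? 0.5 : 0.0);
    return (int) max(0, min(4, round($adjusted)));
}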
What: A calibration value must pass validation before it can influence production behavior. Once validated, components progressively unlock.
When: Any system where unvalidated statistical estimates would harm user experience.
Why this matters here: The project explicitly requires "validation before production use." The current code already has calibration running, but it is not connected to assembly. This is correct; the backtest gate formalizes the transition.
Gate states:
LOCKED -- calibration runs, values stored, NOT used in assembly
TESTED -- backtest passed, enable for shadow mode (log but don't act)
ACTIVE -- fully enabled in production assembly path
Example implementation:
// config/exam.php (file name illustrative) or a database-backed setting
'calibration_gate' => 'locked', // locked | tested | active

// In LearningAnalyticsService, during assembly:
if (config('exam.calibration_gate') === 'active') {
    $questions = $resolver->applyCalibratedDifficulty($questions);
}
What: When multiple difficulty values exist for a question, follow a deterministic priority chain rather than ad-hoc logic.
When: Any lookup where calibrated, original, and estimated values coexist.
Why: The current QuestionDifficultyResolver already implements this pattern correctly (calibrated > original). It just needs to be consistently called.
// Priority chain (already implemented in QuestionDifficultyResolver):
// 1. calibrated_difficulty (from question_difficulty_calibrations)
// 2. normalized questions.difficulty (0-1 scale, divide-by-5 if needed)
// 3. fallback 0.5 (moderate default)
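Restated as a minimal sketch (the real implementation lives in QuestionDifficultyResolver; this only illustrates the three steps):

function resolveDifficulty(array $question, array $calibrations): float
{
    $id = $question['id'];

    // 1. Calibrated value wins when one exists.
    if (isset($calibrations[$id]['calibrated_difficulty'])) {
        return (float) $calibrations[$id]['calibrated_difficulty'];
    }
    // 2. Normalized original: values on the 0-5 scale are divided by 5.
    if (isset($question['difficulty'])) {
        $d = (float) $question['difficulty'];
        return $d > 1.0 ? $d / 5.0 : $d;
    }
    // 3. Moderate default when nothing is known.
    return 0.5;
}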
What: Before enabling calibrated difficulty in actual exam assembly, run both paths in parallel and compare results without affecting output.
When: Connecting a validated but previously-disconnected statistical system to production.
Why: Even with backtest validation, real-time behavior may differ from the historical backtest. Shadow mode catches integration bugs.
// In LearningAnalyticsService assembly:
$rawQuestions = $selectedQuestions; // current behavior
$calibratedQuestions = $resolver->applyCalibratedDifficulty($rawQuestions);

// Log the comparison without using calibrated values yet
Log::info('Shadow mode difficulty comparison', [
    'raw_avg'        => collect($rawQuestions)->avg('difficulty'),
    'calibrated_avg' => collect($calibratedQuestions)->avg('difficulty'),
    'diff_count'     => count(array_filter($calibratedQuestions, fn ($q) =>
        ($q['difficulty_source'] ?? '') === 'calibrated'
    )),
]);

// Use raw (unchanged behavior) until the gate opens
$selectedQuestions = $rawQuestions;
What: Already implemented in the codebase. The calibration algorithm computes expected error rates per (question_type, difficulty_category) stratum, then adjusts based on the residual (observed - expected).
When: This is the core calibration algorithm. No changes needed to the algorithm itself per project scope.
The existing algorithm is well-structured:
- buildGlobalBaselines() computes per-stratum error rates
- estimateOnlineBySingleOutcome() processes one answer event
- estimateByStratifiedResidual() processes historical data
- getHealthScaleForType() auto-reduces step size when degrading

What: Weight recent observations more heavily than old ones using exponential decay. Already implemented with a 45-day half-life.
When: Any aggregation of student performance or calibration data.
Why: K12 students improve; old responses are less predictive. The existing 45-day half-life is reasonable.
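A sketch of the 45-day half-life weighting described above (function names illustrative; the real logic lives in the calibration service):

function decayWeight(int $ageDays, float $halfLifeDays = 45.0): float
{
    // Weight halves every 45 days: w = 0.5^(age / half_life).
    return pow(0.5, $ageDays / $halfLifeDays);
}

// Decay-weighted error rate over a set of answer events.
function weightedErrorRate(array $events): float
{
    $num = $den = 0.0;
    foreach ($events as $e) {
        $w = decayWeight($e['age_days']);
        $num += $w * ($e['correct'] ? 0.0 : 1.0); // 1 = error
        $den += $w;
    }
    return $den > 0 ? $num / $den : 0.5; // moderate default when empty
}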
What: Writing calibrated values back to questions.difficulty.
Why bad: Destroys the original reference value, makes debugging impossible, and violates the project constraint.
Instead: Keep the dual-table design. questions.difficulty stays immutable; question_difficulty_calibrations is the mutable overlay.
What: Computing one difficulty_category for a student across all knowledge points and applying it everywhere.
Why bad: A student may be advanced in algebra but a beginner in geometry. A global category creates mismatched exams.
Instead: Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation. Aggregate only when an exam spans multiple knowledge points.
What: Wiring calibration directly into the assembly path without the backtest validation step.
Why bad: If the algorithm has systematic bias (e.g., it always overestimates difficulty for certain question types), it makes exams worse, not better.
Instead: Backtest first. The backtest is a prerequisite gate, not an optional report.
What: Mixing 0-5 scale difficulty values with 0-1 scale values in the same computation.
Why bad: A 0-5 value of 0.4 (easy) gets treated as 0.4 on 0-1 scale (hard), producing inverted difficulty estimates.
Instead: Normalize at the read boundary. The existing normalizeDifficultyValue() in QuestionDifficultyCalibrationService handles this for calibration input, but LearningAnalyticsService does not normalize when loading questions for assembly. This must be fixed.
What: Allowing calibration to update on every single answer without any dampening.
Why bad: A single anomalous cohort (e.g., a class that all guesses randomly) can corrupt calibration values.
Instead: The existing algorithm handles this well with shrinkage (the prior pulls toward the original), step limits, minimum sample thresholds, and health scaling. Do not remove these safeguards.
| Concern | At 100 questions | At 10K questions | At 100K questions |
|---|---|---|---|
| Calibration table size | Negligible | ~10K rows, fast with index on question_bank_id | ~100K rows; add composite index on (calibrated_difficulty, updated_at) |
| Backtest computation | < 1 second | 5-30 seconds depending on attempt count | Minutes; run as queued job, cache results |
| Per-answer calibration | < 10ms (single upsert) | < 10ms | < 10ms (indexed lookup + single upsert) |
| Health monitoring | Negligible (scans recent rows) | 1-5 seconds (parsing algorithm_meta JSON) | 5-30 seconds; extract health metrics to dedicated columns |
| Mastery-to-category recommendation | < 50ms (1 mastery lookup) | < 50ms | < 100ms (batch mastery lookup for multiple kp_codes) |
| applyCalibratedDifficulty batch | < 5ms | < 20ms (WHERE IN) | < 100ms; add chunking for > 1000 IDs |
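A sketch of the chunking suggested in the last row, assuming an Eloquent model named QuestionDifficultyCalibration over question_difficulty_calibrations (model name is an assumption):

$calibrations = [];
foreach (array_chunk($questionIds, 1000) as $chunk) { // keep WHERE IN lists bounded
    $calibrations += QuestionDifficultyCalibration::query()
        ->whereIn('question_bank_id', $chunk)
        ->pluck('calibrated_difficulty', 'question_bank_id')
        ->all(); // array union preserves question IDs as keys
}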
Phase 1: VALIDATION (must be first -- blocks everything else)
1.1 CalibrationBacktestService
- Reads historical data
- Computes Pearson, Brier, MAE
- Produces PASS/FAIL report
DEPENDS ON: existing QuestionDifficultyCalibrationService, existing data
BLOCKS: Phase 2, 3, 4
Phase 2: DIFFICULTY STANDARDIZATION (no behavioral change)
2.1 DifficultyNormalizationService
- Extract and centralize normalizeDifficultyValue() logic
- Apply at question-loading boundary in LearningAnalyticsService
DEPENDS ON: nothing new
BLOCKS: Phase 3 (need consistent scale before using calibration)
Phase 3: ASSEMBLY INTEGRATION (wires calibration into production)
3.1 Wire QuestionDifficultyResolver into LearningAnalyticsService
- Call applyCalibratedDifficulty() in the assembly path
- Enable difficulty_distribution by default
- Add shadow mode logging first, then activate
3.2 CalibrationVerificationGate
- Post-upsert sanity checks
- Outlier quarantine
DEPENDS ON: Phase 1 (PASS gate), Phase 2 (consistent scale)
BLOCKS: Phase 4
Phase 4: ADAPTIVE MATCHING (new feature)
4.1 DifficultyCategoryRecommender
- Map mastery -> category
- Per-kp and aggregate recommendations
4.2 Wire into IntelligentExamController
- Auto-fill difficulty_category when not specified
DEPENDS ON: Phase 3 (calibrated assembly working)
Phase 5: HEALTH MONITORING (ongoing)
5.1 CalibrationHealthMonitor
- Scheduled artisan command
- Drift detection, coverage tracking
- calibration_health_snapshots table
5.2 Alert logic
- Flag when Brier degrades, coverage drops, drift detected
DEPENDS ON: Phase 3 (need production calibration data flowing)
Phase 1 (Validation) ---------------+
                                    |
                                    v
Phase 2 (Standardization) ----> Phase 3 (Assembly Integration)
                                    |
                                    v
                                Phase 4 (Adaptive Matching)
                                    |
                                    v
                                Phase 5 (Health Monitoring)
Phase 2 can run in parallel with Phase 1 since it does not depend on validation results. It only centralizes existing normalization logic. Phase 3 requires both Phase 1 PASS and Phase 2 completion. Phases 4 and 5 are sequential after Phase 3.
calibration_health_snapshots:

CREATE TABLE calibration_health_snapshots (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
snapshot_date DATE NOT NULL,
total_questions INT UNSIGNED DEFAULT 0,
calibrated_count INT UNSIGNED DEFAULT 0,
coverage_pct DECIMAL(5,2) DEFAULT 0,
avg_brier_score DECIMAL(8,6) DEFAULT NULL,
avg_logloss DECIMAL(8,6) DEFAULT NULL,
pearson_correlation DECIMAL(8,4) DEFAULT NULL,
mean_abs_residual DECIMAL(8,4) DEFAULT NULL,
health_scale_avg DECIMAL(5,3) DEFAULT NULL,
drift_flag TINYINT(1) DEFAULT 0,
drift_details JSON DEFAULT NULL,
action VARCHAR(32) DEFAULT 'none',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY idx_snapshot_date (snapshot_date)
);
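The daily snapshot can be driven from the Laravel scheduler; a sketch assuming a command named calibration:health-snapshot (name not final):

// app/Console/Kernel.php
use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    // Writes one calibration_health_snapshots row per day.
    $schedule->command('calibration:health-snapshot')->dailyAt('02:00');
}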
backtest_results:

CREATE TABLE backtest_results (
id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
run_id VARCHAR(64) NOT NULL,
cutoff_date DATE NOT NULL,
question_bank_id BIGINT UNSIGNED NOT NULL,
training_attempts INT UNSIGNED DEFAULT 0,
test_attempts INT UNSIGNED DEFAULT 0,
predicted_difficulty DECIMAL(6,4) DEFAULT NULL,
observed_error_rate DECIMAL(6,4) DEFAULT NULL,
brier_score DECIMAL(8,6) DEFAULT NULL,
absolute_error DECIMAL(6,4) DEFAULT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
INDEX idx_run_id (run_id),
INDEX idx_cutoff (cutoff_date)
);
No changes needed to existing tables. The question_difficulty_calibrations table schema is sufficient.
The calibrated difficulty is an overlay on top of the original difficulty. The QuestionDifficultyResolver already implements this correctly: calibrated value takes priority, original value is the fallback. This must remain the design. Never write calibrated values back to questions.difficulty.
Use a deterministic gate (backtest PASS/FAIL) rather than a manual feature flag to enable calibration in production. The gate should be an artisan command that sets a config value or database flag after validation passes. This prevents human error from enabling an unvalidated algorithm.
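A hypothetical artisan command implementing this; the command name, service namespace, report shape, and the setting() helper are all assumptions to adapt to the codebase:

use App\Services\CalibrationBacktestService;
use Illuminate\Console\Command;

class OpenCalibrationGate extends Command
{
    protected $signature = 'calibration:open-gate {--cutoff=}';
    protected $description = 'Run the backtest and unlock calibration only on PASS';

    public function handle(CalibrationBacktestService $backtest): int
    {
        $report = $backtest->backtestAgainstHistory($this->option('cutoff'));

        if (empty($report['pass'])) {
            $this->error('Backtest FAILED; gate stays LOCKED.');
            return self::FAILURE;
        }
        // Persist the gate state (settings table or cached config assumed).
        setting(['calibration_gate' => 'tested']); // shadow mode before ACTIVE
        $this->info('Backtest PASSED; gate moved to TESTED (shadow mode).');
        return self::SUCCESS;
    }
}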
When recommending difficulty_category for a student, compute per-knowledge-point mastery and map each to a category. If an exam covers multiple knowledge points, use the weighted average of their recommended categories, weighted by the student's weakness level (weaker knowledge points get more weight to avoid overwhelming the student).
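A sketch of that weakness-weighted aggregation, assuming per-knowledge-point inputs of the form ['mastery' => float, 'category' => int]:

function aggregateRecommendedCategory(array $perKp): int
{
    $weightedSum = 0.0;
    $weightTotal = 0.0;
    foreach ($perKp as $kp) {
        $weight = 1.0 - $kp['mastery'];          // weaker point => heavier weight
        $weightedSum += $kp['category'] * $weight;
        $weightTotal += $weight;
    }
    if ($weightTotal <= 0.0) {
        return 4; // every point fully mastered; top tier applies
    }
    return (int) max(0, min(4, round($weightedSum / $weightTotal)));
}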
The existing getHealthScaleForType() already provides inline health adjustment. The new CalibrationHealthMonitor serves a different purpose: longitudinal tracking and alerting. It should NOT modify calibration behavior directly; instead, it produces reports that humans review to decide if algorithm parameters need adjustment.
When validating the calibration algorithm, split data by time (cutoff date) rather than randomly. This is critical because:
- It mirrors production: at assembly time, calibration can only ever use past answers to predict future performance.
- A random split leaks future information into training, inflating the Brier and Pearson metrics and defeating the PASS gate.
Source: question_difficulty_calibrations migration schema (HIGH confidence)