Domain: Educational assessment -- difficulty calibration validation, pipeline wiring, adaptive exam assembly
Researched: 2026-04-16
Confidence: HIGH (codebase analysis + established IRT/calibration domain knowledge)
Mistakes that cause rewrites, invalid results, or systemic failure in production.
## Pitfall 1: Circular Validation

What goes wrong: Running `recalibrateQuestionIds()` on historical data, then computing Brier score / error rate against those same `paper_questions` rows, produces artificially good metrics. The algorithm was trained on this data; of course it fits.
Why it happens: The existing AnalyzeQuestionDifficultyCalibrationCommand reports calibration_gap = bank_difficulty_normalized - empirical_error_rate. This measures agreement between original difficulty and observed rates, NOT whether the calibrated difficulty predicts unseen data. The existing report gives a false sense of validation.
Consequences: Calibrated difficulty goes into production, but on new student data the predictions are no better (or worse) than the original difficulty. The entire calibration effort is wasted, and worse, it introduces noise into the exam assembly pipeline.
Prevention: Use a temporal train/test split. Calibrate only on responses before a cutoff date, then score the calibrated difficulty's predictions (Brier score, error rate) exclusively on responses after the cutoff, and compare against the original difficulty's predictions on the same held-out window.
Warning signs: Calibration metrics look suspiciously good, and the evaluation dataset overlaps the data fed to `recalibrateQuestionIds()`.
Phase: Phase 1 (Calibration Validation) -- this is the single most important thing to get right.
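The temporal split is the core fix here. A minimal sketch in plain Python (not the project's PHP; `brier`, `temporal_split`, and the record fields are hypothetical): calibrate only on responses before a cutoff date, and score predictions only on responses after it.

```python
from datetime import date

def brier(pred_error_rates, outcomes):
    """Mean squared gap between predicted error probability and the
    observed outcome (1 = wrong, 0 = correct). Lower is better."""
    pairs = list(zip(pred_error_rates, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

def temporal_split(responses, cutoff):
    """Responses before the cutoff feed recalibration; responses on or
    after it are held out strictly for evaluation."""
    train = [r for r in responses if r["answered_at"] < cutoff]
    test = [r for r in responses if r["answered_at"] >= cutoff]
    return train, test

# Toy data: predicted error probability vs. what actually happened.
responses = [
    {"answered_at": date(2025, 11, 1), "pred": 0.4, "wrong": 1},
    {"answered_at": date(2025, 12, 1), "pred": 0.4, "wrong": 0},
    {"answered_at": date(2026, 2, 1), "pred": 0.4, "wrong": 1},
    {"answered_at": date(2026, 3, 1), "pred": 0.4, "wrong": 0},
]
train, test = temporal_split(responses, date(2026, 1, 1))
# Only this held-out score says anything about predictive power; an
# in-sample score on `train` will always flatter the calibration.
holdout_brier = brier([r["pred"] for r in test], [r["wrong"] for r in test])
```

Run the same held-out scoring for the original (uncalibrated) difficulty; the calibration is validated only if it beats that baseline on the post-cutoff window.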
## Pitfall 2: Dual Difficulty Standard

What goes wrong: The codebase has two difficulty scales. `questions.difficulty` stores raw values that may be 0-1 OR 0-5 depending on when and how the question was created. The `normalizeDifficultyValue()` method in `QuestionDifficultyCalibrationService` tries to detect this (line 948: `if ($raw > 1.0) { $raw = $raw / 5.0; }`), but this heuristic is fragile and wrong in specific cases:
- The `> 1.0` heuristic is unreliable for distinguishing scales: a legitimate 0-5 value at or below 1.0 passes through unscaled.

Why it happens: Legacy data entry used different conventions at different times. No migration normalized the raw values when the system was built.
Consequences: Values normalized on the wrong scale corrupt the original difficulty, so every downstream quantity -- calibration gaps, shrinkage priors, bucket assignments -- is computed against a meaningless baseline.
Prevention:
- Audit the `questions.difficulty` distribution. If there is a bimodal distribution (cluster near 0-1 AND cluster near 0-5), the dual-standard problem exists.
- Add a `difficulty_scale` column or flag to the `questions` table indicating which scale was used, OR do a one-time migration to normalize all values to 0-1.
- Add a date-based fallback to `normalizeDifficultyValue()`: if the question was created before a known cutoff date, assume the 0-5 scale.

Warning signs:
- `questions.difficulty` shows values like 2.0, 3.5, 4.0 (clearly 0-5 scale) mixed with 0.1, 0.5, 0.8.
- `DifficultyDistributionService` bucket counts look wrong: almost all questions in one bucket.

Phase: Phase 1 (must resolve BEFORE validation can be trusted).
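The bimodality audit can be made concrete with a simple split (an illustrative sketch; treating any value above 1.0 as 0-5 evidence is itself the fragile assumption under discussion):

```python
def audit_difficulty_scales(values):
    """Split a difficulty column into candidate 0-1 and 0-5 clusters.
    Both clusters being non-empty is evidence of the dual standard."""
    low = [v for v in values if 0.0 <= v <= 1.0]
    high = [v for v in values if v > 1.0]
    return {
        "count_0_1": len(low),
        "count_above_1": len(high),
        "dual_standard_suspected": bool(low) and bool(high),
    }

# Values like 2.0/3.5/4.0 mixed with 0.1/0.5/0.8 trip the check.
report = audit_difficulty_scales([0.1, 0.5, 0.8, 2.0, 3.5, 4.0])
```

Note the blind spot: a genuine 0-5 value of 1.0 or less lands in the "low" cluster, which is why a creation-date flag or scale column is still needed to resolve individual rows.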
## Pitfall 3: Low Sample Size

What goes wrong: The algorithm has a minimum `weighted_attempts` threshold of 8 for batch mode (line 658) and uses Bayesian shrinkage, but the online mode has NO minimum -- it always updates. Questions with 1-2 responses get a calibrated difficulty that is essentially random noise, and the shrinkage prior (the original difficulty) may itself be wrong (see Pitfall 2). The system then treats this calibrated value as authoritative.
Why it happens: Each new grading event triggers updateOnlineFromPaper(), which updates calibrated difficulty for every question on that paper regardless of sample size. The adaptive step limiting (maxStep) mitigates large jumps, but cumulative small biases from low-sample updates still accumulate.
Consequences: Noise from low-sample questions flows straight into exam assembly; a question seen only a handful of times can swing between difficulty buckets on every update.
Prevention:
- Add a `sample_confidence` field to the calibration table: `sqrt(weighted_attempts / 80)`. Use this downstream.
- In `QuestionDifficultyResolver::applyCalibratedDifficulty()`, only apply calibrated difficulty when `weighted_attempts >= min_threshold` (suggest 15-20 for production use). Fall back to original difficulty otherwise.

Warning signs:
- The average `weighted_attempts` across calibrated questions is below 10.
- Many calibrated questions have `attempts < 5`.

Phase: Phase 1 (validation must include sample size audit), Phase 2 (resolver must enforce minimum sample threshold).
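The suggested resolver gating might look like this (a sketch: the threshold, the 80-attempt normalizer, and the function names are taken from the prevention notes above, not from the actual `QuestionDifficultyResolver` API):

```python
import math

MIN_WEIGHTED_ATTEMPTS = 15  # suggested 15-20 for production

def sample_confidence(weighted_attempts, full_at=80.0):
    """sqrt(weighted_attempts / 80), capped at 1.0 -- the proposed
    downstream confidence signal."""
    return min(1.0, math.sqrt(weighted_attempts / full_at))

def resolve_difficulty(original, calibrated, weighted_attempts):
    """Only trust the calibrated value once the sample clears the
    minimum; otherwise fall back to the original difficulty."""
    if calibrated is not None and weighted_attempts >= MIN_WEIGHTED_ATTEMPTS:
        return calibrated
    return original

low_sample = resolve_difficulty(0.5, 0.8, weighted_attempts=3)    # falls back
high_sample = resolve_difficulty(0.5, 0.8, weighted_attempts=20)  # trusted
```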
## Pitfall 4: Baseline Self-Reference

What goes wrong: `buildGlobalBaselines()` computes expected error rates by question_type x difficulty_category from ALL `paper_questions` data. Then `estimateByStratifiedResidual()` computes `residual = observed_error_rate - baseline_error_rate` for individual questions. But the baseline was computed FROM those same questions' responses. This creates a regression-to-the-mean artifact: each question's responses pull its own stratum baseline toward its observed rate, biasing residuals toward zero (most severely in small strata).
Why it happens: Proper IRT uses student ability estimates as a conditioning variable (given student ability of theta, what is the probability of error?). This system uses difficulty_category as a proxy for student ability, but difficulty_category is assigned to the PAPER, not the student, and is itself potentially miscalibrated.
Consequences: Questions in thin strata get residuals that reflect their own weight in the baseline rather than genuine miscalibration, so adjustments are systematically damped toward the stratum mean.
Prevention: Compute leave-one-out baselines (exclude the target question's responses when computing its stratum baseline), and audit stratum sizes -- flag any stratum where a single question contributes a large fraction of the responses.
Warning signs: Strata containing only a few questions; residuals that shrink as a question's share of its stratum grows.
Phase: Phase 1 (audit baseline computation), Phase 2 (consider alternative stratification).
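The leave-one-out fix can be sketched as follows (data shapes are hypothetical; the point is that question q's own responses never enter q's baseline):

```python
def loo_baseline(stratum_stats, question_id):
    """Leave-one-out baseline error rate for one question.

    stratum_stats: {question_id: (wrong_count, attempt_count)} for a
    single question_type x difficulty_category stratum."""
    wrong = sum(w for qid, (w, n) in stratum_stats.items() if qid != question_id)
    attempts = sum(n for qid, (w, n) in stratum_stats.items() if qid != question_id)
    return wrong / attempts if attempts else None

stats = {"q1": (30, 100), "q2": (10, 100), "q3": (20, 100)}
# Baseline for q1 ignores q1's own 30/100, so q1 cannot drag its own
# reference point toward its observed rate.
b1 = loo_baseline(stats, "q1")   # (10 + 20) / 200 = 0.15
residual_q1 = 30 / 100 - b1      # 0.30 - 0.15 = 0.15
```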
## Pitfall 5: No A/B Testing

What goes wrong: After validation, flipping `enable_difficulty_distribution` to true and having `QuestionDifficultyResolver` override all difficulty values in one deployment. If calibration is systematically biased in any direction, ALL exams immediately get worse.
Why it happens: The existing code path in LearningAnalyticsService has enable_difficulty_distribution as a boolean flag checked per-request. The natural "fix" is to set it to true globally. But the code has multiple assembly paths (diagnostic, practice, mistake, textbook, knowledge_points), each with different fallback behavior.
Consequences:
- A systematic bias degrades every exam type at once, with no control group against which to detect it.
- Some assembly paths may not use `QuestionDifficultyResolver` at all, creating inconsistency between exam types.

Prevention:
- Roll out behind a feature flag per assembly type (diagnostic, practice, mistake, textbook, knowledge_points), enabling one path at a time.
- Record which difficulty source was used per exam and monitor outcome metrics per path before widening the rollout.

Warning signs:
- No `difficulty_source` in assembled exam metadata.
- `enable_difficulty_distribution` is a single boolean controlling all paths.

Phase: Phase 2 (pipeline wiring), requires feature flags and monitoring.
## Pitfall 6: Boundary Effects

What goes wrong: `DifficultyDistributionService::classifyQuestionByDifficulty()` uses strict boundary comparisons. A question with difficulty exactly 0.25 falls into different buckets depending on category:
- `difficulty >= 0.25 && difficulty <= 0.5` goes to `primary_medium`
- `difficulty >= 0 && difficulty <= 0.25` goes to `primary_medium`

But a question at 0.25001 goes to `primary_high` for category 2. This means tiny calibration changes (0.001) can shift questions between buckets, causing the assembled exam's difficulty profile to be dramatically different.
Why it happens: Boundary values are hard-coded without hysteresis or soft boundaries. The calibrated difficulty is stored with 4 decimal places, making boundary crossings likely.
Consequences:
- `getSupplementOrder()` fallback logic kicks in differently depending on boundary crossings, introducing non-deterministic exam composition.

Prevention:
- Use soft boundaries or buffer zones: keep a question in its current bucket unless the calibrated value moves past the boundary by a margin (hysteresis).

Warning signs:
- `groupQuestionsByDifficultyRange()` returns very uneven bucket sizes.

Phase: Phase 2 (when wiring distribution service), Phase 3 (when tuning difficulty_category recommendation).
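A sketch of the buffer-zone idea (boundaries, buffer width, and bucket indices are illustrative, not the actual `DifficultyDistributionService` thresholds): a question changes bucket only when it clears the boundary by more than the buffer.

```python
BOUNDARIES = [0.25, 0.5, 0.75]  # bucket edges on the 0-1 scale
BUFFER = 0.02                   # hysteresis margin around each edge

def bucket_of(difficulty):
    """Strict bucket index: how many edges the value has passed."""
    return sum(1 for b in BOUNDARIES if difficulty > b)

def bucket_with_hysteresis(difficulty, previous_bucket):
    """Keep the previous bucket unless the value is clearly past the
    boundary separating the old and new buckets."""
    raw = bucket_of(difficulty)
    if raw == previous_bucket:
        return previous_bucket
    boundary = BOUNDARIES[min(raw, previous_bucket)]
    if abs(difficulty - boundary) > BUFFER:
        return raw
    return previous_bucket

# A 0.001 calibration wobble across 0.25 no longer flips the bucket:
stays = bucket_with_hysteresis(0.2501, previous_bucket=0)  # stays in 0
moves = bucket_with_hysteresis(0.30, previous_bucket=0)    # genuinely moved
```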
## Pitfall 7: Time Decay Appropriateness

What goes wrong: The 45-day half-life decay means responses older than ~6 months retain under 7% weight (four half-lives). For K12 math, this creates a systematic bias: questions that were easier in earlier grades (when students were learning the concept) appear harder than they really are for the current cohort, because the only remaining data is from students who struggled recently.
Why it happens: The algorithm is designed for "dynamic" difficulty that responds to recent trends. But K12 math content has strong grade-level alignment -- a question appropriate for Grade 7 will ALWAYS be answered by Grade 7 students. The time decay doesn't differentiate between "the question got harder" and "we're seeing different students."
Consequences: Calibrated difficulty drifts to reflect whichever cohort answered most recently, rather than a stable property of the question.
Prevention: During validation, run the calibration both with and without time decay and compare out-of-sample predictions; keep decay only if it measurably improves them.
Warning signs: Calibrated difficulty for stable, grade-aligned questions drifts from term to term even though the question has not changed.
Phase: Phase 1 (validate with and without time decay to see which produces better out-of-sample predictions).
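Assuming a plain exponential half-life (the project's exact weighting may differ), the decay curve makes the data-vanishing effect easy to see:

```python
def decay_weight(age_days, half_life_days=45.0):
    """Response weight halves every `half_life_days` days."""
    return 0.5 ** (age_days / half_life_days)

w_fresh = decay_weight(0)    # 1.0
w_90 = decay_weight(90)      # 0.25 after two half-lives
w_180 = decay_weight(180)    # 0.0625 after four half-lives (~6 months)

# For the Phase 1 comparison: rerun the calibration with decay disabled
# and compare out-of-sample Brier scores against the decayed version.
no_decay = decay_weight(180, half_life_days=float("inf"))  # 1.0
```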
## Pitfall 8: Health Monitor Degeneracy

What goes wrong: The `getHealthScaleForType()` method reduces the step size when recent Brier scores are worsening. But if the INITIAL calibration was wrong (bad original difficulty), every correction attempt registers as a worsening, and the monitor keeps shrinking `health_scale` (multiplying the step by 0.78 or 0.82). This creates a death spiral where the system becomes too cautious to ever self-correct.
Why it happens: The health monitor compares brier_after vs brier_before per event. If the calibration is already wrong, both "before" and "after" are bad, but "after" can be slightly worse due to noise. The cumulative delta being positive triggers the 0.78 multiplier.
Consequences: Badly calibrated questions freeze at a wrong value; the step size becomes too small for the system to ever walk back to the truth.
Prevention: Raise the health_scale floor to 0.6 so steps never become negligible, and add a reset mechanism that restores full step size after a cooldown or a sustained run of fresh events.
Warning signs: health_scale pinned at its floor for many questions; cumulative Brier deltas hovering near zero while calibration gaps stay large.
Phase: Phase 1 (audit health monitor behavior during validation), Phase 2 (tune parameters before production).
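A sketch of the prevention ideas (the 0.78 multiplier comes from the description above; the 0.6 floor, the reset trigger, and everything else are assumptions, not the actual monitor logic):

```python
HEALTH_FLOOR = 0.6        # raised floor: steps never become negligible
RESET_AFTER_EVENTS = 50   # periodic reset so self-correction stays possible

def update_health(health, brier_delta, events_since_reset):
    """Shrink the health multiplier when predictions got worse
    (positive cumulative Brier delta), but never below the floor,
    and reset to full strength after enough fresh events."""
    if events_since_reset >= RESET_AFTER_EVENTS:
        return 1.0
    if brier_delta > 0:       # predictions worsened on this event
        health *= 0.78
    return max(HEALTH_FLOOR, health)

h = 1.0
for _ in range(10):           # ten worsening events in a row
    h = update_health(h, brier_delta=0.01, events_since_reset=0)
# The floor stops the death spiral: h bottoms out at 0.6 instead of ~0.08.
```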
## Pitfall 9: Mode Inconsistency (Batch vs Online)

What goes wrong: The batch mode (`estimateByStratifiedResidual`) and online mode (`estimateOnlineBySingleOutcome`) use different logic:
- Batch mode requires 8+ weighted attempts before making any adjustment.
- Online mode caps each update at `maxStep = 0.30 * (0.35 + 0.65 * confidence) * healthScale`. At `weighted_attempts = 1`, confidence = 0.0125, giving maxStep ~ 0.11 -- a non-trivial adjustment from a single data point.

A question that gets batch-recalibrated with 7 responses gets NO adjustment. The same question that gets 7 online updates gets 7 incremental adjustments. The final calibrated difficulty can differ significantly.
Prevention: Align the minimum sample thresholds between modes. Either both modes should require 8+ weighted attempts before any adjustment, or both should allow incremental updates. Document which mode is the source of truth.
Phase: Phase 1 (resolve during validation -- both modes should produce similar results on the same data).
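Plugging numbers into the quoted cap formula (assuming confidence = weighted_attempts / 80, which matches the 0.0125 figure above) makes the asymmetry concrete:

```python
def max_step(weighted_attempts, health_scale=1.0):
    """The online-mode step cap quoted above; the linear confidence
    term (weighted_attempts / 80) is an assumption consistent with
    confidence = 0.0125 at a single attempt."""
    confidence = min(1.0, weighted_attempts / 80.0)
    return 0.30 * (0.35 + 0.65 * confidence) * health_scale

step_1 = max_step(1)    # ~0.107: a sizable cap from ONE response
step_7 = max_step(7)    # ~0.122 per update, applied 7 times online --
                        # while batch mode makes NO adjustment at 7
step_80 = max_step(80)  # 0.30 at full confidence
```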
## Pitfall 10: Pool Exhaustion

What goes wrong: `DifficultyDistributionService` defines narrow ranges for each category. Category 0 requires 90% of questions with difficulty 0-0.1. If the question pool for a given knowledge point + question_type has very few questions in that range, the assembly either:
- falls back to `getSupplementOrder()` and fills with "other" bucket questions (defeating the difficulty targeting), or
- returns an exam with fewer questions than requested.

Prevention: Before enabling difficulty distribution, analyze the question pool by knowledge point to verify sufficient coverage at each difficulty level. If coverage is sparse, either widen the acceptable ranges for thin pools or hold off enabling the feature for those knowledge points until more questions exist.
Phase: Phase 2 (before enabling enable_difficulty_distribution by default).
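A pre-flight coverage check can be very simple (bucket edges, field names, and the minimum count here are hypothetical):

```python
from collections import Counter

def bucket(difficulty, edges=(0.1, 0.25, 0.5, 0.75)):
    """Index of the difficulty bucket (0 = easiest)."""
    return sum(1 for e in edges if difficulty > e)

def sparse_cells(questions, min_per_bucket=5, n_buckets=5):
    """(knowledge_point, bucket) cells with too few questions to
    satisfy difficulty-targeted assembly."""
    counts = Counter((q["kp"], bucket(q["difficulty"])) for q in questions)
    kps = {q["kp"] for q in questions}
    return sorted(
        (kp, b)
        for kp in kps
        for b in range(n_buckets)
        if counts[(kp, b)] < min_per_bucket
    )

# A pool concentrated in the easiest bucket leaves four empty cells.
pool = [{"kp": "fractions", "difficulty": 0.05} for _ in range(6)]
gaps = sparse_cells(pool)
```

Running this per knowledge point before flipping the flag turns "silent fallback to other buckets" into an explicit, reviewable list of gaps.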
## Pitfall 11: Question-Type Heterogeneity

What goes wrong: Choice questions (multiple choice) have a baseline ~25% correct-by-guessing rate. Fill-in-the-blank and open-ended questions have no guessing bonus. The calibration algorithm treats "is_correct" the same across all types. A choice question with calibrated difficulty 0.5 does NOT have the same "true difficulty" as a fill-in question with calibrated difficulty 0.5.
Prevention: Apply a guessing correction for choice questions (rescale the observed correct rate by the ~25% guess floor), or calibrate and compare difficulty within question type rather than across types.
Phase: Phase 3 (when building the mastery-to-difficulty_category recommendation).
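The classical correction-for-guessing model (not something currently in the codebase) shows how far apart "same observed rate" can be across types:

```python
def corrected_difficulty(observed_error_rate, guess_rate=0.25):
    """Guessing-adjusted error rate for a question.

    Model: observed_correct = guess + (1 - guess) * true_correct,
    so     true_correct = (observed_correct - guess) / (1 - guess)."""
    observed_correct = 1.0 - observed_error_rate
    true_correct = max(0.0, (observed_correct - guess_rate) / (1.0 - guess_rate))
    return 1.0 - true_correct

# A 50% observed error rate on a 4-option choice question implies a
# ~67% "true" error rate once the guess floor is removed; a fill-in
# question with the same observed rate needs no correction.
d_choice = corrected_difficulty(0.5, guess_rate=0.25)
d_fill = corrected_difficulty(0.5, guess_rate=0.0)
```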
## Pitfall 12: Mastery Mapping Without Ground Truth

What goes wrong: The project plans to map student mastery (0-1 continuous value from `MasteryCalculator`) to difficulty_category (0-4 discrete levels). Without empirical validation, the mapping will be based on intuition rather than evidence.
Prevention: Derive the mapping from historical data -- for each mastery band, find the difficulty_category that produced the desired success rate on past exams -- rather than hand-picking thresholds.
Phase: Phase 3 (mastery-to-difficulty recommendation).
## Pitfall 13: Feedback Loop Divergence

What goes wrong: Once calibrated difficulty drives exam assembly, the calibration also absorbs the outcomes of those exams. If the calibration overestimates difficulty (marks questions as harder than they are), the system assigns them to higher difficulty_category exams where students are stronger. Stronger students answer correctly, causing the calibration to further lower the difficulty. The system oscillates.
Prevention: Track calibrated-difficulty drift over time and periodically re-anchor a sample of questions via expert review; longer term, add student ability as a covariate so the calibration can separate "harder question" from "stronger cohort".
Phase: Phase 2 (monitor after wiring), Phase 3 (add student ability as covariate).
## Pitfall 14: JSON Bloat

What goes wrong: The `algorithm_meta` JSON column stores `recent_events` (up to 30 events per question). With thousands of calibrated questions, this column grows rapidly. Each online update reads and re-writes the full JSON. Over months, this table becomes the largest in the database by storage, and queries slow down.
Prevention: Move recent_events to a separate table (one row per event) or cap the JSON size more aggressively. Consider dropping event-level detail after 30 days and keeping only aggregated metrics.
Phase: Phase 2 (before heavy production use of online mode).
## Pitfall 15: Race Condition

What goes wrong: Two simultaneous grading events for the same question (e.g., two students submit papers at the same time) both read the same prev_difficulty from the calibration table, compute their updates independently, and the second write overwrites the first. The net effect is that one update is lost.
The current upsert on question_bank_id is atomic at the row level, but the read-compute-write cycle in updateOnlineFromPaper() is NOT atomic. Between reading existing (line 116) and writing upserts (line 212), another process can update the same row.
Prevention: Use SELECT ... FOR UPDATE or database-level locks when reading existing calibration data for questions that are about to be updated. Alternatively, use an incremental approach: UPDATE ... SET weighted_attempts = weighted_attempts + X, weighted_wrong = weighted_wrong + Y instead of computing the new values in PHP.
Phase: Phase 2 (before production deployment of online mode at scale).
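The incremental-update alternative can be demonstrated with any SQL backend (here SQLite for a self-contained sketch; table and column names are illustrative). The arithmetic happens inside the UPDATE statement, so no stale previously-read value is ever written back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE calibration (
    question_bank_id INTEGER PRIMARY KEY,
    weighted_attempts REAL NOT NULL DEFAULT 0,
    weighted_wrong REAL NOT NULL DEFAULT 0)""")
conn.execute("INSERT INTO calibration (question_bank_id) VALUES (42)")

def record_outcome(question_id, weight, wrong):
    # Atomic at the statement level: counters accumulate in the database
    # instead of being recomputed from a read-then-write cycle in PHP.
    conn.execute(
        """UPDATE calibration
           SET weighted_attempts = weighted_attempts + ?,
               weighted_wrong = weighted_wrong + ?
           WHERE question_bank_id = ?""",
        (weight, weight if wrong else 0.0, question_id))

# Two "simultaneous" grading events; neither update is lost.
record_outcome(42, 1.0, wrong=True)
record_outcome(42, 1.0, wrong=False)
row = conn.execute(
    "SELECT weighted_attempts, weighted_wrong FROM calibration "
    "WHERE question_bank_id = 42").fetchone()
```

The derived difficulty (a function of the two counters) can then be recomputed from the accumulated values whenever it is read, instead of being stored from a racy computation.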
## Pitfall 16: Default 0.5 Contamination

What goes wrong: Questions without a set difficulty default to 0.5 in `hydrateQuestions()` (line 723: `'difficulty' => isset($question['difficulty']) ? (float) $question['difficulty'] : 0.5`). When these questions enter calibration, original_difficulty becomes 0.5. With Bayesian shrinkage toward 0.5 (the Beta(2,2) prior mode), these questions' calibrated difficulty will be pulled toward 0.5 regardless of actual difficulty.
Prevention: Distinguish between "explicitly set to 0.5" and "unset, defaulted to 0.5." Only apply shrinkage toward the prior for questions where the original difficulty was explicitly set. For unset questions, use the empirical error rate directly (with wider confidence intervals).
Phase: Phase 1 (data audit should count questions with default difficulty).
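The interaction can be sketched with a beta-binomial posterior mean (a Beta(2,2) prior contributes 4 pseudo-observations with mean 0.5; the `explicitly_set` flag is the proposed fix, not an existing field):

```python
def shrunk_error_rate(wrong, attempts, prior_mean=0.5, prior_strength=4.0,
                      explicitly_set=True):
    """Posterior-mean error rate under a beta prior.

    For questions whose difficulty was never explicitly set, skip the
    prior pull and trust the empirical rate (with the understanding
    that its confidence interval is wider)."""
    if not explicitly_set:
        return wrong / attempts if attempts else None
    return (wrong + prior_mean * prior_strength) / (attempts + prior_strength)

# 9 wrong out of 10: the prior drags an explicitly-set question toward
# 0.5, but a defaulted-to-0.5 question keeps its empirical 0.9.
d_set = shrunk_error_rate(9, 10, explicitly_set=True)     # 11/14 ~ 0.786
d_unset = shrunk_error_rate(9, 10, explicitly_set=False)  # 0.9
```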
## Summary

| Phase Topic | Likely Pitfall | Mitigation | Severity |
|---|---|---|---|
| Calibration validation | Circular validation (Pitfall 1) | Temporal train/test split | CRITICAL |
| Data audit | Dual difficulty standard (Pitfall 2) | One-time normalization + flag column | CRITICAL |
| Data audit | Low sample size (Pitfall 3) | Report sample distribution; set minimum threshold | CRITICAL |
| Algorithm audit | Baseline self-reference (Pitfall 4) | Leave-one-out baselines; audit stratum sizes | HIGH |
| Algorithm audit | Time decay appropriateness (Pitfall 7) | Compare with/without decay in validation | HIGH |
| Algorithm audit | Health monitor degeneracy (Pitfall 8) | Raise floor to 0.6; add reset mechanism | HIGH |
| Pipeline wiring | No A/B testing (Pitfall 5) | Feature flag per assembly type; monitor metrics | CRITICAL |
| Pipeline wiring | Boundary effects (Pitfall 6) | Soft boundaries or buffer zones | HIGH |
| Pipeline wiring | Mode inconsistency (Pitfall 9) | Align thresholds between batch and online | MEDIUM |
| Pipeline wiring | Pool exhaustion (Pitfall 10) | Pre-analyze coverage; log exhaustion events | MEDIUM |
| Pipeline wiring | Race condition (Pitfall 15) | Row-level locks or incremental updates | MEDIUM |
| Difficulty recommendation | Type heterogeneity (Pitfall 11) | Guessing correction for choice questions | MEDIUM |
| Difficulty recommendation | Mastery mapping without ground truth (Pitfall 12) | Use historical data to find optimal mapping | HIGH |
| Ongoing operations | Feedback loop divergence (Pitfall 13) | Track drift; periodic expert anchoring | HIGH |
| Ongoing operations | JSON bloat (Pitfall 14) | Separate events table or aggressive capping | LOW |
| Data quality | Default 0.5 contamination (Pitfall 16) | Distinguish set vs unset difficulty | LOW |
First step: audit the `questions.difficulty` distribution for dual-standard evidence.

Key files: `QuestionDifficultyCalibrationService.php`, `QuestionDifficultyResolver.php`, `DifficultyDistributionService.php`, `IntelligentExamController.php`, `LearningAnalyticsService.php`