Domain: Educational assessment -- difficulty calibration validation, pipeline wiring, adaptive exam assembly
Researched: 2026-04-16
Confidence: HIGH (codebase analysis + established IRT/calibration domain knowledge)
Mistakes that cause rewrites, invalid results, or systemic failure in production.
## Pitfall 1: Circular Validation

What goes wrong: Running `recalibrateQuestionIds()` on historical data, then computing Brier score / error rate against those same `paper_questions` rows, produces artificially good metrics. The algorithm was trained on this data; of course it fits.
Why it happens: The existing AnalyzeQuestionDifficultyCalibrationCommand reports calibration_gap = bank_difficulty_normalized - empirical_error_rate. This measures agreement between original difficulty and observed rates, NOT whether the calibrated difficulty predicts unseen data. The existing report gives a false sense of validation.
Consequences: Calibrated difficulty goes into production, but on new student data the predictions are no better (or worse) than the original difficulty. The entire calibration effort is wasted, and worse, it introduces noise into the exam assembly pipeline.
Prevention: Use a temporal train/test split. Calibrate only on responses before a cutoff date, then score the calibrated difficulty's predictions (Brier score, error rate) exclusively on responses after the cutoff, and compare against the original difficulty's predictions on the same held-out window.
Warning signs: Calibration metrics look suspiciously good, and the evaluation dataset overlaps the data fed to `recalibrateQuestionIds()`.
Phase: Phase 1 (Calibration Validation) -- this is the single most important thing to get right.
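The temporal split is the core fix here. A minimal sketch in plain Python (not the project's PHP; `brier`, `temporal_split`, and the record fields are hypothetical): calibrate only on responses before a cutoff date, and score predictions only on responses after it.

```python
from datetime import date

def brier(pred_error_rates, outcomes):
    """Mean squared gap between predicted error probability and the
    observed outcome (1 = wrong, 0 = correct). Lower is better."""
    pairs = list(zip(pred_error_rates, outcomes))
    return sum((p - o) ** 2 for p, o in pairs) / len(pairs)

def temporal_split(responses, cutoff):
    """Responses before the cutoff feed recalibration; responses on or
    after it are held out strictly for evaluation."""
    train = [r for r in responses if r["answered_at"] < cutoff]
    test = [r for r in responses if r["answered_at"] >= cutoff]
    return train, test

# Toy data: predicted error probability vs. what actually happened.
responses = [
    {"answered_at": date(2025, 11, 1), "pred": 0.4, "wrong": 1},
    {"answered_at": date(2025, 12, 1), "pred": 0.4, "wrong": 0},
    {"answered_at": date(2026, 2, 1), "pred": 0.4, "wrong": 1},
    {"answered_at": date(2026, 3, 1), "pred": 0.4, "wrong": 0},
]
train, test = temporal_split(responses, date(2026, 1, 1))
# Only this held-out score says anything about predictive power; an
# in-sample score on `train` will always flatter the calibration.
holdout_brier = brier([r["pred"] for r in test], [r["wrong"] for r in test])
```

Run the same held-out scoring for the original (uncalibrated) difficulty; the calibration is validated only if it beats that baseline on the post-cutoff window.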
## Pitfall 2: Dual Difficulty Standard

What goes wrong: The codebase has two difficulty scales. `questions.difficulty` stores raw values that may be 0-1 OR 0-5 depending on when and how the question was created. The `normalizeDifficultyValue()` method in `QuestionDifficultyCalibrationService` tries to detect this (line 948: `if ($raw > 1.0) { $raw = $raw / 5.0; }`), but this heuristic is fragile and wrong in specific cases:
- The `> 1.0` heuristic is unreliable for distinguishing scales: a legitimate 0-5 value at or below 1.0 passes through unscaled.

Why it happens: Legacy data entry used different conventions at different times. No migration normalized the raw values when the system was built.
Consequences: Values normalized on the wrong scale corrupt the original difficulty, so every downstream quantity -- calibration gaps, shrinkage priors, bucket assignments -- is computed against a meaningless baseline.
Prevention:
- Audit the `questions.difficulty` distribution. If there is a bimodal distribution (cluster near 0-1 AND cluster near 0-5), the dual-standard problem exists.
- Add a `difficulty_scale` column or flag to the `questions` table indicating which scale was used, OR do a one-time migration to normalize all values to 0-1.
- Add a date-based fallback to `normalizeDifficultyValue()`: if the question was created before a known cutoff date, assume the 0-5 scale.

Warning signs:
- `questions.difficulty` shows values like 2.0, 3.5, 4.0 (clearly 0-5 scale) mixed with 0.1, 0.5, 0.8.
- `DifficultyDistributionService` bucket counts look wrong: almost all questions in one bucket.

Phase: Phase 1 (must resolve BEFORE validation can be trusted).
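The bimodality audit can be made concrete with a simple split (an illustrative sketch; treating any value above 1.0 as 0-5 evidence is itself the fragile assumption under discussion):

```python
def audit_difficulty_scales(values):
    """Split a difficulty column into candidate 0-1 and 0-5 clusters.
    Both clusters being non-empty is evidence of the dual standard."""
    low = [v for v in values if 0.0 <= v <= 1.0]
    high = [v for v in values if v > 1.0]
    return {
        "count_0_1": len(low),
        "count_above_1": len(high),
        "dual_standard_suspected": bool(low) and bool(high),
    }

# Values like 2.0/3.5/4.0 mixed with 0.1/0.5/0.8 trip the check.
report = audit_difficulty_scales([0.1, 0.5, 0.8, 2.0, 3.5, 4.0])
```

Note the blind spot: a genuine 0-5 value of 1.0 or less lands in the "low" cluster, which is why a creation-date flag or scale column is still needed to resolve individual rows.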
## Pitfall 3: Low Sample Size

What goes wrong: The algorithm has a minimum `weighted_attempts` threshold of 8 for batch mode (line 658) and uses Bayesian shrinkage, but the online mode has NO minimum -- it always updates. Questions with 1-2 responses get a calibrated difficulty that is essentially random noise, and the shrinkage prior (the original difficulty) may itself be wrong (see Pitfall 2). The system then treats this calibrated value as authoritative.
Why it happens: Each new grading event triggers updateOnlineFromPaper(), which updates calibrated difficulty for every question on that paper regardless of sample size. The adaptive step limiting (maxStep) mitigates large jumps, but cumulative small biases from low-sample updates still accumulate.
Consequences: Noise from low-sample questions flows straight into exam assembly; a question seen only a handful of times can swing between difficulty buckets on every update.
Prevention:
- Add a `sample_confidence` field to the calibration table: `sqrt(weighted_attempts / 80)`. Use this downstream.
- In `QuestionDifficultyResolver::applyCalibratedDifficulty()`, only apply calibrated difficulty when `weighted_attempts >= min_threshold` (suggest 15-20 for production use). Fall back to original difficulty otherwise.

Warning signs:
- The average `weighted_attempts` across calibrated questions is below 10.
- Many calibrated questions have `attempts < 5`.

Phase: Phase 1 (validation must include sample size audit), Phase 2 (resolver must enforce minimum sample threshold).
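The suggested resolver gating might look like this (a sketch: the threshold, the 80-attempt normalizer, and the function names are taken from the prevention notes above, not from the actual `QuestionDifficultyResolver` API):

```python
import math

MIN_WEIGHTED_ATTEMPTS = 15  # suggested 15-20 for production

def sample_confidence(weighted_attempts, full_at=80.0):
    """sqrt(weighted_attempts / 80), capped at 1.0 -- the proposed
    downstream confidence signal."""
    return min(1.0, math.sqrt(weighted_attempts / full_at))

def resolve_difficulty(original, calibrated, weighted_attempts):
    """Only trust the calibrated value once the sample clears the
    minimum; otherwise fall back to the original difficulty."""
    if calibrated is not None and weighted_attempts >= MIN_WEIGHTED_ATTEMPTS:
        return calibrated
    return original

low_sample = resolve_difficulty(0.5, 0.8, weighted_attempts=3)    # falls back
high_sample = resolve_difficulty(0.5, 0.8, weighted_attempts=20)  # trusted
```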
## Pitfall 4: Baseline Self-Reference

What goes wrong: `buildGlobalBaselines()` computes expected error rates by question_type x difficulty_category from ALL `paper_questions` data. Then `estimateByStratifiedResidual()` computes `residual = observed_error_rate - baseline_error_rate` for individual questions. But the baseline was computed FROM those same questions' responses. This creates a regression-to-the-mean artifact: each question's responses pull its own stratum baseline toward its observed rate, biasing residuals toward zero (most severely in small strata).
Why it happens: Proper IRT uses student ability estimates as a conditioning variable (given student ability of theta, what is the probability of error?). This system uses difficulty_category as a proxy for student ability, but difficulty_category is assigned to the PAPER, not the student, and is itself potentially miscalibrated.
Consequences: Questions in thin strata get residuals that reflect their own weight in the baseline rather than genuine miscalibration, so adjustments are systematically damped toward the stratum mean.
Prevention: Compute leave-one-out baselines (exclude the target question's responses when computing its stratum baseline), and audit stratum sizes -- flag any stratum where a single question contributes a large fraction of the responses.
Warning signs: Strata containing only a few questions; residuals that shrink as a question's share of its stratum grows.
Phase: Phase 1 (audit baseline computation), Phase 2 (consider alternative stratification).
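The leave-one-out fix can be sketched as follows (data shapes are hypothetical; the point is that question q's own responses never enter q's baseline):

```python
def loo_baseline(stratum_stats, question_id):
    """Leave-one-out baseline error rate for one question.

    stratum_stats: {question_id: (wrong_count, attempt_count)} for a
    single question_type x difficulty_category stratum."""
    wrong = sum(w for qid, (w, n) in stratum_stats.items() if qid != question_id)
    attempts = sum(n for qid, (w, n) in stratum_stats.items() if qid != question_id)
    return wrong / attempts if attempts else None

stats = {"q1": (30, 100), "q2": (10, 100), "q3": (20, 100)}
# Baseline for q1 ignores q1's own 30/100, so q1 cannot drag its own
# reference point toward its observed rate.
b1 = loo_baseline(stats, "q1")   # (10 + 20) / 200 = 0.15
residual_q1 = 30 / 100 - b1      # 0.30 - 0.15 = 0.15
```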
## Pitfall 5: No A/B Testing

What goes wrong: After validation, flipping `enable_difficulty_distribution` to true and having `QuestionDifficultyResolver` override all difficulty values in one deployment. If calibration is systematically biased in any direction, ALL exams immediately get worse.
Why it happens: The existing code path in LearningAnalyticsService has enable_difficulty_distribution as a boolean flag checked per-request. The natural "fix" is to set it to true globally. But the code has multiple assembly paths (diagnostic, practice, mistake, textbook, knowledge_points), each with different fallback behavior.
Consequences:
- A systematic bias degrades every exam type at once, with no control group against which to detect it.
- Some assembly paths may not use `QuestionDifficultyResolver` at all, creating inconsistency between exam types.

Prevention:
- Roll out behind a feature flag per assembly type (diagnostic, practice, mistake, textbook, knowledge_points), enabling one path at a time.
- Record which difficulty source was used per exam and monitor outcome metrics per path before widening the rollout.

Warning signs:
- No `difficulty_source` in assembled exam metadata.
- `enable_difficulty_distribution` is a single boolean controlling all paths.

Phase: Phase 2 (pipeline wiring), requires feature flags and monitoring.
## Pitfall 6: Boundary Effects

What goes wrong: `DifficultyDistributionService::classifyQuestionByDifficulty()` uses strict boundary comparisons. A question with difficulty exactly 0.25 falls into different buckets depending on category:
- `difficulty >= 0.25 && difficulty <= 0.5` goes to `primary_medium`
- `difficulty >= 0 && difficulty <= 0.25` goes to `primary_medium`

But a question at 0.25001 goes to `primary_high` for category 2. This means tiny calibration changes (0.001) can shift questions between buckets, causing the assembled exam's difficulty profile to be dramatically different.
Why it happens: Boundary values are hard-coded without hysteresis or soft boundaries. The calibrated difficulty is stored with 4 decimal places, making boundary crossings likely.
Consequences:
- `getSupplementOrder()` fallback logic kicks in differently depending on boundary crossings, introducing non-deterministic exam composition.

Prevention:
- Use soft boundaries or buffer zones: keep a question in its current bucket unless the calibrated value moves past the boundary by a margin (hysteresis).

Warning signs:
- `groupQuestionsByDifficultyRange()` returns very uneven bucket sizes.

Phase: Phase 2 (when wiring distribution service), Phase 3 (when tuning difficulty_category recommendation).
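A sketch of the buffer-zone idea (boundaries, buffer width, and bucket indices are illustrative, not the actual `DifficultyDistributionService` thresholds): a question changes bucket only when it clears the boundary by more than the buffer.

```python
BOUNDARIES = [0.25, 0.5, 0.75]  # bucket edges on the 0-1 scale
BUFFER = 0.02                   # hysteresis margin around each edge

def bucket_of(difficulty):
    """Strict bucket index: how many edges the value has passed."""
    return sum(1 for b in BOUNDARIES if difficulty > b)

def bucket_with_hysteresis(difficulty, previous_bucket):
    """Keep the previous bucket unless the value is clearly past the
    boundary separating the old and new buckets."""
    raw = bucket_of(difficulty)
    if raw == previous_bucket:
        return previous_bucket
    boundary = BOUNDARIES[min(raw, previous_bucket)]
    if abs(difficulty - boundary) > BUFFER:
        return raw
    return previous_bucket

# A 0.001 calibration wobble across 0.25 no longer flips the bucket:
stays = bucket_with_hysteresis(0.2501, previous_bucket=0)  # stays in 0
moves = bucket_with_hysteresis(0.30, previous_bucket=0)    # genuinely moved
```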
## Pitfall 7: Time Decay Appropriateness

What goes wrong: The 45-day half-life decay means responses older than ~6 months retain under 7% weight (four half-lives). For K12 math, this creates a systematic bias: questions that were easier in earlier grades (when students were learning the concept) appear harder than they really are for the current cohort, because the only remaining data is from students who struggled recently.
Why it happens: The algorithm is designed for "dynamic" difficulty that responds to recent trends. But K12 math content has strong grade-level alignment -- a question appropriate for Grade 7 will ALWAYS be answered by Grade 7 students. The time decay doesn't differentiate between "the question got harder" and "we're seeing different students."
Consequences: Calibrated difficulty drifts to reflect whichever cohort answered most recently, rather than a stable property of the question.
Prevention: During validation, run the calibration both with and without time decay and compare out-of-sample predictions; keep decay only if it measurably improves them.
Warning signs: Calibrated difficulty for stable, grade-aligned questions drifts from term to term even though the question has not changed.
Phase: Phase 1 (validate with and without time decay to see which produces better out-of-sample predictions).
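Assuming a plain exponential half-life (the project's exact weighting may differ), the decay curve makes the data-vanishing effect easy to see:

```python
def decay_weight(age_days, half_life_days=45.0):
    """Response weight halves every `half_life_days` days."""
    return 0.5 ** (age_days / half_life_days)

w_fresh = decay_weight(0)    # 1.0
w_90 = decay_weight(90)      # 0.25 after two half-lives
w_180 = decay_weight(180)    # 0.0625 after four half-lives (~6 months)

# For the Phase 1 comparison: rerun the calibration with decay disabled
# and compare out-of-sample Brier scores against the decayed version.
no_decay = decay_weight(180, half_life_days=float("inf"))  # 1.0
```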
## Pitfall 8: Health Monitor Degeneracy

What goes wrong: The `getHealthScaleForType()` method reduces the step size when recent Brier scores are worsening. But if the INITIAL calibration was wrong (bad original difficulty), every correction attempt registers as a worsening, and the monitor keeps shrinking `health_scale` (multiplying the step by 0.78 or 0.82). This creates a death spiral where the system becomes too cautious to ever self-correct.
Why it happens: The health monitor compares brier_after vs brier_before per event. If the calibration is already wrong, both "before" and "after" are bad, but "after" can be slightly worse due to noise. The cumulative delta being positive triggers the 0.78 multiplier.
Consequences: Badly calibrated questions freeze at a wrong value; the step size becomes too small for the system to ever walk back to the truth.
Prevention: Raise the health_scale floor to 0.6 so steps never become negligible, and add a reset mechanism that restores full step size after a cooldown or a sustained run of fresh events.
Warning signs: health_scale pinned at its floor for many questions; cumulative Brier deltas hovering near zero while calibration gaps stay large.
Phase: Phase 1 (audit health monitor behavior during validation), Phase 2 (tune parameters before production).
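A sketch of the prevention ideas (the 0.78 multiplier comes from the description above; the 0.6 floor, the reset trigger, and everything else are assumptions, not the actual monitor logic):

```python
HEALTH_FLOOR = 0.6        # raised floor: steps never become negligible
RESET_AFTER_EVENTS = 50   # periodic reset so self-correction stays possible

def update_health(health, brier_delta, events_since_reset):
    """Shrink the health multiplier when predictions got worse
    (positive cumulative Brier delta), but never below the floor,
    and reset to full strength after enough fresh events."""
    if events_since_reset >= RESET_AFTER_EVENTS:
        return 1.0
    if brier_delta > 0:       # predictions worsened on this event
        health *= 0.78
    return max(HEALTH_FLOOR, health)

h = 1.0
for _ in range(10):           # ten worsening events in a row
    h = update_health(h, brier_delta=0.01, events_since_reset=0)
# The floor stops the death spiral: h bottoms out at 0.6 instead of ~0.08.
```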
## Pitfall 9: Mode Inconsistency (Batch vs Online)

What goes wrong: The batch mode (`estimateByStratifiedResidual`) and online mode (`estimateOnlineBySingleOutcome`) use different logic:
- Batch mode requires 8+ weighted attempts before making any adjustment.
- Online mode caps each update at `maxStep = 0.30 * (0.35 + 0.65 * confidence) * healthScale`. At `weighted_attempts = 1`, confidence = 0.0125, giving maxStep ~ 0.11 -- a non-trivial adjustment from a single data point.

A question that gets batch-recalibrated with 7 responses gets NO adjustment. The same question that gets 7 online updates gets 7 incremental adjustments. The final calibrated difficulty can differ significantly.
Prevention: Align the minimum sample thresholds between modes. Either both modes should require 8+ weighted attempts before any adjustment, or both should allow incremental updates. Document which mode is the source of truth.
Phase: Phase 1 (resolve during validation -- both modes should produce similar results on the same data).
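Plugging numbers into the quoted cap formula (assuming confidence = weighted_attempts / 80, which matches the 0.0125 figure above) makes the asymmetry concrete:

```python
def max_step(weighted_attempts, health_scale=1.0):
    """The online-mode step cap quoted above; the linear confidence
    term (weighted_attempts / 80) is an assumption consistent with
    confidence = 0.0125 at a single attempt."""
    confidence = min(1.0, weighted_attempts / 80.0)
    return 0.30 * (0.35 + 0.65 * confidence) * health_scale

step_1 = max_step(1)    # ~0.107: a sizable cap from ONE response
step_7 = max_step(7)    # ~0.122 per update, applied 7 times online --
                        # while batch mode makes NO adjustment at 7
step_80 = max_step(80)  # 0.30 at full confidence
```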
## Pitfall 10: Pool Exhaustion

What goes wrong: `DifficultyDistributionService` defines narrow ranges for each category. Category 0 requires 90% of questions with difficulty 0-0.1. If the question pool for a given knowledge point + question_type has very few questions in that range, the assembly either:
- falls back to `getSupplementOrder()` and fills with "other" bucket questions (defeating the difficulty targeting), or
- returns an exam with fewer questions than requested.

Prevention: Before enabling difficulty distribution, analyze the question pool by knowledge point to verify sufficient coverage at each difficulty level. If coverage is sparse, either widen the acceptable ranges for thin pools or hold off enabling the feature for those knowledge points until more questions exist.
Phase: Phase 2 (before enabling enable_difficulty_distribution by default).
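A pre-flight coverage check can be very simple (bucket edges, field names, and the minimum count here are hypothetical):

```python
from collections import Counter

def bucket(difficulty, edges=(0.1, 0.25, 0.5, 0.75)):
    """Index of the difficulty bucket (0 = easiest)."""
    return sum(1 for e in edges if difficulty > e)

def sparse_cells(questions, min_per_bucket=5, n_buckets=5):
    """(knowledge_point, bucket) cells with too few questions to
    satisfy difficulty-targeted assembly."""
    counts = Counter((q["kp"], bucket(q["difficulty"])) for q in questions)
    kps = {q["kp"] for q in questions}
    return sorted(
        (kp, b)
        for kp in kps
        for b in range(n_buckets)
        if counts[(kp, b)] < min_per_bucket
    )

# A pool concentrated in the easiest bucket leaves four empty cells.
pool = [{"kp": "fractions", "difficulty": 0.05} for _ in range(6)]
gaps = sparse_cells(pool)
```

Running this per knowledge point before flipping the flag turns "silent fallback to other buckets" into an explicit, reviewable list of gaps.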
## Pitfall 11: Question-Type Heterogeneity

What goes wrong: Choice questions (multiple choice) have a baseline ~25% correct-by-guessing rate. Fill-in-the-blank and open-ended questions have no guessing bonus. The calibration algorithm treats "is_correct" the same across all types. A choice question with calibrated difficulty 0.5 does NOT have the same "true difficulty" as a fill-in question with calibrated difficulty 0.5.
Prevention: Apply a guessing correction for choice questions (rescale the observed correct rate by the ~25% guess floor), or calibrate and compare difficulty within question type rather than across types.
Phase: Phase 3 (when building the mastery-to-difficulty_category recommendation).
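The classical correction-for-guessing model (not something currently in the codebase) shows how far apart "same observed rate" can be across types:

```python
def corrected_difficulty(observed_error_rate, guess_rate=0.25):
    """Guessing-adjusted error rate for a question.

    Model: observed_correct = guess + (1 - guess) * true_correct,
    so     true_correct = (observed_correct - guess) / (1 - guess)."""
    observed_correct = 1.0 - observed_error_rate
    true_correct = max(0.0, (observed_correct - guess_rate) / (1.0 - guess_rate))
    return 1.0 - true_correct

# A 50% observed error rate on a 4-option choice question implies a
# ~67% "true" error rate once the guess floor is removed; a fill-in
# question with the same observed rate needs no correction.
d_choice = corrected_difficulty(0.5, guess_rate=0.25)
d_fill = corrected_difficulty(0.5, guess_rate=0.0)
```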
## Pitfall 12: Mastery Mapping Without Ground Truth

What goes wrong: The project plans to map student mastery (0-1 continuous value from `MasteryCalculator`) to difficulty_category (0-4 discrete levels). Without empirical validation, the mapping will be based on intuition rather than evidence.
Prevention: Derive the mapping from historical data -- for each mastery band, find the difficulty_category that produced the desired success rate on past exams -- rather than hand-picking thresholds.
Phase: Phase 3 (mastery-to-difficulty recommendation).
## Pitfall 13: Feedback Loop Divergence

What goes wrong: Once calibrated difficulty drives exam assembly, the calibration also absorbs the outcomes of those exams. If the calibration overestimates difficulty (marks questions as harder than they are), the system assigns them to higher difficulty_category exams where students are stronger. Stronger students answer correctly, causing the calibration to further lower the difficulty. The system oscillates.
Prevention: Track calibrated-difficulty drift over time and periodically re-anchor a sample of questions via expert review; longer term, add student ability as a covariate so the calibration can separate "harder question" from "stronger cohort".
Phase: Phase 2 (monitor after wiring), Phase 3 (add student ability as covariate).
## Pitfall 14: JSON Bloat

What goes wrong: The `algorithm_meta` JSON column stores `recent_events` (up to 30 events per question). With thousands of calibrated questions, this column grows rapidly. Each online update reads and re-writes the full JSON. Over months, this table becomes the largest in the database by storage, and queries slow down.
Prevention: Move recent_events to a separate table (one row per event) or cap the JSON size more aggressively. Consider dropping event-level detail after 30 days and keeping only aggregated metrics.
Phase: Phase 2 (before heavy production use of online mode).
## Pitfall 15: Race Condition

What goes wrong: Two simultaneous grading events for the same question (e.g., two students submit papers at the same time) both read the same prev_difficulty from the calibration table, compute their updates independently, and the second write overwrites the first. The net effect is that one update is lost.
The current upsert on question_bank_id is atomic at the row level, but the read-compute-write cycle in updateOnlineFromPaper() is NOT atomic. Between reading existing (line 116) and writing upserts (line 212), another process can update the same row.
Prevention: Use SELECT ... FOR UPDATE or database-level locks when reading existing calibration data for questions that are about to be updated. Alternatively, use an incremental approach: UPDATE ... SET weighted_attempts = weighted_attempts + X, weighted_wrong = weighted_wrong + Y instead of computing the new values in PHP.
Phase: Phase 2 (before production deployment of online mode at scale).
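The incremental-update alternative can be demonstrated with any SQL backend (here SQLite for a self-contained sketch; table and column names are illustrative). The arithmetic happens inside the UPDATE statement, so no stale previously-read value is ever written back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE calibration (
    question_bank_id INTEGER PRIMARY KEY,
    weighted_attempts REAL NOT NULL DEFAULT 0,
    weighted_wrong REAL NOT NULL DEFAULT 0)""")
conn.execute("INSERT INTO calibration (question_bank_id) VALUES (42)")

def record_outcome(question_id, weight, wrong):
    # Atomic at the statement level: counters accumulate in the database
    # instead of being recomputed from a read-then-write cycle in PHP.
    conn.execute(
        """UPDATE calibration
           SET weighted_attempts = weighted_attempts + ?,
               weighted_wrong = weighted_wrong + ?
           WHERE question_bank_id = ?""",
        (weight, weight if wrong else 0.0, question_id))

# Two "simultaneous" grading events; neither update is lost.
record_outcome(42, 1.0, wrong=True)
record_outcome(42, 1.0, wrong=False)
row = conn.execute(
    "SELECT weighted_attempts, weighted_wrong FROM calibration "
    "WHERE question_bank_id = 42").fetchone()
```

The derived difficulty (a function of the two counters) can then be recomputed from the accumulated values whenever it is read, instead of being stored from a racy computation.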
## Pitfall 16: Default 0.5 Contamination

What goes wrong: Questions without a set difficulty default to 0.5 in `hydrateQuestions()` (line 723: `'difficulty' => isset($question['difficulty']) ? (float) $question['difficulty'] : 0.5`). When these questions enter calibration, original_difficulty becomes 0.5. With Bayesian shrinkage toward 0.5 (the Beta(2,2) prior mode), these questions' calibrated difficulty will be pulled toward 0.5 regardless of actual difficulty.
Prevention: Distinguish between "explicitly set to 0.5" and "unset, defaulted to 0.5." Only apply shrinkage toward the prior for questions where the original difficulty was explicitly set. For unset questions, use the empirical error rate directly (with wider confidence intervals).
Phase: Phase 1 (data audit should count questions with default difficulty).
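The interaction can be sketched with a beta-binomial posterior mean (a Beta(2,2) prior contributes 4 pseudo-observations with mean 0.5; the `explicitly_set` flag is the proposed fix, not an existing field):

```python
def shrunk_error_rate(wrong, attempts, prior_mean=0.5, prior_strength=4.0,
                      explicitly_set=True):
    """Posterior-mean error rate under a beta prior.

    For questions whose difficulty was never explicitly set, skip the
    prior pull and trust the empirical rate (with the understanding
    that its confidence interval is wider)."""
    if not explicitly_set:
        return wrong / attempts if attempts else None
    return (wrong + prior_mean * prior_strength) / (attempts + prior_strength)

# 9 wrong out of 10: the prior drags an explicitly-set question toward
# 0.5, but a defaulted-to-0.5 question keeps its empirical 0.9.
d_set = shrunk_error_rate(9, 10, explicitly_set=True)     # 11/14 ~ 0.786
d_unset = shrunk_error_rate(9, 10, explicitly_set=False)  # 0.9
```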
## Summary

| Phase Topic | Likely Pitfall | Mitigation | Severity |
|---|---|---|---|
| Calibration validation | Circular validation (Pitfall 1) | Temporal train/test split | CRITICAL |
| Data audit | Dual difficulty standard (Pitfall 2) | One-time normalization + flag column | CRITICAL |
| Data audit | Low sample size (Pitfall 3) | Report sample distribution; set minimum threshold | CRITICAL |
| Algorithm audit | Baseline self-reference (Pitfall 4) | Leave-one-out baselines; audit stratum sizes | HIGH |
| Algorithm audit | Time decay appropriateness (Pitfall 7) | Compare with/without decay in validation | HIGH |
| Algorithm audit | Health monitor degeneracy (Pitfall 8) | Raise floor to 0.6; add reset mechanism | HIGH |
| Pipeline wiring | No A/B testing (Pitfall 5) | Feature flag per assembly type; monitor metrics | CRITICAL |
| Pipeline wiring | Boundary effects (Pitfall 6) | Soft boundaries or buffer zones | HIGH |
| Pipeline wiring | Mode inconsistency (Pitfall 9) | Align thresholds between batch and online | MEDIUM |
| Pipeline wiring | Pool exhaustion (Pitfall 10) | Pre-analyze coverage; log exhaustion events | MEDIUM |
| Pipeline wiring | Race condition (Pitfall 15) | Row-level locks or incremental updates | MEDIUM |
| Difficulty recommendation | Type heterogeneity (Pitfall 11) | Guessing correction for choice questions | MEDIUM |
| Difficulty recommendation | Mastery mapping without ground truth (Pitfall 12) | Use historical data to find optimal mapping | HIGH |
| Ongoing operations | Feedback loop divergence (Pitfall 13) | Track drift; periodic expert anchoring | HIGH |
| Ongoing operations | JSON bloat (Pitfall 14) | Separate events table or aggressive capping | LOW |
| Data quality | Default 0.5 contamination (Pitfall 16) | Distinguish set vs unset difficulty | LOW |
First step: audit the `questions.difficulty` distribution for dual-standard evidence.

Key files: `QuestionDifficultyCalibrationService.php`, `QuestionDifficultyResolver.php`, `DifficultyDistributionService.php`, `IntelligentExamController.php`, `LearningAnalyticsService.php`