Domain: K12 math difficulty calibration and intelligent exam matching
Researched: 2026-04-16
Confidence: HIGH (codebase analysis); MEDIUM (platform feature comparisons based on web research)
Features the system must have. Without these, difficulty calibration is meaningless and exam assembly produces poor matches.
| # | Feature | Why Expected | Complexity | Notes |
|---|---|---|---|---|
| TS-1 | Historical data backtesting with temporal split | Every validated adaptive system (ALEKS, Khan Academy, IXL) validates its parameter estimation against held-out data before production use. Without this, the entire calibration pipeline is an unverified hypothesis. | Medium | QuestionDifficultyCalibrationAnalyzer already computes per-question stats and Pearson correlation. Needs: temporal train/test split, aggregated Brier score, MAE, PASS/FAIL gate. The data (paper_questions, questions, papers) already exists. |
| TS-2 | Difficulty scale normalization (0-1 vs 0-5 unification) | The codebase has normalizeDifficultyValue() in QuestionDifficultyCalibrationService for calibration input, but LearningAnalyticsService loads raw questions.difficulty without normalization during assembly. This is a correctness bug, not a nice-to-have: a value of 0.4 on the 0-5 scale denotes a very easy question (0.08 normalized), but read as a 0-1 value it becomes 40% difficulty, a fivefold distortion. | Low | Extract and centralize the existing normalization logic. Apply it at the question-loading boundary in LearningAnalyticsService. One new service class, no algorithm changes. |
| TS-3 | Calibrated difficulty used in exam assembly | The calibration algorithm (stratified_residual_eb_v2) runs on every grading event (updateOnlineFromPaper), writes to question_difficulty_calibrations, and QuestionDifficultyResolver.applyCalibratedDifficulty() exists -- but it is not called in the main assembly path (LearningAnalyticsService.generateIntelligentExam). The calibration loop is complete but disconnected. | Low | Wire the existing QuestionDifficultyResolver call into LearningAnalyticsService.selectQuestions(). The resolver already handles the fallback chain (calibrated > normalized original). Requires the TS-1 PASS gate to be open first. |
| TS-4 | Difficulty distribution enabled by default | enable_difficulty_distribution defaults to false in LearningAnalyticsService (line 1554); only ExamTypeStrategy sets it to true, so the main API path never activates distribution. Without it, all questions are selected by raw difficulty matching, bypassing the tiered low/medium/high bucket strategy. | Low | Change the default to true once TS-1 and TS-2 are complete. The DifficultyDistributionService logic is fully implemented and tested. |
| TS-5 | Calibration coverage metrics | Teachers and admins need to know what percentage of questions have calibrated values. ALEKS reports item parameter coverage; IXL shows diagnostic completeness. If only 20% of questions have calibration data, the system should not claim to use calibrated difficulty. | Low | Count question_difficulty_calibrations rows vs total active questions. Add to the existing AnalyzeQuestionDifficultyCalibrationCommand output. No new infrastructure needed. |
| TS-6 | Per-question calibration confidence indicator | IRT systems report a standard error of estimation per item. The existing algorithm_meta stores recent_events with per-event Brier/log-loss, and the analyzer computes calibration_effective_attempts (a time-decay-weighted sample size). These need to surface as a single per-question confidence metric. | Low | Derive from existing data: effective_attempts >= 10 = HIGH confidence, 5-9 = MEDIUM, < 5 = LOW. Already computed in the analyzer; it only needs a reusable service method and API exposure. |
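The TS-1 gate and the TS-2 normalization boundary reduce to a small amount of arithmetic once the answer events are loaded. A minimal Python sketch of that logic (the production code is a PHP/Laravel service, so this is illustrative only; the `backtest` signature, the scale-detection heuristic, and the PASS/FAIL thresholds are assumptions, not the project's real gate values):

```python
from collections import defaultdict
from statistics import mean

def normalize_difficulty(value, scale_max=5.0):
    """TS-2 sketch: map a raw difficulty onto [0, 1].

    Values already in [0, 1] pass through; anything larger is assumed
    to be on the 0-5 scale and rescaled. Heuristic only -- the real
    normalizeDifficultyValue() may rely on explicit scale metadata.
    """
    if 0.0 <= value <= 1.0:
        return value
    return max(0.0, min(1.0, value / scale_max))

def backtest(events, predicted, split_ts, pass_brier=0.25, pass_mae=0.15):
    """TS-1 sketch: score predictions on a temporal hold-out.

    events:    (timestamp, question_id, was_wrong) tuples
    predicted: question_id -> predicted p(wrong), i.e. calibrated difficulty
    Events at or after split_ts form the test slice; earlier events
    would feed the calibrator. Gate thresholds here are illustrative.
    """
    held_out = [(q, y) for ts, q, y in events if ts >= split_ts and q in predicted]
    # Event-level Brier score: mean squared gap between p(wrong) and outcome.
    brier = mean((predicted[q] - y) ** 2 for q, y in held_out)
    # Question-level MAE: predicted difficulty vs observed wrong-rate.
    outcomes = defaultdict(list)
    for q, y in held_out:
        outcomes[q].append(y)
    mae = mean(abs(predicted[q] - mean(ys)) for q, ys in outcomes.items())
    verdict = "PASS" if brier <= pass_brier and mae <= pass_mae else "FAIL"
    return {"brier": round(brier, 4), "mae": round(mae, 4), "verdict": verdict}
```

The same aggregation could back the AnalyzeQuestionDifficultyCalibrationCommand output once a temporal split is added.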
Features that set this platform apart from basic question banks. These create the "answer -> calibrate -> precise exam -> re-answer" closed loop.
| # | Feature | Value Proposition | Complexity | Notes |
|---|---|---|---|---|
| D-1 | Mastery-based difficulty category auto-recommendation | Currently difficulty_category (0-4) is passed from the external API call with no validation against student level. ALEKS achieves 90%+ success on "ready-to-learn" concepts by matching difficulty to learner state. Auto-recommending the category from mastery spares the teacher from guessing which tier to assign. | Medium | New DifficultyCategoryRecommender service. Maps mastery (from MasteryCalculator, already per-knowledge-point) to a category via thresholds. Must handle multi-kp exams by weighting weaker knowledge points more heavily. Depends on TS-3 (calibrated difficulty in assembly). |
| D-2 | Shadow mode comparison logging | Before fully activating calibrated difficulty, run both paths (raw vs calibrated) in parallel and log comparisons. This is standard in production ML systems. Reveals real-world impact of calibration without risking exam quality. | Low | Log raw vs calibrated average difficulty, delta distribution, coverage percentage. No behavioral change. Can activate immediately after TS-1 passes. |
| D-3 | Calibration health monitoring dashboard | The existing getHealthScaleForType() monitors Brier/log-loss delta and auto-scales step size (0.45-1.0x), but this data is only in logs. A dashboard showing: overall health scale, per-type Brier trend, coverage percentage, drift alerts, top outlier questions. IXL's "Trouble Spots" report is analogous for student-level analytics. |
Medium | New calibration_health_snapshots table (daily aggregation). Scheduled artisan command to compute daily. Admin panel to visualize. The computation logic already exists in getHealthScaleForType(). |
| D-4 | Calibration drift detection and alerting | Over time, question difficulty can shift (curriculum changes, student population changes). The existing algorithm has time decay (45-day half-life) but no explicit drift detection. Detect when calibrated values systematically diverge from recent observations. | Medium | Compare rolling 7-day vs 30-day average Brier score. If 7-day is significantly worse (>20% increase), flag as drift. The data is already in algorithm_meta.recent_events. Needs: scheduled check, notification mechanism. |
| D-5 | Outlier quarantine with human review | When a question's calibrated difficulty differs from original by more than a threshold (e.g., delta > 0.30), quarantine it for human review rather than silently using the extreme value. This catches edge cases: wrong answer keys, ambiguous questions, data entry errors. | Low | CalibrationVerificationGate checks delta after each calibration update. Flagged questions still get calibrated values but are marked for review. Admin UI shows quarantine list. |
| D-6 | Per-knowledge-point calibration accuracy breakdown | The existing analyzer computes Pearson correlation globally. Breaking this down by knowledge point reveals where calibration works well and where it fails. Some knowledge points may have too few questions or too little variance for calibration to be meaningful. | Medium | Extend QuestionDifficultyCalibrationAnalyzer to group by kp_code. Requires joining questions -> knowledge_points which already exists. Surface in CLI report and API. |
| D-7 | Student-facing difficulty confidence indicator | Show students a visual indicator of how well-matched the exam is to their level. Analogous to Khan Academy's mastery progress bars. "This exam is calibrated for your current level" vs "This exam covers new difficulty territory." | Low | Simple badge based on: was calibrated difficulty used? (difficulty_source='calibrated') AND coverage > 60%? Client-side rendering, backend just provides metadata. |
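The drift rule in D-4 and the quarantine threshold in D-5 are both simple comparisons over data the system already stores (algorithm_meta.recent_events and question_difficulty_calibrations). A hedged Python sketch of the two checks (function names are hypothetical; the 1.20 ratio encodes the ">20% increase" rule from the table and the 0.30 delta is the example threshold):

```python
from statistics import mean

DAY = 86_400  # seconds per day

def detect_drift(brier_events, now, drift_ratio=1.20):
    """D-4 sketch: flag drift when the rolling 7-day mean Brier score
    is more than 20% worse than the 30-day mean.

    brier_events: (timestamp, brier) pairs, e.g. extracted from
    algorithm_meta.recent_events. Tune drift_ratio via backtesting.
    """
    last_7 = [b for ts, b in brier_events if now - ts <= 7 * DAY]
    last_30 = [b for ts, b in brier_events if now - ts <= 30 * DAY]
    if not last_7 or not last_30:
        return False  # too little recent data to judge drift
    return mean(last_7) > drift_ratio * mean(last_30)

def needs_quarantine(original, calibrated, max_delta=0.30):
    """D-5 sketch: flag a question for human review when calibration
    moves its difficulty by more than max_delta on the 0-1 scale."""
    return abs(calibrated - original) > max_delta
```

Both fit naturally in a scheduled command: run the drift check per question type daily, and the quarantine check after each calibration update.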
Features to explicitly NOT build. These would harm the system or waste effort.
| # | Anti-Feature | Why Avoid | What to Do Instead |
|---|---|---|---|
| AF-1 | Write calibrated values back to questions.difficulty | Destroys the immutable original reference value, makes debugging impossible, and violates the explicit project constraint. It also creates data-integrity risk: if calibration goes wrong, the original is gone. | Keep the dual-table design. questions.difficulty stays read-only; question_difficulty_calibrations is the overlay; QuestionDifficultyResolver handles priority. |
| AF-2 | Full IRT 3PL model with discrimination and guessing parameters | The system has a working algorithm (stratified_residual_eb_v2) designed for this specific use case. Switching to 3PL would require (a) reimplementing the entire calibration engine, (b) far more data per question to estimate three parameters, and (c) a guessing parameter that is meaningless for K12 math, where students rarely guess systematically. | Validate the existing algorithm first (TS-1). Only consider algorithm changes if backtesting reveals systematic failure. The existing stratified baseline + residual + Bayesian shrinkage is well-suited. |
| AF-3 | Real-time adaptive testing (CAT) within a single exam | CAT requires: calibrated item pool (we are building this), real-time ability estimation during exam, item selection algorithm, exposure control. This is a fundamentally different product (computer-based testing) from the current paper/worksheet assembly model. Massive scope expansion. | Focus on accurate pre-exam difficulty matching. The exam is assembled once, not adapted mid-session. The "adaptive" element is between exams (mastery changes -> different difficulty category next time). |
| AF-4 | Global difficulty category per student | A student may be category 2 (intermediate) in algebra but category 0 (zero-foundation) in geometry. Assigning one global category produces mismatched exams -- too hard in geometry, too easy in algebra. | Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation (D-1). Aggregate with weakness weighting for multi-kp exams. |
| AF-5 | Calibration without minimum sample size enforcement | Calibrating a question from 1-2 answers produces wildly unreliable estimates. The algorithm already has SHRINKAGE_M0_MIN = 8.0 as a prior strength, but the assembly path should not use calibrated values for questions below a sample threshold. | Enforce minimum effective attempts (e.g., >= 5) before using a calibrated value in assembly; below the threshold, fall back to the normalized original. QuestionDifficultyResolver already has the data to implement this. |
| AF-6 | Automatic algorithm parameter tuning | Automatically adjusting alpha, max_step, and half_life_days based on backtest results sounds appealing, but it risks overfitting to historical data and removes human oversight from a critical algorithmic decision. | Provide backtest reports with parameter sensitivity analysis (run the backtest at 3-5 parameter settings) and let humans make the final tuning decision. |
| AF-7 | Student-level difficulty prediction (this student will get this question wrong) | The calibration system predicts population-level difficulty (what fraction of students will answer incorrectly). Individual prediction requires a student ability model (like IRT theta), which is a separate system. Conflating the two produces unreliable results. | Use mastery (from MasteryCalculator) for individual-level assessment. Use calibration for question-level difficulty estimation. Combine them at exam assembly time, not at prediction time. |
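AF-5's minimum-sample rule and the TS-6 confidence tiers combine naturally in the resolver's fallback chain. A Python sketch of that combined logic (the real QuestionDifficultyResolver API may differ; this signature is hypothetical, and the thresholds come straight from the tables above):

```python
def confidence_tier(effective_attempts):
    """TS-6 sketch: tiers from the table
    (>= 10 -> HIGH, 5-9 -> MEDIUM, < 5 -> LOW)."""
    if effective_attempts >= 10:
        return "HIGH"
    if effective_attempts >= 5:
        return "MEDIUM"
    return "LOW"

def resolve_difficulty(original, calibrated=None, effective_attempts=0.0,
                       min_attempts=5.0, scale_max=5.0):
    """Fallback-chain sketch (TS-3 + AF-5): the calibrated value wins
    only when it exists AND carries enough time-decayed sample weight;
    otherwise fall back to the normalized original. Returns
    (difficulty, source) so callers can log difficulty_source.
    """
    if calibrated is not None and effective_attempts >= min_attempts:
        return calibrated, "calibrated"
    # Normalize 0-5 originals onto the 0-1 scale (TS-2 boundary rule).
    if 0.0 <= original <= 1.0:
        return original, "original_normalized"
    return max(0.0, min(1.0, original / scale_max)), "original_normalized"
```

Returning the source alongside the value is what lets D-2's shadow-mode logging and D-7's student-facing badge work without extra queries.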
```
TS-1 (Backtesting)
  |
  +---> TS-3 (Calibrated difficulty in assembly) [requires PASS gate]
  |       |
  |       +---> TS-4 (Distribution enabled by default) [safe after calibrated assembly]
  |       |
  |       +---> D-1 (Auto difficulty category) [needs calibrated assembly working]
  |       |
  |       +---> D-7 (Student confidence indicator) [needs calibrated assembly]
  |
  +---> D-2 (Shadow mode) [can run in parallel with TS-3]

TS-2 (Scale normalization)
  |
  +---> TS-3 (Calibrated difficulty in assembly) [needs consistent scale]

TS-5 (Coverage metrics) [independent]
TS-6 (Confidence indicator) [independent, uses existing data]
D-3 (Health dashboard) ---> D-4 (Drift detection) [dashboard provides baseline]
D-5 (Outlier quarantine) [independent, can build anytime]
D-6 (Per-kp breakdown) [independent, extends existing analyzer]
```
Critical path: TS-1 and TS-2 (independent) -> TS-3 -> TS-4 -> D-1
This is the minimum path to the closed loop described in the project vision ("answer -> calibrate -> precise exam -> re-answer").
Phase 1 -- Validation and Foundation (must ship first)
Phase 2 -- Production Integration (ship after Phase 1 PASS)
Phase 3 -- Adaptive Intelligence (ship after Phase 2 stable)
Defer:
getHealthScaleForType() provides the raw data; a dashboard is a presentation layer.

How this system's features compare to established platforms:
| Feature | ALEKS | IXL | Khan Academy | This System |
|---|---|---|---|---|
| Difficulty calibration method | Knowledge Space Theory (probabilistic knowledge states) | Proprietary adaptive algorithm | IRT-based mastery | Stratified residual empirical Bayes |
| Validation approach | Billions of data points, research papers | Real-time diagnostic validation | A/B testing, mastery prediction accuracy | Backtesting (to build) |
| Difficulty granularity | Per-topic per-student | Per-skill per-student | Per-exercise per-student | Per-question (calibrated), per-kp per-student (mastery) |
| Adaptive exam assembly | Full CAT (selects next item based on current ability estimate) | Adaptive practice (not exam) | Mastery-based progression | Static assembly with difficulty distribution (pre-exam) |
| Health monitoring | Implicit (built into KST algorithm) | Real-Time Diagnostic, Trouble Spots | Internal accuracy metrics | Health scale (exists), Dashboard (to build) |
| Student-facing feedback | "Ready to learn" / "Not ready" indicators | Diagnostic strand analysis, progress reports | Mastery progress bars, streak counters | Calibration badge (to build) |
| Admin/teacher analytics | Learning progress, knowledge pie chart | Live Classroom, Trouble Spots, reports | Coach dashboard, assignment analytics | CLI report (exists), Dashboard (to build) |
Key insight: This system's unique advantage is the dual-difficulty model (original + calibrated overlay) with an explicit validation gate. No other platform exposes calibration confidence or separates original vs calibrated difficulty. This transparency is a differentiator for institutional customers who need to audit and trust the algorithm.
QuestionDifficultyCalibrationService (974 lines), QuestionDifficultyCalibrationAnalyzer, QuestionDifficultyResolver, DifficultyDistributionService, MasteryCalculator, IntelligentExamController, ExamAnswerAnalysisService, LearningAnalyticsService (HIGH confidence)