
# Technology Stack

Project: Math CMS — Difficulty Calibration & Intelligent Exam
Researched: 2026-04-16

## Recommended Stack

### Core Framework (existing — no changes)

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| Laravel Framework | ^12.0 | Backend application framework | Already in production; the entire codebase is built on it. No reason to change. |
| PHP | ^8.2 | Runtime | Already in production. Supports enums, readonly properties, named arguments, fibers — all useful for the statistical code. |
| MySQL | existing | Primary database | Stores `questions`, `papers`, `paper_questions`, `question_difficulty_calibrations`, `mistake_records`. All calibration data lives here. |
| Redis + Predis | existing | Cache + queue | Used for baseline caching (`$this->baselineCache`) and as the async job queue for `AssembleExamTaskJob`. |

### Statistical Validation — Primary Addition

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| markrogoyski/math-php | ^2.13 | Pure-PHP math/statistics library | The only viable PHP-native option. Provides correlation, significance testing, probability distributions (Beta, Normal, Binomial, Student's t), descriptive statistics, and ANOVA. No external dependencies. Required for Brier score decomposition, confidence intervals, and backtest validation. |

Confidence: HIGH — Verified on Packagist (v2.13.0, actively maintained, 3k+ GitHub stars, pure PHP, no C extensions).

#### What math-php provides that the project needs

  1. Correlation & Significance Testing

    • Correlation::pearson() — validate calibrated difficulty vs empirical error rate (the analyzer already has a custom Pearson implementation; math-php's version is battle-tested and produces p-values)
    • Significance::rCritical() / Significance::tpValue() — determine whether observed correlations are statistically significant, not just numerically large
    • Correlation::spearman() — rank-based correlation, more robust for ordinal difficulty categories (0-4) than Pearson
  2. Descriptive Statistics

    • Descriptive::standardDeviation(), Descriptive::mean(), Descriptive::median(), Descriptive::interquartileRange() — for bin analysis and distribution sanity checks
    • Descriptive::coefficientOfVariation() — compare calibration stability across question types
  3. Probability Distributions

    • BetaDistribution — directly supports the Beta(2,2) prior used in stratified_residual_eb_v2; enables computing credible intervals around calibrated difficulty
    • NormalDistribution — for confidence interval construction around error rates
    • BinomialDistribution — for computing exact probability of observed correct/wrong counts given hypothesized difficulty
  4. ANOVA

    • ANOVA::oneWay() — test whether calibration deltas differ significantly across question types or difficulty categories
  5. Regression

    • LinearRegression::create() — for fitting calibration quality over time (is the system getting better?)
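
A minimal usage sketch of the calls above, against the math-php v2.x API; the input arrays are invented for illustration:

```php
<?php
// Sketch only: illustrates the math-php calls listed above on made-up data.

require 'vendor/autoload.php';

use MathPHP\Statistics\ANOVA;
use MathPHP\Statistics\Average;
use MathPHP\Statistics\Correlation;
use MathPHP\Statistics\Descriptive;
use MathPHP\Statistics\Regression\Linear;

$calibrated = [0.30, 0.45, 0.55, 0.70, 0.80]; // calibrated difficulty per question
$empirical  = [0.28, 0.50, 0.52, 0.75, 0.78]; // observed error rate per question

$pearson  = Correlation::r($calibrated, $empirical);            // Pearson r
$spearman = Correlation::spearmansRho($calibrated, $empirical); // rank correlation

$mean = Average::mean($empirical);
$sd   = Descriptive::standardDeviation($empirical);

// One-way ANOVA across three (invented) groups of calibration deltas
$anova = ANOVA::oneWay([0.01, 0.02, 0.03], [0.05, 0.04, 0.06], [0.02, 0.01, 0.02]);

// Linear trend of Brier score over time: points are [week, brier]
$trend = new Linear([[1, 0.22], [2, 0.20], [3, 0.19], [4, 0.17]]);
$slope = $trend->getParameters()['m']; // negative slope means Brier is improving
```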

### Testing Infrastructure (existing — extend)

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| PHPUnit | ^11.5.3 | Test framework | Already in require-dev. Extend with calibration validation tests. |
| Mockery | ^1.6 | Test mocking | Already in require-dev. For mocking DB queries in unit tests. |

### Backtesting Infrastructure (new — build internally)

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| Custom Artisan command | new | `questions:difficulty-backtest` | Historical replay of the calibration algorithm. Split answer data chronologically, run the algorithm on the first N%, measure prediction accuracy on the remaining data. No external library — this is domain-specific and should leverage the existing `QuestionDifficultyCalibrationAnalyzer`. |
| Custom PHPUnit test suite | new | `tests/Feature/DifficultyCalibrationBacktestTest.php` | Automated regression tests for calibration accuracy. Run in CI to catch algorithm regressions. |
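
A sketch of what such a regression test could look like. The `App\Services\DifficultyBacktester` service and its report keys are hypothetical placeholders; only the asserted thresholds come from the validation criteria in Q2 below:

```php
<?php
// Sketch only: DifficultyBacktester and the report-array keys are hypothetical.

namespace Tests\Feature;

use Tests\TestCase;

class DifficultyCalibrationBacktestTest extends TestCase
{
    public function test_calibration_beats_original_difficulty(): void
    {
        // Hypothetical service wrapping the questions:difficulty-backtest logic
        $report = app(\App\Services\DifficultyBacktester::class)
            ->run(trainRatio: 0.7, minAttempts: 10);

        $this->assertGreaterThan(0.05, $report['brier_skill_score']);
        $this->assertGreaterThan(0.3, $report['pearson_correlation']);
        $this->assertLessThan(0.05, abs($report['calibration_in_the_large']));
    }
}
```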

## Alternatives Considered

### Statistical Libraries

| Category | Recommended | Alternative | Why Not |
| --- | --- | --- | --- |
| PHP math library | math-php ^2.13 | Write custom implementations | The codebase already has a custom `pearsonCorrelation()` in the analyzer. math-php is more reliable, handles edge cases (zero-division, small samples), and provides p-values. Rewriting statistical functions is error-prone and wastes time. |
| PHP math library | math-php ^2.13 | phpscience/statistics | Stale, fewer features, lower community adoption. math-php has 10x the feature coverage. |
| PHP math library | math-php ^2.13 | Call Python/R via shell | Adds a runtime dependency on an external language, introduces serialization overhead and deployment complexity. The math needed here is not computationally heavy — pure PHP is sufficient. |
| PHP math library | math-php ^2.13 | Call an external stats API | Unnecessary network dependency for what are fundamentally simple statistical computations. Adds latency and failure modes. |

### IRT / CAT Libraries

| Category | Decision | Alternative | Why Not |
| --- | --- | --- | --- |
| Full IRT framework | Do not adopt | `irt` (R), `py-irt` (Python) | The project uses a custom `stratified_residual_eb_v2` algorithm, not standard IRT. Adopting a full IRT framework would mean rewriting the calibration pipeline. The current algorithm works — validate it, don't replace it. |
| CAT engine | Do not adopt now | `catsim` (Python), `mirtCAT` (R) | The exam assembly pipeline (`IntelligentExamController`) already has a working strategy-based approach. CAT requires a fundamentally different architecture (sequential item selection with real-time ability estimation). This is a future consideration, not current scope. |
| Adaptive testing | Defer to Phase 3+ | Standard CAT algorithms | The current priority is validating calibration and wiring it into the existing assembly pipeline. Adaptive testing is the natural evolution but depends on validated calibration first. |

### Backtesting Approaches

| Category | Recommended | Alternative | Why Not |
| --- | --- | --- | --- |
| Chronological split | Build in-house | k-fold cross-validation | Time-series data (student answers) has temporal dependencies; random shuffling leaks future information. A chronological split respects the time-dependent nature of calibration updates. |
| Walk-forward validation | Build in-house | Single train/test split | Walk-forward better simulates the real online update pattern (`updateOnlineFromPaper`). The algorithm updates incrementally per grading event — a single split misses this dynamic. |
| Brier score as primary metric | Use existing + extend | MSE / RMSE | The Brier score is already implemented in the calibration service (`buildUpdateEvent`). It is the proper scoring rule for probabilistic predictions (difficulty = predicted error probability). MSE treats difficulty as a point estimate, losing the probabilistic interpretation. |
| Brier score decomposition | Use math-php | Manual calculation | Decomposition into uncertainty, reliability, and resolution requires non-trivial binning and statistics. math-php provides the building blocks; build the decomposition logic on top. |

## Installation

```bash
# Core statistical library — only new dependency
composer require markrogoyski/math-php:^2.13

# Dev dependencies (already installed, no changes needed)
# phpunit/phpunit ^11.5.3
# mockery/mockery ^1.6
```

## How Each Research Question Is Addressed

### Q1: Statistical Validation Frameworks for IRT and Difficulty Calibration

Recommendation: Use math-php for validation metrics, not a full IRT framework.

The existing stratified_residual_eb_v2 algorithm is a proprietary hybrid (not standard 1PL/2PL/3PL IRT). Validation should focus on:

  1. Calibration Accuracy Metrics (all computable with math-php + existing code):

    • Brier Score — already implemented in buildUpdateEvent(). Extend with decomposition: Brier = Uncertainty - Resolution + Reliability. Lower reliability = better calibrated. Higher resolution = more discriminating.
    • Pearson correlation (calibrated difficulty vs empirical error rate) — already in the analyzer. Add a significance check via math-php's Significance class (z- and t-tests).
    • Spearman rank correlation — add via math-php Correlation::spearmansRho(). More appropriate for the 5-level difficulty categories.
    • Calibration-in-the-large — compare mean predicted difficulty to mean observed error rate. Simple but catches systematic bias.
    • Calibration curves — bin predictions into deciles, plot predicted vs observed error rate. Visual diagnostic built into the backtest command.
  2. Confidence Intervals (new, using math-php; see the sketch after this list):

    • Beta posterior credible interval around each calibrated difficulty value — Continuous\Beta with parameters (weighted_wrong + 2, weighted_attempts - weighted_wrong + 2)
    • If the credible interval is wide (low sample), the calibration should carry a confidence flag that the exam assembly pipeline can use to fall back to the original difficulty
  3. Health Monitoring (already partially implemented):

    • getHealthScaleForType() monitors recent Brier score and log-loss trends
    • Extend with formal statistical process control: track Brier score over rolling windows, flag when it exceeds 2 standard deviations from historical mean
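
A minimal sketch of the credible-interval check from item 2. The weighted counts are invented; the Beta(2,2) prior and posterior parameterization come from the list above, `inverse()` is math-php's numerical inverse CDF, and the 0.25 width threshold is purely illustrative:

```php
<?php
// Sketch only: invented counts, illustrative threshold.

require 'vendor/autoload.php';

use MathPHP\Probability\Distribution\Continuous\Beta;

$weightedWrong    = 14.0;
$weightedAttempts = 40.0;

// Posterior: Beta(weighted_wrong + 2, weighted_attempts - weighted_wrong + 2)
$posterior = new Beta($weightedWrong + 2, $weightedAttempts - $weightedWrong + 2);

// 95% credible interval bounds via the inverse CDF
$lower = $posterior->inverse(0.025);
$upper = $posterior->inverse(0.975);

// A wide interval means a low-sample, low-confidence calibration; flag it so
// exam assembly can fall back to the original difficulty.
$lowConfidence = ($upper - $lower) > 0.25;
```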

Confidence: HIGH — These are standard psychometric validation techniques. math-php covers the computational needs. The existing codebase already implements the data pipeline.

### Q2: Backtesting Approaches

Recommendation: Chronological walk-forward backtesting using existing data tables.

The system has historical answer data in paper_questions (with is_correct, graded_at) and papers (with difficulty_category). This is sufficient for backtesting.

Approach — Walk-Forward Validation:

  1. Data preparation: Query all graded paper_questions ordered by graded_at ascending
  2. Temporal split: Use first 70% chronologically as calibration training data, last 30% as holdout
  3. Replay: Run estimateByStratifiedResidual() on training data to produce calibrated difficulties
  4. Evaluate: For each holdout answer, compute Brier score using the calibrated difficulty as predicted error probability
  5. Baseline comparison: Also compute Brier score using original questions.difficulty on the same holdout set
  6. Report: Brier Skill Score = (Brier_original - Brier_calibrated) / Brier_original. Positive = calibration improved predictions. Negative = calibration made things worse.
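
A sketch of steps 4 to 6. Each holdout row pairs the predicted error probability with the observed outcome (1 = wrong, 0 = correct); the row shape and all numbers are invented for illustration:

```php
<?php
// Sketch only: Brier score on a holdout set plus the Brier Skill Score.

function brierScore(array $rows): float
{
    $sum = 0.0;
    foreach ($rows as [$predicted, $observed]) {
        $sum += ($predicted - $observed) ** 2;
    }

    return $sum / count($rows);
}

// Same holdout answers scored with calibrated vs. original difficulty
$brierCalibrated = brierScore([[0.35, 0], [0.80, 1], [0.80, 1], [0.20, 0]]);
$brierOriginal   = brierScore([[0.50, 0], [0.50, 1], [0.50, 1], [0.50, 0]]);

// Brier Skill Score: positive means calibration improved predictions
$skill = ($brierOriginal - $brierCalibrated) / $brierOriginal;
```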

Implementation plan:

  • New Artisan command questions:difficulty-backtest with options for --train-ratio, --min-attempts, --since
  • Reuses existing QuestionDifficultyCalibrationAnalyzer for per-question aggregation
  • Adds math-php Descriptive statistics for computing Brier components, confidence intervals
  • Outputs comparison table + JSON report
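
A skeleton of the proposed command under those assumptions; the option names come from the plan above, while the `handle()` body is a placeholder, not the real backtest implementation:

```php
<?php
// Sketch only: command name and options from this document, body is stubbed.

namespace App\Console\Commands;

use Illuminate\Console\Command;

class DifficultyBacktest extends Command
{
    protected $signature = 'questions:difficulty-backtest
                            {--train-ratio=0.7 : Chronological share used for calibration}
                            {--min-attempts=10 : Minimum graded attempts per question}
                            {--since= : Only consider answers graded after this date}';

    protected $description = 'Walk-forward backtest of difficulty calibration';

    public function handle(): int
    {
        $trainRatio = (float) $this->option('train-ratio');

        // 1) load graded paper_questions ordered by graded_at
        // 2) split chronologically, replay calibration, score the holdout
        // 3) print comparison table and write the JSON report

        $this->info("Backtesting with train ratio {$trainRatio}...");

        return self::SUCCESS;
    }
}
```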

Key validation criteria:

  • Brier Skill Score > 0.05 (calibration provides meaningful improvement)
  • Pearson correlation of calibrated difficulty vs empirical error rate > 0.3 (moderate positive relationship)
  • No systematic bias: calibration-in-the-large difference < 0.05
  • Calibration improves for questions with >= 10 attempts (minimum sample for meaningful update)

Confidence: HIGH — The data exists in the database. The algorithm is already implemented. This is purely a validation harness.

### Q3: Adaptive Testing Algorithms for Future Phases

Recommendation: Do not implement CAT now. Understand the landscape for Phase 3+ planning.

The current exam assembly pipeline uses a strategy-based approach: IntelligentExamController creates an AssembleExamTaskJob which selects questions based on knowledge points, chapters, difficulty categories, and distribution rules. This is a fixed-form assembly approach — all questions are determined before the student starts.

Standard CAT algorithms for reference (future use only):

  1. Item Selection: Maximum Fisher Information — select the item that provides the most information about the student's ability at the current theta estimate. Requires IRT item parameters (a, b, c for the 3PL model). The current stratified_residual_eb_v2 difficulty could serve as the b-parameter (see the sketch after this list).

  2. Ability Estimation:

    • MLE (Maximum Likelihood Estimation) — no prior, can diverge with all-correct/all-wrong patterns
    • EAP (Expected A Posteriori) — Bayesian, uses prior distribution. More stable. The existing Beta(2,2) prior in the calibration algorithm is conceptually similar.
    • MAP (Maximum A Posteriori) — similar to EAP but uses mode instead of mean
  3. Termination Criteria:

    • Standard Error threshold (stop when ability estimate precision is sufficient)
    • Fixed-length (current approach — stop after N questions)
    • SPRT (Sequential Probability Ratio Test) — stop when enough evidence accumulated to classify into mastery category
  4. Exposure Control:

    • Randomesque — randomly select from K best items instead of the single best
    • Sympson-Hetter — probabilistically control whether a selected item is actually administered
    • Shadow testing — ensure remaining pool can still produce a valid exam
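
For reference only, a sketch of the maximum-information selection named in item 1, using the standard 3PL item information formula I(theta) = a^2 * (Q/P) * ((P - c) / (1 - c))^2; the item parameters and theta value are invented:

```php
<?php
// Sketch only: standard 3PL formulas on invented item parameters.

// 3PL response probability: guessing floor c plus a logistic in (theta - b)
function threePlProbability(float $theta, float $a, float $b, float $c): float
{
    return $c + (1 - $c) / (1 + exp(-$a * ($theta - $b)));
}

// Fisher information of one item at ability theta
function itemInformation(float $theta, float $a, float $b, float $c): float
{
    $p = threePlProbability($theta, $a, $b, $c);
    $q = 1 - $p;

    return $a ** 2 * ($q / $p) * (($p - $c) / (1 - $c)) ** 2;
}

// Pick the most informative item at the current ability estimate; the
// calibrated difficulty could serve as $b, as noted above.
$items = [['a' => 1.2, 'b' => -0.5, 'c' => 0.20], ['a' => 0.9, 'b' => 0.3, 'c' => 0.25]];
$theta = 0.1;

usort($items, fn ($x, $y) => itemInformation($theta, $y['a'], $y['b'], $y['c'])
                         <=> itemInformation($theta, $x['a'], $x['b'], $x['c']));
$best = $items[0];
```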

Why not CAT now: CAT requires (1) validated item parameters, (2) real-time ability estimation, (3) sequential item selection. The current system has none of these. Validating calibration first (Phase 1) gives us (1). Then connecting calibration to exam assembly (Phase 2) improves fixed-form quality. CAT (Phase 3+) would require architectural changes to the exam flow.

Confidence: MEDIUM — CAT theory is well-established in psychometric literature. Application to this specific K12 math context would need adaptation. The assessment is based on training data + Wikipedia verification of CAT components.

### Q4: PHP/Laravel Libraries for Statistical Analysis

Recommendation: math-php ^2.13 as the sole addition.

| Need | Solution | Source / Effort |
| --- | --- | --- |
| Pearson/Spearman correlation | math-php `Correlation` | Built-in |
| P-values / significance | math-php `Significance` | Built-in |
| Beta distribution (for credible intervals) | math-php `Continuous\Beta` | Built-in |
| Normal distribution (for confidence intervals) | math-php `Continuous\Normal` | Built-in |
| Binomial distribution (for exact probability tests) | math-php `Discrete\Binomial` | Built-in |
| Descriptive statistics | math-php `Descriptive` / `Average` | Built-in |
| ANOVA (category comparison) | math-php `ANOVA` | Built-in |
| Linear regression (trend analysis) | math-php `Regression\Linear` | Built-in |
| Brier score computation | Custom code (trivial: mean((predicted - observed)^2)) | ~5 lines |
| Brier score decomposition | Custom code using math-php `Descriptive` | ~30 lines |
| Walk-forward backtest engine | Custom Artisan command | ~200 lines |
| Calibration curve generation | Custom code in backtest command | ~50 lines |
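
As an example of the custom pieces, a sketch of calibration-curve generation: decile binning of predicted error probability versus observed error rate, using the same invented (predicted, observed) row convention as the Q2 sketch. A well-calibrated model gives mean_predicted close to observed_rate in every bin:

```php
<?php
// Sketch only: rows pair predicted error probability with observed outcome.

function calibrationCurve(array $rows, int $bins = 10): array
{
    $curve = array_fill(0, $bins, ['predicted' => [], 'observed' => []]);

    // Assign each prediction to its decile bin
    foreach ($rows as [$predicted, $observed]) {
        $i = min($bins - 1, (int) floor($predicted * $bins));
        $curve[$i]['predicted'][] = $predicted;
        $curve[$i]['observed'][]  = $observed;
    }

    // Per bin: mean predicted error probability vs. observed error rate
    return array_map(
        fn ($bin) => $bin['predicted'] === [] ? null : [
            'mean_predicted' => array_sum($bin['predicted']) / count($bin['predicted']),
            'observed_rate'  => array_sum($bin['observed']) / count($bin['observed']),
        ],
        $curve,
    );
}
```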

No other PHP libraries needed. The statistical requirements of this project are well within math-php's capabilities. Adding more libraries increases dependency surface area without meaningful benefit.

## What NOT to Use

| Technology | Why Not |
| --- | --- |
| Full IRT packages (R `ltm`, Python `py-irt`) | Would require rewriting the custom calibration algorithm. The algorithm works — validate it, don't replace it. |
| Machine learning frameworks (TensorFlow PHP, PHP-ML) | Overkill for what is fundamentally a statistical estimation problem. PHP-ML has limited statistical tools and poorer documentation than math-php. |
| External statistical services / APIs | Adds network dependency, latency, and deployment complexity for computations that take microseconds locally. |
| Custom C extensions for math | math-php is pure PHP with no dependencies. Performance is adequate for the data volumes in K12 education (thousands to tens of thousands of answer records, not millions). |
| `catsim` or any CAT library | Wrong phase. The system needs validated calibration before adaptive testing makes sense. |
| phpscience/statistics | Stale, fewer features, less maintained than math-php. |
| Any Python/R bridge | Adds deployment complexity. The math is simple enough for PHP. |

## Version Pinning Strategy

```bash
composer require markrogoyski/math-php:^2.13
```

Use a caret (^) version constraint. math-php follows semantic versioning — ^2.13 allows 2.13.x and 2.14+ but not 3.0. The library has been stable across 2.x with no breaking changes in minor versions.

## Migration Path

  1. Phase 1 (Validation): Install math-php. Build backtest command. Validate calibration accuracy. No changes to production code path.
  2. Phase 2 (Integration): Wire validated calibration into exam assembly via existing QuestionDifficultyResolver. Still using math-php only for monitoring/reporting.
  3. Phase 3 (Intelligence): Add mastery-based difficulty recommendation. Potentially introduce simplified adaptive strategies. math-php supports the statistical needs throughout.

## Sources

  • markrogoyski/math-php v2.13.0 on Packagist: https://packagist.org/packages/markrogoyski/math-php — HIGH confidence, verified directly
  • IRT theory: Wikipedia Item Response Theory article — HIGH confidence, well-established psychometric theory
  • CAT algorithms: Wikipedia Computerized Adaptive Testing article — HIGH confidence, standard reference
  • Brier Score: Wikipedia Brier Score article — HIGH confidence, proper scoring rule definition and decomposition
  • Existing codebase analysis: QuestionDifficultyCalibrationService.php (973 lines), QuestionDifficultyCalibrationAnalyzer.php (608 lines), IntelligentExamController.php (1267 lines), DifficultyDistributionService.php (219 lines), QuestionDifficultyResolver.php (88 lines) — HIGH confidence, direct code inspection