
# Technology Stack

Project: Math CMS — Difficulty Calibration & Intelligent Exam
Researched: 2026-04-16

## Recommended Stack

### Core Framework (existing — no changes)

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| Laravel Framework | ^12.0 | Backend application framework | Already in production; the entire codebase is built on it. No reason to change. |
| PHP | ^8.2 | Runtime | Already in production. Supports enums, readonly properties, named arguments, fibers — all useful for the statistical code. |
| MySQL | existing | Primary database | Stores `questions`, `papers`, `paper_questions`, `question_difficulty_calibrations`, `mistake_records`. All calibration data lives here. |
| Redis + Predis | existing | Cache + queue | Used for baseline caching (`$this->baselineCache`) and as the async job queue for `AssembleExamTaskJob`. |

### Statistical Validation — Primary Addition

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| markrogoyski/math-php | ^2.13 | Pure-PHP math/statistics library | The only viable PHP-native option. Provides correlation, significance testing, probability distributions (Beta, Normal, Binomial, Student's t), descriptive statistics, and ANOVA. No external dependencies. Required for Brier score decomposition, confidence intervals, and backtest validation. |

Confidence: HIGH — Verified on Packagist (v2.13.0, actively maintained, 3k+ GitHub stars, pure PHP, no C extensions).

#### What math-php provides that the project needs

  1. Correlation & Significance Testing

    • Correlation::pearson() — validate calibrated difficulty vs empirical error rate (the analyzer already has a custom Pearson implementation; math-php's version is battle-tested and produces p-values)
    • Significance::rCritical() / Significance::tpValue() — determine whether observed correlations are statistically significant, not just numerically large
    • Correlation::spearman() — rank-based correlation, more robust for ordinal difficulty categories (0-4) than Pearson
  2. Descriptive Statistics

    • Descriptive::standardDeviation(), Descriptive::mean(), Descriptive::median(), Descriptive::interquartileRange() — for bin analysis and distribution sanity checks
    • Descriptive::coefficientOfVariation() — compare calibration stability across question types
  3. Probability Distributions

    • BetaDistribution — directly supports the Beta(2,2) prior used in stratified_residual_eb_v2; enables computing credible intervals around calibrated difficulty
    • NormalDistribution — for confidence interval construction around error rates
    • BinomialDistribution — for computing exact probability of observed correct/wrong counts given hypothesized difficulty
  4. ANOVA

    • ANOVA::oneWay() — test whether calibration deltas differ significantly across question types or difficulty categories
  5. Regression

    • LinearRegression::create() — for fitting calibration quality over time (is the system getting better?)
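
A minimal usage sketch of the calls above, against the math-php v2.x API; the input arrays are invented for illustration:

```php
<?php
// Sketch only: illustrates the math-php calls listed above on made-up data.

require 'vendor/autoload.php';

use MathPHP\Statistics\ANOVA;
use MathPHP\Statistics\Average;
use MathPHP\Statistics\Correlation;
use MathPHP\Statistics\Descriptive;
use MathPHP\Statistics\Regression\Linear;

$calibrated = [0.30, 0.45, 0.55, 0.70, 0.80]; // calibrated difficulty per question
$empirical  = [0.28, 0.50, 0.52, 0.75, 0.78]; // observed error rate per question

$pearson  = Correlation::r($calibrated, $empirical);            // Pearson r
$spearman = Correlation::spearmansRho($calibrated, $empirical); // rank correlation

$mean = Average::mean($empirical);
$sd   = Descriptive::standardDeviation($empirical);

// One-way ANOVA across three (invented) groups of calibration deltas
$anova = ANOVA::oneWay([0.01, 0.02, 0.03], [0.05, 0.04, 0.06], [0.02, 0.01, 0.02]);

// Linear trend of Brier score over time: points are [week, brier]
$trend = new Linear([[1, 0.22], [2, 0.20], [3, 0.19], [4, 0.17]]);
$slope = $trend->getParameters()['m']; // negative slope means Brier is improving
```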

### Testing Infrastructure (existing — extend)

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| PHPUnit | ^11.5.3 | Test framework | Already in require-dev. Extend with calibration validation tests. |
| Mockery | ^1.6 | Test mocking | Already in require-dev. For mocking DB queries in unit tests. |

### Backtesting Infrastructure (new — build internally)

| Technology | Version | Purpose | Why |
| --- | --- | --- | --- |
| Custom Artisan command | new | `questions:difficulty-backtest` | Historical replay of the calibration algorithm. Split answer data chronologically, run the algorithm on the first N%, measure prediction accuracy on the remaining data. No external library — this is domain-specific and should leverage the existing `QuestionDifficultyCalibrationAnalyzer`. |
| Custom PHPUnit test suite | new | `tests/Feature/DifficultyCalibrationBacktestTest.php` | Automated regression tests for calibration accuracy. Run in CI to catch algorithm regressions. |
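
A sketch of what such a regression test could look like. The `App\Services\DifficultyBacktester` service and its report keys are hypothetical placeholders; only the asserted thresholds come from the validation criteria in Q2 below:

```php
<?php
// Sketch only: DifficultyBacktester and the report-array keys are hypothetical.

namespace Tests\Feature;

use Tests\TestCase;

class DifficultyCalibrationBacktestTest extends TestCase
{
    public function test_calibration_beats_original_difficulty(): void
    {
        // Hypothetical service wrapping the questions:difficulty-backtest logic
        $report = app(\App\Services\DifficultyBacktester::class)
            ->run(trainRatio: 0.7, minAttempts: 10);

        $this->assertGreaterThan(0.05, $report['brier_skill_score']);
        $this->assertGreaterThan(0.3, $report['pearson_correlation']);
        $this->assertLessThan(0.05, abs($report['calibration_in_the_large']));
    }
}
```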

## Alternatives Considered

### Statistical Libraries

| Category | Recommended | Alternative | Why Not |
| --- | --- | --- | --- |
| PHP math library | math-php ^2.13 | Write custom implementations | The codebase already has a custom `pearsonCorrelation()` in the analyzer. math-php is more reliable, handles edge cases (zero-division, small samples), and provides p-values. Rewriting statistical functions is error-prone and wastes time. |
| PHP math library | math-php ^2.13 | phpscience/statistics | Stale, fewer features, lower community adoption. math-php has 10x the feature coverage. |
| PHP math library | math-php ^2.13 | Call Python/R via shell | Adds a runtime dependency on an external language, introduces serialization overhead and deployment complexity. The math needed here is not computationally heavy — pure PHP is sufficient. |
| PHP math library | math-php ^2.13 | Call an external stats API | Unnecessary network dependency for what are fundamentally simple statistical computations. Adds latency and failure modes. |

### IRT / CAT Libraries

| Category | Decision | Alternative | Why Not |
| --- | --- | --- | --- |
| Full IRT framework | Do not adopt | `irt` (R), `py-irt` (Python) | The project uses a custom `stratified_residual_eb_v2` algorithm, not standard IRT. Adopting a full IRT framework would mean rewriting the calibration pipeline. The current algorithm works — validate it, don't replace it. |
| CAT engine | Do not adopt now | `catsim` (Python), `mirtCAT` (R) | The exam assembly pipeline (`IntelligentExamController`) already has a working strategy-based approach. CAT requires a fundamentally different architecture (sequential item selection with real-time ability estimation). This is a future consideration, not current scope. |
| Adaptive testing | Defer to Phase 3+ | Standard CAT algorithms | The current priority is validating calibration and wiring it into the existing assembly pipeline. Adaptive testing is the natural evolution but depends on validated calibration first. |

### Backtesting Approaches

| Category | Recommended | Alternative | Why Not |
| --- | --- | --- | --- |
| Chronological split | Build in-house | k-fold cross-validation | Time-series data (student answers) has temporal dependencies; random shuffling leaks future information. A chronological split respects the time-dependent nature of calibration updates. |
| Walk-forward validation | Build in-house | Single train/test split | Walk-forward better simulates the real online update pattern (`updateOnlineFromPaper`). The algorithm updates incrementally per grading event — a single split misses this dynamic. |
| Brier score as primary metric | Use existing + extend | MSE / RMSE | The Brier score is already implemented in the calibration service (`buildUpdateEvent`). It is the proper scoring rule for probabilistic predictions (difficulty = predicted error probability). MSE treats difficulty as a point estimate, losing the probabilistic interpretation. |
| Brier score decomposition | Use math-php | Manual calculation | Decomposition into uncertainty, reliability, and resolution requires non-trivial binning and statistics. math-php provides the building blocks; build the decomposition logic on top. |

## Installation

```bash
# Core statistical library — only new dependency
composer require markrogoyski/math-php:^2.13

# Dev dependencies (already installed, no changes needed)
# phpunit/phpunit ^11.5.3
# mockery/mockery ^1.6
```

## How Each Research Question Is Addressed

### Q1: Statistical Validation Frameworks for IRT and Difficulty Calibration

Recommendation: Use math-php for validation metrics, not a full IRT framework.

The existing stratified_residual_eb_v2 algorithm is a proprietary hybrid (not standard 1PL/2PL/3PL IRT). Validation should focus on:

  1. Calibration Accuracy Metrics (all computable with math-php + existing code):

    • Brier Score — already implemented in buildUpdateEvent(). Extend with decomposition: Brier = Uncertainty - Resolution + Reliability. Lower reliability = better calibrated. Higher resolution = more discriminating.
    • Pearson correlation (calibrated difficulty vs empirical error rate) — already in the analyzer. Add a significance check via math-php's Significance class (z- and t-tests).
    • Spearman rank correlation — add via math-php Correlation::spearmansRho(). More appropriate for the 5-level difficulty categories.
    • Calibration-in-the-large — compare mean predicted difficulty to mean observed error rate. Simple but catches systematic bias.
    • Calibration curves — bin predictions into deciles, plot predicted vs observed error rate. Visual diagnostic built into the backtest command.
  2. Confidence Intervals (new, using math-php; see the sketch after this list):

    • Beta posterior credible interval around each calibrated difficulty value — Continuous\Beta with parameters (weighted_wrong + 2, weighted_attempts - weighted_wrong + 2)
    • If the credible interval is wide (low sample), the calibration should carry a confidence flag that the exam assembly pipeline can use to fall back to the original difficulty
  3. Health Monitoring (already partially implemented):

    • getHealthScaleForType() monitors recent Brier score and log-loss trends
    • Extend with formal statistical process control: track Brier score over rolling windows, flag when it exceeds 2 standard deviations from historical mean
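
A minimal sketch of the credible-interval check from item 2. The weighted counts are invented; the Beta(2,2) prior and posterior parameterization come from the list above, `inverse()` is math-php's numerical inverse CDF, and the 0.25 width threshold is purely illustrative:

```php
<?php
// Sketch only: invented counts, illustrative threshold.

require 'vendor/autoload.php';

use MathPHP\Probability\Distribution\Continuous\Beta;

$weightedWrong    = 14.0;
$weightedAttempts = 40.0;

// Posterior: Beta(weighted_wrong + 2, weighted_attempts - weighted_wrong + 2)
$posterior = new Beta($weightedWrong + 2, $weightedAttempts - $weightedWrong + 2);

// 95% credible interval bounds via the inverse CDF
$lower = $posterior->inverse(0.025);
$upper = $posterior->inverse(0.975);

// A wide interval means a low-sample, low-confidence calibration; flag it so
// exam assembly can fall back to the original difficulty.
$lowConfidence = ($upper - $lower) > 0.25;
```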

Confidence: HIGH — These are standard psychometric validation techniques. math-php covers the computational needs. The existing codebase already implements the data pipeline.

### Q2: Backtesting Approaches

Recommendation: Chronological walk-forward backtesting using existing data tables.

The system has historical answer data in paper_questions (with is_correct, graded_at) and papers (with difficulty_category). This is sufficient for backtesting.

Approach — Walk-Forward Validation:

  1. Data preparation: Query all graded paper_questions ordered by graded_at ascending
  2. Temporal split: Use first 70% chronologically as calibration training data, last 30% as holdout
  3. Replay: Run estimateByStratifiedResidual() on training data to produce calibrated difficulties
  4. Evaluate: For each holdout answer, compute Brier score using the calibrated difficulty as predicted error probability
  5. Baseline comparison: Also compute Brier score using original questions.difficulty on the same holdout set
  6. Report: Brier Skill Score = (Brier_original - Brier_calibrated) / Brier_original. Positive = calibration improved predictions. Negative = calibration made things worse.
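
A sketch of steps 4 to 6. Each holdout row pairs the predicted error probability with the observed outcome (1 = wrong, 0 = correct); the row shape and all numbers are invented for illustration:

```php
<?php
// Sketch only: Brier score on a holdout set plus the Brier Skill Score.

function brierScore(array $rows): float
{
    $sum = 0.0;
    foreach ($rows as [$predicted, $observed]) {
        $sum += ($predicted - $observed) ** 2;
    }

    return $sum / count($rows);
}

// Same holdout answers scored with calibrated vs. original difficulty
$brierCalibrated = brierScore([[0.35, 0], [0.80, 1], [0.80, 1], [0.20, 0]]);
$brierOriginal   = brierScore([[0.50, 0], [0.50, 1], [0.50, 1], [0.50, 0]]);

// Brier Skill Score: positive means calibration improved predictions
$skill = ($brierOriginal - $brierCalibrated) / $brierOriginal;
```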

Implementation plan:

  • New Artisan command questions:difficulty-backtest with options for --train-ratio, --min-attempts, --since
  • Reuses existing QuestionDifficultyCalibrationAnalyzer for per-question aggregation
  • Adds math-php Descriptive statistics for computing Brier components, confidence intervals
  • Outputs comparison table + JSON report
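
A skeleton of the proposed command under those assumptions; the option names come from the plan above, while the `handle()` body is a placeholder, not the real backtest implementation:

```php
<?php
// Sketch only: command name and options from this document, body is stubbed.

namespace App\Console\Commands;

use Illuminate\Console\Command;

class DifficultyBacktest extends Command
{
    protected $signature = 'questions:difficulty-backtest
                            {--train-ratio=0.7 : Chronological share used for calibration}
                            {--min-attempts=10 : Minimum graded attempts per question}
                            {--since= : Only consider answers graded after this date}';

    protected $description = 'Walk-forward backtest of difficulty calibration';

    public function handle(): int
    {
        $trainRatio = (float) $this->option('train-ratio');

        // 1) load graded paper_questions ordered by graded_at
        // 2) split chronologically, replay calibration, score the holdout
        // 3) print comparison table and write the JSON report

        $this->info("Backtesting with train ratio {$trainRatio}...");

        return self::SUCCESS;
    }
}
```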

Key validation criteria:

  • Brier Skill Score > 0.05 (calibration provides meaningful improvement)
  • Pearson correlation of calibrated difficulty vs empirical error rate > 0.3 (moderate positive relationship)
  • No systematic bias: calibration-in-the-large difference < 0.05
  • Calibration improves for questions with >= 10 attempts (minimum sample for meaningful update)

Confidence: HIGH — The data exists in the database. The algorithm is already implemented. This is purely a validation harness.

### Q3: Adaptive Testing Algorithms for Future Phases

Recommendation: Do not implement CAT now. Understand the landscape for Phase 3+ planning.

The current exam assembly pipeline uses a strategy-based approach: IntelligentExamController creates an AssembleExamTaskJob which selects questions based on knowledge points, chapters, difficulty categories, and distribution rules. This is a fixed-form assembly approach — all questions are determined before the student starts.

Standard CAT algorithms for reference (future use only):

  1. Item Selection: Maximum Fisher Information — select the item that provides the most information about the student's ability at the current theta estimate. Requires IRT item parameters (a, b, c for the 3PL model). The current stratified_residual_eb_v2 difficulty could serve as the b-parameter (see the sketch after this list).

  2. Ability Estimation:

    • MLE (Maximum Likelihood Estimation) — no prior, can diverge with all-correct/all-wrong patterns
    • EAP (Expected A Posteriori) — Bayesian, uses prior distribution. More stable. The existing Beta(2,2) prior in the calibration algorithm is conceptually similar.
    • MAP (Maximum A Posteriori) — similar to EAP but uses mode instead of mean
  3. Termination Criteria:

    • Standard Error threshold (stop when ability estimate precision is sufficient)
    • Fixed-length (current approach — stop after N questions)
    • SPRT (Sequential Probability Ratio Test) — stop when enough evidence accumulated to classify into mastery category
  4. Exposure Control:

    • Randomesque — randomly select from K best items instead of the single best
    • Sympson-Hetter — probabilistically control whether a selected item is actually administered
    • Shadow testing — ensure remaining pool can still produce a valid exam
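
For reference only, a sketch of the maximum-information selection named in item 1, using the standard 3PL item information formula I(theta) = a^2 * (Q/P) * ((P - c) / (1 - c))^2; the item parameters and theta value are invented:

```php
<?php
// Sketch only: standard 3PL formulas on invented item parameters.

// 3PL response probability: guessing floor c plus a logistic in (theta - b)
function threePlProbability(float $theta, float $a, float $b, float $c): float
{
    return $c + (1 - $c) / (1 + exp(-$a * ($theta - $b)));
}

// Fisher information of one item at ability theta
function itemInformation(float $theta, float $a, float $b, float $c): float
{
    $p = threePlProbability($theta, $a, $b, $c);
    $q = 1 - $p;

    return $a ** 2 * ($q / $p) * (($p - $c) / (1 - $c)) ** 2;
}

// Pick the most informative item at the current ability estimate; the
// calibrated difficulty could serve as $b, as noted above.
$items = [['a' => 1.2, 'b' => -0.5, 'c' => 0.20], ['a' => 0.9, 'b' => 0.3, 'c' => 0.25]];
$theta = 0.1;

usort($items, fn ($x, $y) => itemInformation($theta, $y['a'], $y['b'], $y['c'])
                         <=> itemInformation($theta, $x['a'], $x['b'], $x['c']));
$best = $items[0];
```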

Why not CAT now: CAT requires (1) validated item parameters, (2) real-time ability estimation, (3) sequential item selection. The current system has none of these. Validating calibration first (Phase 1) gives us (1). Then connecting calibration to exam assembly (Phase 2) improves fixed-form quality. CAT (Phase 3+) would require architectural changes to the exam flow.

Confidence: MEDIUM — CAT theory is well-established in psychometric literature. Application to this specific K12 math context would need adaptation. The assessment is based on training data + Wikipedia verification of CAT components.

### Q4: PHP/Laravel Libraries for Statistical Analysis

Recommendation: math-php ^2.13 as the sole addition.

| Need | Solution | Source / Effort |
| --- | --- | --- |
| Pearson/Spearman correlation | math-php `Correlation` | Built-in |
| P-values / significance | math-php `Significance` | Built-in |
| Beta distribution (for credible intervals) | math-php `Continuous\Beta` | Built-in |
| Normal distribution (for confidence intervals) | math-php `Continuous\Normal` | Built-in |
| Binomial distribution (for exact probability tests) | math-php `Discrete\Binomial` | Built-in |
| Descriptive statistics | math-php `Descriptive` / `Average` | Built-in |
| ANOVA (category comparison) | math-php `ANOVA` | Built-in |
| Linear regression (trend analysis) | math-php `Regression\Linear` | Built-in |
| Brier score computation | Custom code (trivial: mean((predicted - observed)^2)) | ~5 lines |
| Brier score decomposition | Custom code using math-php `Descriptive` | ~30 lines |
| Walk-forward backtest engine | Custom Artisan command | ~200 lines |
| Calibration curve generation | Custom code in backtest command | ~50 lines |
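
As an example of the custom pieces, a sketch of calibration-curve generation: decile binning of predicted error probability versus observed error rate, using the same invented (predicted, observed) row convention as the Q2 sketch. A well-calibrated model gives mean_predicted close to observed_rate in every bin:

```php
<?php
// Sketch only: rows pair predicted error probability with observed outcome.

function calibrationCurve(array $rows, int $bins = 10): array
{
    $curve = array_fill(0, $bins, ['predicted' => [], 'observed' => []]);

    // Assign each prediction to its decile bin
    foreach ($rows as [$predicted, $observed]) {
        $i = min($bins - 1, (int) floor($predicted * $bins));
        $curve[$i]['predicted'][] = $predicted;
        $curve[$i]['observed'][]  = $observed;
    }

    // Per bin: mean predicted error probability vs. observed error rate
    return array_map(
        fn ($bin) => $bin['predicted'] === [] ? null : [
            'mean_predicted' => array_sum($bin['predicted']) / count($bin['predicted']),
            'observed_rate'  => array_sum($bin['observed']) / count($bin['observed']),
        ],
        $curve,
    );
}
```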

No other PHP libraries needed. The statistical requirements of this project are well within math-php's capabilities. Adding more libraries increases dependency surface area without meaningful benefit.

## What NOT to Use

| Technology | Why Not |
| --- | --- |
| Full IRT packages (R `ltm`, Python `py-irt`) | Would require rewriting the custom calibration algorithm. The algorithm works — validate it, don't replace it. |
| Machine learning frameworks (TensorFlow PHP, PHP-ML) | Overkill for what is fundamentally a statistical estimation problem. PHP-ML has limited statistical tools and poorer documentation than math-php. |
| External statistical services / APIs | Adds network dependency, latency, and deployment complexity for computations that take microseconds locally. |
| Custom C extensions for math | math-php is pure PHP with no dependencies. Performance is adequate for the data volumes in K12 education (thousands to tens of thousands of answer records, not millions). |
| `catsim` or any CAT library | Wrong phase. The system needs validated calibration before adaptive testing makes sense. |
| phpscience/statistics | Stale, fewer features, less maintained than math-php. |
| Any Python/R bridge | Adds deployment complexity. The math is simple enough for PHP. |

## Version Pinning Strategy

```bash
composer require markrogoyski/math-php:^2.13
```

Use a caret (^) version constraint. math-php follows semantic versioning — ^2.13 allows 2.13.x and 2.14+ but not 3.0. The library has been stable across 2.x with no breaking changes in minor versions.

## Migration Path

  1. Phase 1 (Validation): Install math-php. Build backtest command. Validate calibration accuracy. No changes to production code path.
  2. Phase 2 (Integration): Wire validated calibration into exam assembly via existing QuestionDifficultyResolver. Still using math-php only for monitoring/reporting.
  3. Phase 3 (Intelligence): Add mastery-based difficulty recommendation. Potentially introduce simplified adaptive strategies. math-php supports the statistical needs throughout.

## Sources

  • markrogoyski/math-php v2.13.0 on Packagist: https://packagist.org/packages/markrogoyski/math-php — HIGH confidence, verified directly
  • IRT theory: Wikipedia Item Response Theory article — HIGH confidence, well-established psychometric theory
  • CAT algorithms: Wikipedia Computerized Adaptive Testing article — HIGH confidence, standard reference
  • Brier Score: Wikipedia Brier Score article — HIGH confidence, proper scoring rule definition and decomposition
  • Existing codebase analysis: QuestionDifficultyCalibrationService.php (973 lines), QuestionDifficultyCalibrationAnalyzer.php (608 lines), IntelligentExamController.php (1267 lines), DifficultyDistributionService.php (219 lines), QuestionDifficultyResolver.php (88 lines) — HIGH confidence, direct code inspection