# Technology Stack

**Project:** Math CMS — Difficulty Calibration & Intelligent Exam
**Researched:** 2026-04-16

## Recommended Stack

### Core Framework (existing — no changes)

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| Laravel Framework | ^12.0 | Backend application framework | Already in production; the entire codebase is built on it. No reason to change. |
| PHP | ^8.2 | Runtime | Already in production. Supports enums, readonly properties, named arguments, and fibers — all useful for the statistical code. |
| MySQL | existing | Primary database | Stores `questions`, `papers`, `paper_questions`, `question_difficulty_calibrations`, `mistake_records`. All calibration data lives here. |
| Redis + Predis | existing | Cache + queue | Used for baseline caching (`$this->baselineCache`) and the async job queue for `AssembleExamTaskJob`. |

### Statistical Validation — Primary Addition

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| **markrogoyski/math-php** | ^2.13 | Pure-PHP math/statistics library | **The only viable PHP-native option.** Provides correlation, significance testing, probability distributions (Beta, Normal, Binomial, Student's t), descriptive statistics, and ANOVA. No external dependencies. Required for Brier score decomposition, confidence intervals, and backtest validation. |

**Confidence: HIGH** — Verified on Packagist (v2.13.0, actively maintained, 3k+ GitHub stars, pure PHP, no C extensions).

#### What math-php provides that the project needs
1. **Correlation & Significance Testing**
   - `Correlation::pearson()` — validate calibrated difficulty vs empirical error rate (the analyzer already has a custom Pearson implementation; math-php's version is battle-tested and produces p-values)
   - `Significance::rCritical()` / `Significance::tpValue()` — determine whether observed correlations are statistically significant, not just numerically large
   - `Correlation::spearman()` — rank-based correlation, more robust for the ordinal difficulty categories (0-4) than Pearson
2. **Descriptive Statistics**
   - `Descriptive::standardDeviation()`, `Descriptive::mean()`, `Descriptive::median()`, `Descriptive::interquartileRange()` — for bin analysis and distribution sanity checks
   - `Descriptive::coefficientOfVariation()` — compare calibration stability across question types
3. **Probability Distributions**
   - `BetaDistribution` — directly supports the Beta(2,2) prior used in `stratified_residual_eb_v2`; enables computing credible intervals around calibrated difficulty
   - `NormalDistribution` — for confidence interval construction around error rates
   - `BinomialDistribution` — for computing the exact probability of observed correct/wrong counts given a hypothesized difficulty
4. **ANOVA**
   - `ANOVA::oneWay()` — test whether calibration deltas differ significantly across question types or difficulty categories
5. **Regression**
   - `LinearRegression::create()` — for fitting calibration quality over time (is the system getting better?)

### Testing Infrastructure (existing — extend)

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| PHPUnit | ^11.5.3 | Test framework | Already in `require-dev`. Extend with calibration validation tests. |
| Mockery | ^1.6 | Test mocking | Already in `require-dev`. For mocking DB queries in unit tests. |
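The correlation checks in the feature list above are simple enough to prototype before wiring in math-php. Below is a minimal, language-agnostic sketch (Python for illustration only; production code would call math-php's `Correlation` and `Significance` classes). The `difficulty` and `error_rate` arrays are made-up data, not project values.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson on the ranks (no tie correction)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

def t_statistic(r, n):
    """t statistic for H0: rho = 0; compare against t critical values at df = n - 2."""
    return r * math.sqrt((n - 2) / (1 - r * r))

# Hypothetical data: calibrated difficulty vs empirical error rate per question.
difficulty = [0.2, 0.35, 0.4, 0.55, 0.6, 0.75, 0.8, 0.9]
error_rate = [0.15, 0.30, 0.45, 0.50, 0.65, 0.70, 0.85, 0.88]

r = pearson(difficulty, error_rate)
rho = spearman(difficulty, error_rate)
t = t_statistic(r, len(difficulty))
print(f"pearson={r:.3f} spearman={rho:.3f} t={t:.2f}")
```

Note how Spearman only sees the rank order: it is insensitive to the exact spacing of the 0-4 categories, which is why the document prefers it for ordinal comparisons.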
### Backtesting Infrastructure (new — build internally)

| Technology | Version | Purpose | Why |
|------------|---------|---------|-----|
| Custom Artisan Command | new | `questions:difficulty-backtest` | Historical replay of the calibration algorithm. Split answer data chronologically, run the algorithm on the first N%, measure prediction accuracy on the remaining data. No external library — this is domain-specific and should leverage the existing `QuestionDifficultyCalibrationAnalyzer`. |
| Custom PHPUnit Test Suite | new | `tests/Feature/DifficultyCalibrationBacktestTest.php` | Automated regression tests for calibration accuracy. Run in CI to catch algorithm regressions. |

## Alternatives Considered

### Statistical Libraries

| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| PHP math library | math-php ^2.13 | Write custom implementations | The codebase already has a custom `pearsonCorrelation()` in the analyzer. math-php is more reliable, handles edge cases (zero division, small samples), and provides p-values. Rewriting statistical functions is error-prone and wastes time. |
| PHP math library | math-php ^2.13 | `phpscience/statistics` | Stale, fewer features, lower community adoption. math-php has 10x the feature coverage. |
| PHP math library | math-php ^2.13 | Call Python/R via shell | Adds a runtime dependency on an external language, introduces serialization overhead and deployment complexity. The math needed here is not computationally heavy — pure PHP is sufficient. |
| PHP math library | math-php ^2.13 | Call an external stats API | An unnecessary network dependency for what are fundamentally simple statistical computations. Adds latency and failure modes. |

### IRT / CAT Libraries

| Category | Decision | Alternative | Why Not |
|----------|----------|-------------|---------|
| Full IRT framework | **Do not adopt** | `irt` R package, `py-irt` Python | The project uses a custom `stratified_residual_eb_v2` algorithm, not standard IRT. Adopting a full IRT framework would mean rewriting the calibration pipeline. The current algorithm works — validate it, don't replace it. |
| CAT engine | **Do not adopt now** | `catsim` Python, `mirtCAT` R | The exam assembly pipeline (`IntelligentExamController`) already has a working strategy-based approach. CAT requires a fundamentally different architecture (sequential item selection with real-time ability estimation). This is a future consideration, not current scope. |
| Adaptive testing | **Defer to Phase 3+** | Standard CAT algorithms | The current priority is validating calibration and wiring it into the existing assembly pipeline. Adaptive testing is the natural evolution, but it depends on validated calibration first. |

### Backtesting Approaches

| Category | Recommended | Alternative | Why Not |
|----------|-------------|-------------|---------|
| Chronological split | Build in-house | k-fold cross-validation | Time-series data (student answers) has temporal dependencies. Random shuffling leaks future information. A chronological split respects the time-dependent nature of calibration updates. |
| Walk-forward validation | Build in-house | Single train/test split | Walk-forward better simulates the real online update pattern (`updateOnlineFromPaper`). The algorithm updates incrementally per grading event — a single split misses these dynamics. |
| Brier score as primary metric | Use existing + extend | MSE / RMSE | The Brier score is already implemented in the calibration service (`buildUpdateEvent`). It is the proper scoring rule for probabilistic predictions (difficulty = predicted error probability). MSE treats difficulty as a point estimate, losing the probabilistic interpretation. |
| Brier score decomposition | Use math-php | Manual calculation | Decomposition into Uncertainty + Reliability + Resolution requires non-trivial binning and statistics. math-php provides the building blocks; build the decomposition logic on top. |
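To make the chronological-split rationale concrete, here is a minimal sketch of a walk-forward splitter that never lets training data see the future (Python for illustration only; the record tuples are hypothetical stand-ins for `paper_questions` rows):

```python
# Hypothetical answer records: (graded_at, question_id, is_correct).
records = [
    ("2026-01-05", 1, 1), ("2026-01-09", 2, 0), ("2026-01-12", 1, 0),
    ("2026-02-01", 3, 1), ("2026-02-10", 2, 1), ("2026-02-15", 1, 1),
    ("2026-03-03", 3, 0), ("2026-03-20", 2, 0), ("2026-04-01", 1, 0),
    ("2026-04-10", 3, 1),
]

def walk_forward_splits(records, initial_ratio=0.5, step=2):
    """Yield (train, test) pairs that respect time order.

    Train always ends before test begins: unlike random k-fold,
    no future answer can leak into the calibration data.
    """
    ordered = sorted(records, key=lambda r: r[0])  # sort by graded_at
    start = int(len(ordered) * initial_ratio)
    for cut in range(start, len(ordered), step):
        train, test = ordered[:cut], ordered[cut:cut + step]
        if test:
            yield train, test

for i, (train, test) in enumerate(walk_forward_splits(records)):
    print(f"fold {i}: train={len(train)} test={len(test)}")
```

Each fold grows the training window and evaluates on the next slice, which mirrors the incremental `updateOnlineFromPaper` pattern better than a single 70/30 split.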
## Installation

```bash
# Core statistical library — the only new dependency
composer require markrogoyski/math-php:^2.13

# Dev dependencies (already installed, no changes needed)
# phpunit/phpunit ^11.5.3
# mockery/mockery ^1.6
```

## How Each Research Question Is Addressed

### Q1: Statistical Validation Frameworks for IRT and Difficulty Calibration

**Recommendation: Use math-php for validation metrics, not a full IRT framework.**

The existing `stratified_residual_eb_v2` algorithm is a proprietary hybrid (not standard 1PL/2PL/3PL IRT). Validation should focus on:

1. **Calibration Accuracy Metrics** (all computable with math-php + existing code):
   - **Brier Score** — already implemented in `buildUpdateEvent()`. Extend with the decomposition Brier = Uncertainty - Resolution + Reliability. Lower reliability = better calibrated; higher resolution = more discriminating.
   - **Pearson correlation** (calibrated difficulty vs empirical error rate) — already in the analyzer. Add a p-value via math-php `Significance::tpValue()`.
   - **Spearman rank correlation** — add via math-php. More appropriate for the 5-level difficulty categories.
   - **Calibration-in-the-large** — compare mean predicted difficulty to mean observed error rate. Simple, but catches systematic bias.
   - **Calibration curves** — bin predictions into deciles, plot predicted vs observed error rate. A visual diagnostic built into the backtest command.
2. **Confidence Intervals** (new, using math-php):
   - Beta posterior credible interval around each calibrated difficulty value — `BetaDistribution` with parameters `(weighted_wrong + 2, weighted_attempts - weighted_wrong + 2)`
   - If the credible interval is wide (low sample), the calibration should carry a confidence flag that the exam assembly pipeline can use to fall back to the original difficulty
3. **Health Monitoring** (already partially implemented):
   - `getHealthScaleForType()` monitors recent Brier score and log-loss trends
   - Extend with formal statistical process control: track the Brier score over rolling windows and flag when it exceeds 2 standard deviations from the historical mean

**Confidence: HIGH** — These are standard psychometric validation techniques. math-php covers the computational needs, and the existing codebase already implements the data pipeline.

### Q2: Backtesting Approaches

**Recommendation: Chronological walk-forward backtesting using existing data tables.**

The system has historical answer data in `paper_questions` (with `is_correct`, `graded_at`) and `papers` (with `difficulty_category`). This is sufficient for backtesting.

**Approach — Walk-Forward Validation:**

1. **Data preparation**: Query all graded `paper_questions` ordered by `graded_at` ascending
2. **Temporal split**: Use the first 70% chronologically as calibration training data, the last 30% as holdout
3. **Replay**: Run `estimateByStratifiedResidual()` on the training data to produce calibrated difficulties
4. **Evaluate**: For each holdout answer, compute the Brier score using the calibrated difficulty as the predicted error probability
5. **Baseline comparison**: Also compute the Brier score using the original `questions.difficulty` on the same holdout set
6. **Report**: Brier Skill Score = `(Brier_original - Brier_calibrated) / Brier_original`. Positive = calibration improved predictions; negative = calibration made things worse.
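Steps 4-6 amount to only a few lines of arithmetic. A sketch (Python for illustration only; the holdout arrays are hypothetical, with outcome 1 meaning the student answered wrong):

```python
def brier(predicted, observed):
    """Mean squared difference between the predicted error probability
    and the 0/1 outcome (1 = student answered wrong)."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)

# Hypothetical holdout set: calibrated vs original difficulty per answer.
calibrated = [0.8, 0.7, 0.3, 0.2, 0.6]
original   = [0.5, 0.5, 0.5, 0.5, 0.5]
observed   = [1,   1,   0,   0,   1]

b_cal = brier(calibrated, observed)
b_orig = brier(original, observed)

# Brier Skill Score: positive means calibration improved predictions.
bss = (b_orig - b_cal) / b_orig
print(f"calibrated={b_cal:.3f} original={b_orig:.3f} BSS={bss:.3f}")
```

In this toy holdout the calibrated difficulties track the outcomes closely, so the skill score is strongly positive; the real command would report the same three numbers over the full 30% holdout.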
**Implementation plan:**

- New Artisan command `questions:difficulty-backtest` with options for `--train-ratio`, `--min-attempts`, `--since`
- Reuses the existing `QuestionDifficultyCalibrationAnalyzer` for per-question aggregation
- Adds math-php `Descriptive` statistics for computing Brier components and confidence intervals
- Outputs a comparison table + JSON report

**Key validation criteria:**

- Brier Skill Score > 0.05 (calibration provides meaningful improvement)
- Pearson correlation of calibrated difficulty vs empirical error rate > 0.3 (moderate positive relationship)
- No systematic bias: calibration-in-the-large difference < 0.05
- Calibration improves for questions with >= 10 attempts (the minimum sample for a meaningful update)

**Confidence: HIGH** — The data exists in the database, and the algorithm is already implemented. This is purely a validation harness.

### Q3: Adaptive Testing Algorithms for Future Phases

**Recommendation: Do not implement CAT now. Understand the landscape for Phase 3+ planning.**

The current exam assembly pipeline uses a **strategy-based approach**: `IntelligentExamController` creates an `AssembleExamTaskJob` which selects questions based on knowledge points, chapters, difficulty categories, and distribution rules. This is a **fixed-form assembly** approach — all questions are determined before the student starts.

**Standard CAT algorithms for reference (future use only):**

1. **Item Selection**: Maximum Fisher Information — select the item that provides the most information about the student's ability at the current theta estimate. Requires IRT item parameters (a, b, c for the 3PL model). The current `stratified_residual_eb_v2` difficulty could serve as the b-parameter.
2. **Ability Estimation**:
   - MLE (Maximum Likelihood Estimation) — no prior; can diverge with all-correct/all-wrong patterns
   - EAP (Expected A Posteriori) — Bayesian, uses a prior distribution. More stable.
     The existing Beta(2,2) prior in the calibration algorithm is conceptually similar.
   - MAP (Maximum A Posteriori) — similar to EAP but uses the mode instead of the mean
3. **Termination Criteria**:
   - Standard error threshold (stop when the ability estimate is sufficiently precise)
   - Fixed-length (the current approach — stop after N questions)
   - SPRT (Sequential Probability Ratio Test) — stop when enough evidence has accumulated to classify into a mastery category
4. **Exposure Control**:
   - Randomesque — randomly select from the K best items instead of the single best
   - Sympson-Hetter — probabilistically control whether a selected item is actually administered
   - Shadow testing — ensure the remaining pool can still produce a valid exam

**Why not CAT now**: CAT requires (1) validated item parameters, (2) real-time ability estimation, and (3) sequential item selection. The current system has none of these. Validating calibration first (Phase 1) gives us (1). Connecting calibration to exam assembly (Phase 2) then improves fixed-form quality. CAT (Phase 3+) would require architectural changes to the exam flow.

**Confidence: MEDIUM** — CAT theory is well-established in the psychometric literature. Application to this specific K12 math context would need adaptation. The assessment is based on training data plus Wikipedia verification of CAT components.
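For Phase 3+ reference only, maximum-information item selection (point 1 above) can be sketched under a 2PL simplification (Python for illustration; the `a`/`b` parameters, pool, and question IDs are all hypothetical, and nothing in the current system supplies them yet):

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * P * (1 - P)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical pool: (question_id, a = discrimination, b = difficulty).
# As noted above, the calibrated difficulty could eventually serve as b.
pool = [(101, 1.2, -1.0), (102, 0.8, 0.0), (103, 1.5, 0.2), (104, 1.0, 1.5)]

theta = 0.3  # current ability estimate
best = max(pool, key=lambda item: item_information(theta, item[1], item[2]))
print("next item:", best[0])
```

Information peaks where the item difficulty sits near the current theta and discrimination is high, which is why the selector favors item 103 here; exposure-control methods like randomesque would pick randomly among the top K instead of always taking the single maximum.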
### Q4: PHP/Laravel Libraries for Statistical Analysis

**Recommendation: math-php ^2.13 as the sole addition.**

| Need | Solution | Source |
|------|----------|--------|
| Pearson/Spearman correlation | math-php `Correlation` | Built-in |
| P-values / significance | math-php `Significance` | Built-in |
| Beta distribution (for credible intervals) | math-php `BetaDistribution` | Built-in |
| Normal distribution (for confidence intervals) | math-php `NormalDistribution` | Built-in |
| Binomial distribution (for exact probability tests) | math-php `BinomialDistribution` | Built-in |
| Descriptive statistics | math-php `Descriptive` | Built-in |
| ANOVA (category comparison) | math-php `ANOVA` | Built-in |
| Linear regression (trend analysis) | math-php `LinearRegression` | Built-in |
| Brier score computation | Custom code (trivial: `mean((predicted - observed)^2)`) | ~5 lines |
| Brier score decomposition | Custom code using math-php `Descriptive` | ~30 lines |
| Walk-forward backtest engine | Custom Artisan command | ~200 lines |
| Calibration curve generation | Custom code in the backtest command | ~50 lines |

**No other PHP libraries are needed.** The statistical requirements of this project are well within math-php's capabilities. Adding more libraries increases the dependency surface area without meaningful benefit.

## What NOT to Use

| Technology | Why Not |
|------------|---------|
| Full IRT packages (R `ltm`, Python `py-irt`) | Would require rewriting the custom calibration algorithm. The algorithm works — validate it, don't replace it. |
| Machine learning frameworks (TensorFlow PHP, PHP-ML) | Overkill for what is fundamentally a statistical estimation problem. PHP-ML has limited statistical tools and poorer documentation than math-php. |
| External statistical services / APIs | Add a network dependency, latency, and deployment complexity for computations that take microseconds locally. |
| Custom C extensions for math | math-php is pure PHP with no dependencies. Performance is adequate for the data volumes in K12 education (thousands to tens of thousands of answer records, not millions). |
| `catsim` or any CAT library | Wrong phase. The system needs validated calibration before adaptive testing makes sense. |
| `phpscience/statistics` | Stale, fewer features, less maintained than math-php. |
| Any Python/R bridge | Adds deployment complexity. The math is simple enough for PHP. |

## Version Pinning Strategy

```
composer require markrogoyski/math-php:^2.13
```

Use the caret (`^`) version constraint. math-php follows semantic versioning — `^2.13` allows `2.13.x` and `2.14+` but not `3.0`. The library has been stable across 2.x, with no breaking changes in minor versions.

## Migration Path

1. **Phase 1 (Validation)**: Install math-php. Build the backtest command. Validate calibration accuracy. No changes to the production code path.
2. **Phase 2 (Integration)**: Wire validated calibration into exam assembly via the existing `QuestionDifficultyResolver`. Still using math-php only for monitoring/reporting.
3. **Phase 3 (Intelligence)**: Add mastery-based difficulty recommendation. Potentially introduce simplified adaptive strategies. math-php supports the statistical needs throughout.
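The Brier score decomposition estimated at ~30 lines in the Q4 table can be sketched as follows (Python for illustration only; the binned Murphy-style decomposition is exact when forecasts are constant within a bin and approximate otherwise):

```python
from collections import defaultdict

def brier_decomposition(predicted, observed, bins=10):
    """Binned Murphy decomposition: Brier = uncertainty - resolution + reliability.

    Predictions are grouped into equal-width probability bins; o_bar is the
    overall observed error rate, o_k the per-bin rate, n_k the bin size.
    """
    n = len(predicted)
    o_bar = sum(observed) / n
    groups = defaultdict(list)
    for p, o in zip(predicted, observed):
        k = min(int(p * bins), bins - 1)  # clamp p = 1.0 into the top bin
        groups[k].append((p, o))
    reliability = resolution = 0.0
    for members in groups.values():
        n_k = len(members)
        p_k = sum(p for p, _ in members) / n_k   # mean forecast in the bin
        o_k = sum(o for _, o in members) / n_k   # observed error rate in the bin
        reliability += n_k * (p_k - o_k) ** 2    # lower = better calibrated
        resolution += n_k * (o_k - o_bar) ** 2   # higher = more discriminating
    uncertainty = o_bar * (1.0 - o_bar)
    return uncertainty, resolution / n, reliability / n

# Tiny hypothetical example: two well-separated forecast groups.
u, res, rel = brier_decomposition([0.2, 0.2, 0.8, 0.8], [0, 0, 1, 1])
print(f"uncertainty={u:.3f} resolution={res:.3f} reliability={rel:.3f}")
```

The same binning drives the calibration-curve output, so the backtest command can produce both from one pass over the holdout data.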
## Sources

- markrogoyski/math-php v2.13.0 on Packagist: https://packagist.org/packages/markrogoyski/math-php — **HIGH confidence**, verified directly
- IRT theory: Wikipedia, "Item Response Theory" — **HIGH confidence**, well-established psychometric theory
- CAT algorithms: Wikipedia, "Computerized Adaptive Testing" — **HIGH confidence**, standard reference
- Brier score: Wikipedia, "Brier Score" — **HIGH confidence**, proper scoring rule definition and decomposition
- Existing codebase analysis: `QuestionDifficultyCalibrationService.php` (973 lines), `QuestionDifficultyCalibrationAnalyzer.php` (608 lines), `IntelligentExamController.php` (1267 lines), `DifficultyDistributionService.php` (219 lines), `QuestionDifficultyResolver.php` (88 lines) — **HIGH confidence**, direct code inspection