# Architecture Patterns

**Domain:** K12 math difficulty calibration and intelligent exam matching
**Researched:** 2026-04-16
**Confidence:** HIGH (based on direct codebase analysis); MEDIUM (general adaptive testing patterns from training data)

## Current Architecture (As-Built)

```
ANSWER FLOW (already working)
=============================
Student answers exam
        |
        v
ExamAnswerAnalysisService.analyzeExamAnswers()
        |
        +---> MasteryCalculator (knowledge point mastery)
        +---> KnowledgeMasteryService (persist mastery)
        +---> LocalAIAnalysisService (update mastery)
        +---> MistakeBookService (add to mistake book)
        +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
                |
                v
        question_difficulty_calibrations table (upsert)

ASSEMBLY FLOW (partially working)
=================================
POST /api/intelligent-exam
        |
        v
IntelligentExamController.store()
        |
        v
AssembleExamTaskJob (queued)
        |
        v
LearningAnalyticsService.generateIntelligentExam()
        |
        +---> selectQuestions() -- uses raw questions.difficulty
        +---> applyTypeAwareDifficultyDistribution()
        |       |
        |       v
        |   DifficultyDistributionService
        |   (only if enable_difficulty_distribution=true,
        |    which defaults to FALSE)
        |
        v
QuestionDifficultyResolver.applyCalibratedDifficulty()
(exists but is NOT called in the main assembly path)
```

### Identified Gaps in Current Architecture

| Gap | Location | Impact |
|-----|----------|--------|
| Calibration values not used in assembly | `LearningAnalyticsService` selects questions using raw `questions.difficulty` | Assembled exams use uncalibrated difficulty |
| `enable_difficulty_distribution` defaults false | `LearningAnalyticsService` line 1554 | Distribution strategy never activates unless the caller explicitly enables it |
| No auto difficulty_category recommendation | No service maps mastery to category | Teachers must manually pick a tier; no student-level adaptation |
| No backtesting validation | `QuestionDifficultyCalibrationAnalyzer` reports but does not validate | Algorithm accuracy unknown before production use |
| Dual difficulty scale (0-1 vs 0-5) | `normalizeDifficultyValue()` divides by 5 if > 1.0 | Inconsistent source data enters calibration |

## Recommended Target Architecture

### Component Diagram

```
+===================================================================================+
|                               TARGET ARCHITECTURE                                 |
+===================================================================================+

LAYER 1: VALIDATION (must complete before anything else)
+--------------------------------------------------------+
|                                                        |
|  CalibrationBacktestService                            |
|  |-- backtestAgainstHistory(cutoffDate)                |
|  |-- computeBrierScores(questionIds)                   |
|  |-- computePearsonCorrelation()                       |
|  |-- produceValidationReport()                         |
|  +-- PASS/FAIL gate: algo accuracy threshold           |
|                                                        |
|  Data: reads paper_questions + questions (historical)  |
|  Writes: backtest_results table (or JSON export)       |
+--------------------------------------------------------+
        |
        | PASS gate opens production use
        v
LAYER 2: CALIBRATION FEEDBACK LOOP (enhance existing)
+--------------------------------------------------------+
|                                                        |
|  ExamAnswerAnalysisService                             |
|  |-- (existing) analyzeExamAnswers()                   |
|  +-- (existing) recalibrateQuestionDifficulty()        |
|                                                        |
|  QuestionDifficultyCalibrationService                  |
|  |-- updateOnlineFromPaper()  [existing, per-paper]    |
|  |-- recalibrateQuestionIds() [existing, batch]        |
|  +-- getHealthScaleForType()  [existing, monitoring]   |
|                                                        |
|  NEW: CalibrationVerificationGate                      |
|  |-- validateCalibratedRange(questionIds)              |
|  |-- flagOutliers(threshold)                           |
|  +-- quarantineBadCalibrations()                       |
|                                                        |
|  Data: answer --> calibrate --> verify --> use         |
+--------------------------------------------------------+
        |
        v
LAYER 3: ASSEMBLY INTEGRATION (connect calibration to exam)
+--------------------------------------------------------+
|                                                        |
|  DifficultyNormalizationService [NEW]                  |
|  |-- normalize(questionId) -> float [0,1]              |
|  |-- batchNormalize(questionIds) -> map                |
|  +-- resolves 0-1 vs 0-5 ambiguity at read time        |
|                                                        |
|  QuestionDifficultyResolver [existing, expand usage]   |
|  |-- applyCalibratedDifficulty(questions) -> arr       |
|  +-- MUST be called in assembly path                   |
|                                                        |
|  LearningAnalyticsService                              |
|  |-- generateIntelligentExam()                         |
|  |   +-- CALL DifficultyNormalizationService first     |
|  |   +-- CALL QuestionDifficultyResolver second        |
|  |   +-- SET enable_difficulty_distribution = true     |
|  +-- remove hard-coded default false                   |
|                                                        |
|  DifficultyDistributionService [existing]              |
|  |-- calculateDistribution(category, total)            |
|  +-- groupQuestionsByDifficultyRange()                 |
+--------------------------------------------------------+
        |
        v
LAYER 4: ADAPTIVE MATCHING (mastery-based difficulty selection)
+--------------------------------------------------------+
|                                                        |
|  DifficultyCategoryRecommender [NEW]                   |
|  |-- recommendForStudent(studentId, kpCodes) -> cat    |
|  |-- recommendForKnowledgePoint(studentId, kp) -> cat  |
|  +-- uses MasteryCalculator + calibration data         |
|                                                        |
|  MasteryCalculator [existing]                          |
|  |-- calculateMasteryLevel(studentId, kpCode)          |
|  +-- returns mastery [0,1] + confidence + trend        |
|                                                        |
|  Mapping logic:                                        |
|  mastery [0.0, 0.30)  -> category 0 (zero-foundation)  |
|  mastery [0.30, 0.50) -> category 1 (foundation)       |
|  mastery [0.50, 0.70) -> category 2 (intermediate)     |
|  mastery [0.70, 0.85) -> category 3 (advanced)         |
|  mastery [0.85, 1.00) -> category 4 (competition)      |
+--------------------------------------------------------+
        |
        v
LAYER 5: HEALTH MONITORING (continuous)
+--------------------------------------------------------+
|                                                        |
|  CalibrationHealthMonitor [NEW]                        |
|  |-- detectDrift(windowDays) -> drift report           |
|  |-- accuracyTrend(days) -> accuracy over time         |
|  |-- calibrationCoverage() -> % questions calibrated   |
|  +-- scheduled artisan command (daily/weekly)          |
|                                                        |
|  Existing health mechanisms:                           |
|  |-- getHealthScaleForType() in CalibrationService     |
|  +-- recent_events in algorithm_meta (per-question)    |
|                                                        |
|  NEW: calibration_health_snapshots table               |
|  +-- date, total_calibrated, avg_brier,                |
|      coverage_pct, drift_flag, action                  |
+--------------------------------------------------------+
```

### Component Boundaries

| Component | Responsibility | Communicates With | New/Existing |
|-----------|---------------|-------------------|--------------|
| `CalibrationBacktestService` | Validate algorithm accuracy against historical data | Reads `paper_questions`, `questions`, `papers`. Writes report output. | NEW |
| `QuestionDifficultyCalibrationService` | Core calibration algorithm (stratified_residual_eb_v2) | Called by `ExamAnswerAnalysisService`, `CalibrationBacktestService` | EXISTING |
| `CalibrationVerificationGate` | Post-calibration sanity checks (range, outlier detection) | Reads `question_difficulty_calibrations`. Flags problematic entries. | NEW |
| `DifficultyNormalizationService` | Unify 0-1 / 0-5 scale at read boundary | Called by `LearningAnalyticsService` during question loading | NEW |
| `QuestionDifficultyResolver` | Apply calibrated difficulty to question arrays, calibrated-first | Called in assembly path by `LearningAnalyticsService` | EXISTING (needs wiring) |
| `DifficultyDistributionService` | Calculate difficulty buckets per category | Called by `LearningAnalyticsService` when distribution enabled | EXISTING |
| `DifficultyCategoryRecommender` | Map student mastery to recommended difficulty category | Reads `MasteryCalculator`. Used by `IntelligentExamController` | NEW |
| `MasteryCalculator` | Calculate per-knowledge-point mastery levels | Existing, unchanged | EXISTING |
| `CalibrationHealthMonitor` | Detect calibration drift, coverage gaps, accuracy degradation | Reads `question_difficulty_calibrations`. Writes health snapshots. | NEW |
| `LearningAnalyticsService` | Orchestrate question selection and difficulty distribution | Must call normalization + resolver + distribution | EXISTING (needs modification) |

### Data Flow

```
QUESTION DIFFICULTY LIFECYCLE
=============================
questions.difficulty (original, immutable)
        |
        v
DifficultyNormalizationService.normalize()
        | (resolves 0-1 vs 0-5, stores original_difficulty)
        v
question_difficulty_calibrations.original_difficulty
        |
        +---[calibration loop]---> calibrated_difficulty
        |                                  |
        v                                  v
QuestionDifficultyResolver.applyCalibratedDifficulty()
        |
        | Returns: calibrated if exists, else original (normalized)
        v
DifficultyDistributionService.groupQuestionsByDifficultyRange()
        |
        v
Selected questions for exam assembly

ANSWER-TO-CALIBRATION FEEDBACK LOOP
===================================
Student submits exam answers
        |
        v
ExamAnswerAnalysisService.analyzeExamAnswers()
        |
        +---> MasteryCalculator (update knowledge mastery)
        +---> QuestionDifficultyCalibrationService.updateOnlineFromPaper()
                |
                +-- per-question: compute residual, apply shrinkage, clamp
                +-- upsert to question_difficulty_calibrations
                +-- append to recent_events in algorithm_meta
                |
                v
        CalibrationVerificationGate (NEW)
                |
                +-- check calibrated_difficulty in [0.01, 0.99]
                +-- flag if delta > 0.30 from original
                +-- quarantine if Brier score deteriorating
                |
                v
        Health monitor caches invalidated
        (getHealthScaleForType will recompute on next call)

BACKTESTING VALIDATION FLOW
===========================
CalibrationBacktestService.backtestAgainstHistory(cutoffDate)
        |
        v
1. Load all questions with >= N attempts before cutoffDate
2. Split: training set (before cutoff) vs test set (after cutoff)
3. Run calibration on training data only
4. For each question in test set:
   - predicted = calibrated_difficulty from training
   - actual = observed error rate in test period
   - Brier score = (predicted - actual)^2
5. Aggregate metrics:
   - Mean Brier score (lower = better, < 0.15 is acceptable)
   - Pearson correlation (predicted vs actual, > 0.4 is acceptable)
   - Calibration coverage (% questions with enough data)
   - MAE (mean absolute error, < 0.15 is acceptable)
6. PASS gate:
   - Pearson > 0.3 AND Mean Brier < 0.20
   - If FAIL: algorithm needs tuning, do NOT enable in production
        |
        v
Report: JSON/CSV output + PASS/FAIL verdict

MASTERY-TO-DIFFICULTY MATCHING FLOW
===================================
Exam request (student_id + kp_codes)
        |
        v
DifficultyCategoryRecommender.recommendForStudent()
        |
        +-- for each kp_code:
        |     MasteryCalculator.calculateMasteryLevel(studentId, kp)
        |       -> mastery [0,1], confidence, trend
        |
        +-- aggregate mastery across kp_codes (weighted average)
        +-- map to category:
        |     mastery -> category via threshold table
        |     adjust for trend: trending up -> +0.5 category push
        |     floor at 0, cap at 4
        |
        +-- return: recommended category + confidence + reasoning
        |
        v
IntelligentExamController uses recommended category
        |
        v
LearningAnalyticsService with enable_difficulty_distribution=true
```

## Patterns to Follow

### Pattern 1: Gate-Based Progressive Activation

**What:** A calibration value must pass validation before it can influence production behavior. Once validated, components progressively unlock.

**When:** Any system where unvalidated statistical estimates would harm user experience.

**Why this matters here:** The project explicitly requires "validation before production use." The current code already has calibration running, but it is not connected to assembly. This is correct; the backtest gate formalizes the transition.
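The gate's verdict reduces to a few aggregate statistics over (predicted, observed) pairs. A minimal Python sketch of the PASS/FAIL computation — `backtest_verdict` and `pearson` are illustrative names, not the actual service API, and the thresholds are the minimum gate values from the backtesting flow above:

```python
import math

# Minimum gate values from the backtesting flow (not the tighter "acceptable" targets).
PEARSON_MIN = 0.3
BRIER_MAX = 0.20

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def backtest_verdict(predicted, observed):
    """predicted: calibrated difficulties from the training window;
    observed: error rates measured in the test window."""
    n = len(predicted)
    brier = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n
    mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / n
    r = pearson(predicted, observed)
    return {"brier": brier, "mae": mae, "pearson": r,
            "pass": r > PEARSON_MIN and brier < BRIER_MAX}
```

The point is that the gate is deterministic: the same historical split always yields the same verdict, so "enable in production" is a reproducible decision rather than a judgment call.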
```
Gate states:
  LOCKED -- calibration runs, values stored, NOT used in assembly
  TESTED -- backtest passed, enable for shadow mode (log but don't act)
  ACTIVE -- fully enabled in production assembly path
```

**Example implementation:**

```php
// In a config or database table
'calibration_gate' => 'locked', // locked | tested | active

// In LearningAnalyticsService, during assembly:
if (config('calibration_gate') === 'active') {
    $questions = $resolver->applyCalibratedDifficulty($questions);
}
```

### Pattern 2: Difficulty Source Priority Chain

**What:** When multiple difficulty values exist for a question, follow a deterministic priority chain rather than ad-hoc logic.

**When:** Any lookup where calibrated, original, and estimated values coexist.

**Why:** The current `QuestionDifficultyResolver` already implements this pattern correctly (calibrated > original). It just needs to be consistently called.

```php
// Priority chain (already implemented in QuestionDifficultyResolver):
// 1. calibrated_difficulty (from question_difficulty_calibrations)
// 2. normalized questions.difficulty (0-1 scale, divide-by-5 if needed)
// 3. fallback 0.5 (moderate default)
```

### Pattern 3: Shadow Mode Before Activation

**What:** Before enabling calibrated difficulty in actual exam assembly, run both paths in parallel and compare results without affecting output.

**When:** Connecting a validated but previously-disconnected statistical system to production.

**Why:** Even with backtest validation, real-time behavior may differ from the historical backtest. Shadow mode catches integration bugs.
```php
// In LearningAnalyticsService assembly:
$rawQuestions = $selectedQuestions; // current behavior
$calibratedQuestions = $resolver->applyCalibratedDifficulty($rawQuestions);

// Log comparison without using calibrated values yet
Log::info('Shadow mode difficulty comparison', [
    'raw_avg' => collect($rawQuestions)->avg('difficulty'),
    'calibrated_avg' => collect($calibratedQuestions)->avg('difficulty'),
    'diff_count' => count(array_filter($calibratedQuestions, fn($q) =>
        ($q['difficulty_source'] ?? '') === 'calibrated'
    )),
]);

// Use raw (unchanged behavior) until the gate opens
$selectedQuestions = $rawQuestions;
```

### Pattern 4: Stratified Baseline with Residual Adjustment

**What:** Already implemented in the codebase. The calibration algorithm computes expected error rates per (question_type, difficulty_category) stratum, then adjusts based on the residual (observed - expected).

**When:** This is the core calibration algorithm. No changes are needed to the algorithm itself per the project scope.

The existing algorithm is well-structured:

- Global baselines: `buildGlobalBaselines()` computes per-stratum error rates
- Online update: `estimateOnlineBySingleOutcome()` processes one answer event
- Batch update: `estimateByStratifiedResidual()` processes historical data
- Health scaling: `getHealthScaleForType()` auto-reduces step size when accuracy degrades

### Pattern 5: Time-Decay Weighted Statistics

**What:** Weight recent observations more heavily than old ones using exponential decay. Already implemented with a 45-day half-life.

**When:** Any aggregation of student performance or calibration data.

**Why:** K12 students improve; old responses are less predictive. The existing 45-day half-life is reasonable.

## Anti-Patterns to Avoid

### Anti-Pattern 1: Backfilling questions.difficulty

**What:** Writing calibrated values back to `questions.difficulty`.

**Why bad:** Destroys the original reference value, makes debugging impossible, and violates a project constraint.
**Instead:** Keep the dual-table design. `questions.difficulty` is append-only and immutable; `question_difficulty_calibrations` is the mutable overlay.

### Anti-Pattern 2: Global Difficulty Category Override

**What:** Computing one difficulty_category for a student across all knowledge points and applying it everywhere.

**Why bad:** A student may be advanced in algebra but a beginner in geometry. A global category creates mismatched exams.

**Instead:** Per-knowledge-point mastery -> per-knowledge-point difficulty recommendation. Aggregate only when an exam spans multiple knowledge points.

### Anti-Pattern 3: Calibration Without Verification

**What:** Wiring calibration directly into the assembly path without the backtest validation step.

**Why bad:** If the algorithm has systematic bias (e.g., always overestimates difficulty for certain question types), it makes exams worse, not better.

**Instead:** Backtest first. The backtest is a prerequisite gate, not an optional report.

### Anti-Pattern 4: Dual-Scale Leakage

**What:** Mixing 0-5 scale difficulty values with 0-1 scale values in the same computation.

**Why bad:** A 0-5 value of 0.4 (easy) gets treated as 0.4 on the 0-1 scale (hard), producing inverted difficulty estimates.

**Instead:** Normalize at the read boundary. The existing `normalizeDifficultyValue()` in `QuestionDifficultyCalibrationService` handles this for calibration input, but `LearningAnalyticsService` does not normalize when loading questions for assembly. This must be fixed.

### Anti-Pattern 5: Calibration Feedback Loop Without Rate Limiting

**What:** Allowing calibration to update on every single answer without any dampening.

**Why bad:** A single anomalous cohort (e.g., a class that all guesses randomly) can corrupt calibration values.

**Instead:** The existing algorithm handles this well with shrinkage (the prior pulls toward the original value), step limits, minimum sample thresholds, and health scaling. Do not remove these safeguards.
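To make the dampening safeguards concrete, here is a minimal sketch — not the actual `stratified_residual_eb_v2` implementation, just the idea of empirical-Bayes shrinkage plus a step clamp, with illustrative parameter names and values:

```python
def damped_update(original, observed_rate, n_samples,
                  prior_strength=20, max_step=0.05, min_samples=5):
    """Dampened calibration update: shrink the observation toward the
    original (prior) value, then clamp the per-update step.
    All parameter values here are illustrative, not the production tuning."""
    if n_samples < min_samples:
        return original  # minimum sample threshold: not enough evidence yet

    # Shrinkage: weighted average of prior and observation; small samples
    # barely move the estimate, so one anomalous cohort cannot corrupt it.
    shrunk = (prior_strength * original + n_samples * observed_rate) \
             / (prior_strength + n_samples)

    # Step limit: never move more than max_step in a single update.
    step = max(-max_step, min(max_step, shrunk - original))

    # Clamp to the [0.01, 0.99] range used by the verification gate.
    return min(0.99, max(0.01, original + step))
```

For example, fifty answers suggesting an error rate of 0.9 against an original difficulty of 0.5 move the estimate only to 0.55 — the step clamp absorbs the rest, and subsequent cohorts must confirm the shift before it fully lands.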
## Scalability Considerations

| Concern | At 100 questions | At 10K questions | At 100K questions |
|---------|------------------|------------------|-------------------|
| Calibration table size | Negligible | ~10K rows, fast with index on `question_bank_id` | ~100K rows; add composite index on `(calibrated_difficulty, updated_at)` |
| Backtest computation | < 1 second | 5-30 seconds depending on attempt count | Minutes; run as queued job, cache results |
| Per-answer calibration | < 10ms (single upsert) | < 10ms | < 10ms (indexed lookup + single upsert) |
| Health monitoring | Negligible (scans recent rows) | 1-5 seconds (parsing algorithm_meta JSON) | 5-30 seconds; extract health metrics to dedicated columns |
| Mastery-to-category recommendation | < 50ms (1 mastery lookup) | < 50ms | < 100ms (batch mastery lookup for multiple kp_codes) |
| `applyCalibratedDifficulty` batch | < 5ms | < 20ms (WHERE IN) | < 100ms; add chunking for > 1000 IDs |

## Suggested Build Order

```
Phase 1: VALIDATION (must be first -- blocks everything else)
  1.1 CalibrationBacktestService
      - Reads historical data
      - Computes Pearson, Brier, MAE
      - Produces PASS/FAIL report
  DEPENDS ON: existing QuestionDifficultyCalibrationService, existing data
  BLOCKS: Phase 2, 3, 4

Phase 2: DIFFICULTY STANDARDIZATION (no behavioral change)
  2.1 DifficultyNormalizationService
      - Extract and centralize normalizeDifficultyValue() logic
      - Apply at question-loading boundary in LearningAnalyticsService
  DEPENDS ON: nothing new
  BLOCKS: Phase 3 (need consistent scale before using calibration)

Phase 3: ASSEMBLY INTEGRATION (wires calibration into production)
  3.1 Wire QuestionDifficultyResolver into LearningAnalyticsService
      - Call applyCalibratedDifficulty() in the assembly path
      - Enable difficulty_distribution by default
      - Add shadow mode logging first, then activate
  3.2 CalibrationVerificationGate
      - Post-upsert sanity checks
      - Outlier quarantine
  DEPENDS ON: Phase 1 (PASS gate), Phase 2 (consistent scale)
  BLOCKS: Phase 4

Phase 4: ADAPTIVE MATCHING (new feature)
  4.1 DifficultyCategoryRecommender
      - Map mastery -> category
      - Per-kp and aggregate recommendations
  4.2 Wire into IntelligentExamController
      - Auto-fill difficulty_category when not specified
  DEPENDS ON: Phase 3 (calibrated assembly working)

Phase 5: HEALTH MONITORING (ongoing)
  5.1 CalibrationHealthMonitor
      - Scheduled artisan command
      - Drift detection, coverage tracking
      - calibration_health_snapshots table
  5.2 Alert logic
      - Flag when Brier degrades, coverage drops, or drift is detected
  DEPENDS ON: Phase 3 (need production calibration data flowing)
```

### Dependency Graph

```
Phase 1 (Validation) ---------+
                              |
                              v
Phase 2 (Standardization) --> Phase 3 (Assembly Integration)
                                      |
                                      v
                              Phase 4 (Adaptive Matching)
                                      |
                                      v
                              Phase 5 (Health Monitoring)
```

Phase 2 can run in parallel with Phase 1, since it does not depend on validation results; it only centralizes existing normalization logic. Phase 3 requires both a Phase 1 PASS and Phase 2 completion. Phases 4 and 5 are sequential after Phase 3.
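Phase 4.1's core logic is small: the threshold table from Layer 4 plus the trend push from the matching flow. A Python sketch — `recommend_category` is a hypothetical name, and the uniform weighting here is a simplification of the weakness-weighted average described later in the design decisions:

```python
# Category upper bounds (exclusive) from the Layer 4 mapping table.
# 1.01 lets mastery == 1.0 fall into category 4 (competition).
THRESHOLDS = [(0.30, 0), (0.50, 1), (0.70, 2), (0.85, 3), (1.01, 4)]

def recommend_category(masteries, trend="flat"):
    """masteries: dict of kp_code -> mastery score in [0, 1].
    Returns a difficulty category in 0-4."""
    def to_cat(m):
        return next(cat for upper, cat in THRESHOLDS if m < upper)

    # Aggregate across knowledge points (uniform weights for simplicity).
    avg = sum(to_cat(m) for m in masteries.values()) / len(masteries)

    if trend == "up":
        avg += 0.5  # "trending up -> +0.5 category push" from the matching flow

    # Round to the nearest category, floor at 0, cap at 4.
    return max(0, min(4, int(avg + 0.5)))
```

A student at mastery 0.4 on a single knowledge point gets category 1, or category 2 if trending upward; the cap keeps a high-mastery, upward-trending student at category 4.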
## Data Model Additions

### New Table: `calibration_health_snapshots`

```sql
CREATE TABLE calibration_health_snapshots (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    snapshot_date DATE NOT NULL,
    total_questions INT UNSIGNED DEFAULT 0,
    calibrated_count INT UNSIGNED DEFAULT 0,
    coverage_pct DECIMAL(5,2) DEFAULT 0,
    avg_brier_score DECIMAL(8,6) DEFAULT NULL,
    avg_logloss DECIMAL(8,6) DEFAULT NULL,
    pearson_correlation DECIMAL(8,4) DEFAULT NULL,
    mean_abs_residual DECIMAL(8,4) DEFAULT NULL,
    health_scale_avg DECIMAL(5,3) DEFAULT NULL,
    drift_flag TINYINT(1) DEFAULT 0,
    drift_details JSON DEFAULT NULL,
    action VARCHAR(32) DEFAULT 'none',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE KEY idx_snapshot_date (snapshot_date)
);
```

### New Table: `backtest_results`

```sql
CREATE TABLE backtest_results (
    id BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    run_id VARCHAR(64) NOT NULL,
    cutoff_date DATE NOT NULL,
    question_bank_id BIGINT UNSIGNED NOT NULL,
    training_attempts INT UNSIGNED DEFAULT 0,
    test_attempts INT UNSIGNED DEFAULT 0,
    predicted_difficulty DECIMAL(6,4) DEFAULT NULL,
    observed_error_rate DECIMAL(6,4) DEFAULT NULL,
    brier_score DECIMAL(8,6) DEFAULT NULL,
    absolute_error DECIMAL(6,4) DEFAULT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_run_id (run_id),
    INDEX idx_cutoff (cutoff_date)
);
```

No changes are needed to existing tables. The `question_difficulty_calibrations` table schema is sufficient.

## Key Design Decisions

### Decision 1: Calibration is an Overlay, Not a Replacement

The calibrated difficulty is an overlay on top of the original difficulty. `QuestionDifficultyResolver` already implements this correctly: the calibrated value takes priority, and the original value is the fallback. This must remain the design. Never write calibrated values back to `questions.difficulty`.

### Decision 2: Gate-Based Activation, Not Feature Flags

Use a deterministic gate (backtest PASS/FAIL) rather than a manual feature flag to enable calibration in production.
The gate should be an artisan command that sets a config value or database flag after validation passes. This prevents human error from enabling an unvalidated algorithm.

### Decision 3: Per-Knowledge-Point Difficulty Recommendation

When recommending a difficulty_category for a student, compute per-knowledge-point mastery and map each to a category. If an exam covers multiple knowledge points, use the weighted average of their recommended categories, weighted by the student's weakness level (weaker knowledge points get more weight, to avoid overwhelming the student).

### Decision 4: Health Monitoring is Separate from Calibration

The existing `getHealthScaleForType()` already provides inline health adjustment. The new `CalibrationHealthMonitor` serves a different purpose: longitudinal tracking and alerting. It should NOT modify calibration behavior directly; instead, it produces reports that humans review to decide whether algorithm parameters need adjustment.

### Decision 5: Backtesting Uses a Temporal Split, Not a Random Split

When validating the calibration algorithm, split data by time (cutoff date) rather than randomly. This is critical because:

1. The algorithm includes time decay, so temporal ordering matters
2. Random splits would leak future information into training
3. Real deployment processes data chronologically

## Sources

- Direct codebase analysis of all referenced PHP service files (HIGH confidence)
- Existing `question_difficulty_calibrations` migration schema (HIGH confidence)
- Adaptive testing and IRT architecture patterns from training data (MEDIUM confidence -- standard patterns in psychometrics literature)
- Brier score and calibration validation approaches from training data (MEDIUM confidence -- well-established statistical methodology)