Psychometric Validation

Methodology, benchmark results, and fairness analysis for the QLM question selection engine.

Measurement Model

How the Engine Scores Items and Learners

The QLM engine uses a multi-parameter model for item scoring. Each item in your bank is characterized by two core properties:

  • Difficulty — How hard the item is, on a continuous scale. Items with higher difficulty require greater ability to answer correctly.
  • Discrimination — How well the item differentiates between learners of different ability levels. High-discrimination items are more informative; low-discrimination items provide less measurement value.
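
For intuition, these two properties map naturally onto a two-parameter logistic (2PL) response function, a standard formulation in item response theory. The sketch below is an illustration under that assumption; it is not QLM's documented parameterization.

    import math

    def p_correct(theta: float, difficulty: float, discrimination: float) -> float:
        # Probability that a learner at ability `theta` answers this item
        # correctly, assuming a 2PL-style model. Higher difficulty shifts
        # the curve right; higher discrimination steepens it.
        return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))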

Learner ability is estimated from response patterns using precision scoring. After each response, the engine updates its estimate of the learner's ability in the relevant domain, along with a standard error that quantifies how much uncertainty remains in that estimate. This allows the engine to select the next item expected to produce the greatest reduction in measurement uncertainty.
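
In computerized adaptive testing, the standard way to realize this is maximum-information selection: evaluate each candidate item's Fisher information at the current ability estimate and pick the largest. Whether QLM uses exactly this criterion is an assumption; the sketch below (reusing p_correct from the sketch above) shows the idea.

    def item_information(theta: float, difficulty: float, discrimination: float) -> float:
        # Fisher information of a 2PL item at ability theta: I = a^2 * p * (1 - p).
        # Information peaks for items whose difficulty is closest to theta.
        p = p_correct(theta, difficulty, discrimination)
        return discrimination ** 2 * p * (1.0 - p)

    def select_next_item(theta_hat: float, item_bank: list) -> dict:
        # Choose the candidate expected to shrink the standard error most.
        return max(
            item_bank,
            key=lambda item: item_information(
                theta_hat, item["difficulty"], item["discrimination"]
            ),
        )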

Measurement Confidence

When Is the Measurement Precise Enough?

The engine tracks a standard error value for each learner's ability estimate. This value represents how much uncertainty remains in the measurement. With each additional response, the standard error decreases — meaning the measurement becomes more precise.

  • Confidence threshold — You set the precision level required for your use case. High-stakes placement exams may require very low standard error; low-stakes formative checks can tolerate more uncertainty.
  • Early stopping — When the standard error drops below your threshold, the engine signals that sufficient confidence has been reached. No more items are needed.
  • Efficient measurement — By selecting the most informative item at each step, the engine reaches confidence with fewer items than sequential or random selection.
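
Putting these pieces together, a session driver might look like the sketch below. The three callbacks are placeholders for your own integration code, not QLM API calls.

    def run_adaptive_session(select_item, present_item, update_estimate,
                             se_threshold: float = 0.30, max_items: int = 50):
        # Administer items until the standard error of the ability estimate
        # drops below the threshold (early stopping) or an item cap is hit.
        theta_hat, se = 0.0, float("inf")
        administered = []
        while se > se_threshold and len(administered) < max_items:
            item = select_item(theta_hat, administered)    # most informative item
            response = present_item(item)                  # 1 = correct, 0 = incorrect
            theta_hat, se = update_estimate(administered, item, response)
            administered.append((item, response))
        return theta_hat, se, administered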

Benchmark Results

We validated the QLM engine against three published educational datasets that are widely used in the research community. In each case, we measured how many items the engine needed to reach the same measurement confidence as a fixed-length baseline.

  • ASSISTments 2009 (346,860 interactions · 4,217 students · 26,688 items): 38% fewer items needed to reach the same measurement confidence as the fixed-length baseline.
  • EdNet KT1 (95M+ interactions · 784,309 students · 13,169 items): 42% fewer items needed. This is the largest publicly available educational dataset, and the item reduction was consistent across all difficulty ranges.
  • Junyi Academy 2018 (16M+ interactions · 247,606 students · 722 exercises): 35% fewer items needed. A multilingual dataset from a Taiwan-based platform; performance was strong on non-English content.

Baseline comparison: fixed-length assessment with random item selection within domain. All measurements at equivalent standard error threshold (SE ≤ 0.30). Full methodology available on request.
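
To make the headline metric concrete: at a matched standard error threshold, the reduction compares the fixed-form length with the average adaptive test length. The counts below are hypothetical, chosen only to illustrate the arithmetic.

    def percent_fewer_items(items_fixed: int, items_adaptive: float) -> float:
        # Reduction in administered items at the same SE threshold.
        return 100.0 * (items_fixed - items_adaptive) / items_fixed

    # Hypothetical counts, not the benchmark's actual values:
    print(percent_fewer_items(50, 31))  # -> 38.0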

Fairness Analysis

Differential Item Functioning (DIF)

QLM performs continuous fairness monitoring using the Mantel-Haenszel procedure for detecting differential item functioning. DIF analysis identifies items that behave differently across demographic groups after controlling for ability level.

ETS Classification    DIF Magnitude     Action
Category A            Negligible DIF    No action required. Item is fair across groups.
Category B            Moderate DIF      Flagged for review. Item may be biased, but the magnitude is small.
Category C            Large DIF         Item is flagged and removed from selection until reviewed by your psychometrics team.

Unlike traditional annual DIF studies, QLM performs this analysis continuously as response data accumulates, so biased items are identified and flagged within weeks rather than at the next annual review cycle.
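
A minimal sketch of the Mantel-Haenszel computation and the ETS classification rule follows. It uses the conventional ETS delta transform (delta = -2.35 ln alpha_MH) and the standard cutoffs of 1.0 and 1.5 on |delta|; production use would also apply the significance tests the full ETS rules require.

    import math

    def mantel_haenszel_delta(strata):
        # Each stratum (one per ability level) is a 2x2 table:
        # (ref_correct, ref_incorrect, focal_correct, focal_incorrect).
        num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
        den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
        alpha_mh = num / den               # common odds ratio across strata
        return -2.35 * math.log(alpha_mh)  # ETS delta scale

    def ets_category(delta_mh: float) -> str:
        # Simplified A/B/C rule on |delta|; significance tests omitted.
        magnitude = abs(delta_mh)
        if magnitude < 1.0:
            return "A"  # negligible: no action required
        if magnitude < 1.5:
            return "B"  # moderate: flag for review
        return "C"      # large: remove from selection pending review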

Standards Alignment

Professional Standards Compliance

The QLM engine is designed to be consistent with the AERA/APA/NCME Standards for Educational and Psychological Testing (2014), particularly its requirements for:

  • Validity — Evidence that the engine's item selections produce scores that accurately reflect the construct being measured. Selection decisions are driven by measurement precision, not convenience or arbitrary ordering.
  • Reliability — Consistent measurement across administrations. Continuous calibration ensures item parameters reflect current population characteristics, not historical averages.
  • Fairness — Continuous DIF monitoring (described above) and item selection that accounts for content balance to prevent construct-irrelevant variance.
  • Documentation — Full audit trail of every selection decision, calibration update, and fairness flag, retrievable via API for your institutional records (see the sketch after this list).
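
As noted in the Documentation point above, audit records are retrievable via API. The endpoint and field names below are hypothetical placeholders; consult the QLM API reference for the actual paths and schema.

    import requests

    def fetch_audit_trail(session_id: str, api_key: str) -> dict:
        # Hypothetical endpoint -- substitute the documented QLM base URL.
        resp = requests.get(
            f"https://api.example.com/v1/sessions/{session_id}/audit",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        resp.raise_for_status()
        # Expected contents: selection decisions, calibration updates,
        # and fairness flags recorded for the session.
        return resp.json()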

Limitations

Known Limitations and Requirements

We believe in transparency about what the engine can and cannot do. These limitations are inherent to the measurement approach and apply to any system using similar methodology.

  • Sample size for stable calibration: Item parameters require a minimum of approximately 200 responses per item to reach stable calibration. Items with fewer responses will have wider confidence intervals on their parameters, which reduces (but does not eliminate) the engine's selection quality.
  • Cold-start for new items: When new items are added to your bank with no prior response data, the engine relies on your provided difficulty estimate (or a default of 0.5) until sufficient responses accumulate. We recommend adding new items in small batches and monitoring their calibration trajectory.
  • Measurement precision at extremes: Learners at the very top or very bottom of the ability distribution are harder to measure precisely, because fewer items exist that provide useful information at those levels. The engine mitigates this by selecting items closest to the learner's estimated ability, but precision at the extremes will always be lower than at the center of the distribution.
  • Construct coverage: The engine optimizes for measurement precision, not content coverage. If your use case requires that specific topics or skills be represented on every test form (e.g., for curricular validity), you must specify content constraints in the API request, as shown in the sketch after this list.
  • Item quality dependency: The engine selects the best available items from your pool. If the underlying item quality is low (poor discrimination, ambiguous stems, incorrect keys), the engine will work with what it has but cannot compensate for fundamental item construction problems.
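
As mentioned in the construct-coverage point above, content constraints are specified in the selection request. The payload below is a hypothetical illustration; the field names are not the documented QLM schema.

    # Hypothetical request body -- field names are illustrative only.
    selection_request = {
        "learner_id": "learner-123",
        "domain": "algebra",
        "se_threshold": 0.30,
        "content_constraints": [
            # Guarantee minimum topic coverage on every generated form.
            {"topic": "linear-equations", "min_items": 3},
            {"topic": "quadratics", "min_items": 2},
        ],
    }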

See It on Your Data

Run a free pilot with your own item bank and measure the improvement firsthand.

Request Sandbox Access