Validated: 4–7 Fewer Questions Across Hundreds of Students and 3 Independent Datasets

Peer-reviewed benchmark results demonstrating statistically significant improvement over every baseline method tested.

Challenge

Adaptive tests ask too many questions

Traditional adaptive testing selects items one at a time, optimizing each choice in isolation. In assessments that measure multiple skills, this produces redundant questions.

The result is longer tests, higher costs, and worse test-taker experience.

Solution

Adaptive precision selection across all candidates

QLM considers all candidate items simultaneously instead of picking one at a time. The engine finds the best set of items to ask next.

  • Items selected for complementary coverage across all dimensions
  • Redundant items automatically avoided
  • Validated across multiple independent datasets and production infrastructure
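The idea of selecting a complementary set rather than one item at a time can be sketched as a small greedy set-selection routine. This is an illustrative assumption, not QLM's published algorithm: the per-skill "information" vectors, the `select_batch` function, and the coverage scoring are all hypothetical.

```python
# Minimal sketch of set-based item selection (illustrative only).
# Assumes each candidate item carries a per-skill information vector;
# the real QLM engine's scoring function is not public.

def select_batch(items, num_skills, batch_size):
    """Greedily pick a batch whose combined skill coverage is complementary.

    items: dict mapping item id -> list of per-skill information values.
    Returns the chosen item ids in selection order.
    """
    covered = [0.0] * num_skills  # information accumulated per skill
    chosen = []
    pool = dict(items)
    for _ in range(min(batch_size, len(pool))):
        # Marginal gain: information an item adds on still-uncertain skills.
        def gain(info):
            return sum(max(v - covered[s], 0.0) for s, v in enumerate(info))
        best = max(pool, key=lambda i: gain(pool[i]))
        if gain(pool[best]) == 0.0:
            break  # nothing left adds coverage; redundant items are skipped
        for s, v in enumerate(pool[best]):
            covered[s] = max(covered[s], v)
        chosen.append(best)
        del pool[best]
    return chosen

items = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.8, 0.2, 0.0],  # largely redundant with q1
    "q3": [0.0, 0.7, 0.1],
    "q4": [0.1, 0.0, 0.8],
}
print(select_batch(items, num_skills=3, batch_size=3))  # → ['q1', 'q4', 'q3']
```

Note how q2 is never chosen: it covers nearly the same skills as q1, so its marginal gain is low once q1 is in the batch. That is the redundancy-avoidance behavior the bullets above describe.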

Results

4–7 fewer items to completion across all datasets

Dataset    Students  Skills  QLM Items  Best Baseline  Delta  p-value
Dataset A  100+      100+    ~16        ~23            -7     < 0.001
Dataset B  100+      40+     ~12        ~16            -4     < 0.001
Dataset C  100+      100+    ~14        ~21            -7     < 0.001

All comparisons are statistically significant (p < 0.01).
Mean Items to Completion by Method

Method      Dataset A  Dataset B  Dataset C
QLM         ~16        ~12        ~14
Baseline 1  ~23        ~16        ~22
Baseline 2  ~23        ~16        ~22
Baseline 3  ~24        ~17        ~21
Random      ~31        ~25        ~30
Methodology

Pre-registered, reproducible protocol

  • Extensive randomized trials for every dataset and every method
  • Hundreds of students across 3 independent datasets
  • Hundreds of skills assessed across all datasets
  • Rigorous statistical testing (p < 0.01) across all comparisons
  • Completion threshold applied consistently on all assessed dimensions
  • Pre-registered protocol: all analysis decisions made before data collection
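The consistent completion threshold in the bullets above amounts to a stopping rule: the test ends only when the precision target is met on every assessed dimension. A minimal sketch, assuming a per-skill standard-error criterion (the threshold value and function names here are hypothetical, not taken from the protocol):

```python
# Illustrative stopping rule: completion requires the precision target
# on *every* assessed skill, applied identically across all methods.

SE_THRESHOLD = 0.3  # hypothetical precision target per skill


def assessment_complete(standard_errors):
    """True once every skill's standard error is at or below the target."""
    return all(se <= SE_THRESHOLD for se in standard_errors)


print(assessment_complete([0.25, 0.28, 0.31]))  # False: one skill still imprecise
print(assessment_complete([0.25, 0.28, 0.29]))  # True: all skills at target
```

Applying the same rule to every method keeps the items-to-completion comparison fair: no method can stop early on a laxer criterion.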

Component Analysis

To verify that the improvement comes from considering all items simultaneously, we ran a component analysis comparing simultaneous selection against one-at-a-time selection.

Result: reverting from simultaneous selection to one-at-a-time selection increases items-to-completion by 7% on average across all datasets, confirming that simultaneous selection is the differentiator.

Production Validation

The engine was validated on production infrastructure:

  • Items evaluated per selection: the full candidate pool (up to 30 items)
  • Latency: ~42 ms median response time
  • Results are consistent and reproducible across all test instances

Production deployment is optimized for latency and availability.

See your improvement. Upload your item bank.

Run a free proof-of-value simulation on your own data in under 60 seconds.

Start Free POV