Methodology

The measurement science behind Percentile.

Test prep is full of unverifiable claims. This page is our answer: the models we use, why we chose them, how our item bank is built and audited, and how we measure whether any of it works. Written for the technically curious; sources and assumptions stated.

1. Ability estimation: IRT with a 3PL model

Percentile models student ability with item response theory (IRT), the same psychometric framework used by major standardized testing programs. Each question in the bank carries a three-parameter logistic (3PL) characterization: a discrimination parameter (how sharply the item separates students near its difficulty), a difficulty parameter (where on the ability scale it provides the most information), and a guessing parameter (the floor probability of answering correctly by chance on a four-option multiple-choice item).
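
As a minimal sketch of the item characteristic curve described above (not Percentile's production code), the 3PL response probability can be written in a few lines. The function name and the example parameter values are illustrative assumptions, and the optional scaling constant D = 1.7 that some formulations include is omitted here.

```python
import math

def p_correct_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Probability of a correct response under the 3PL model.

    theta: student ability; a: discrimination; b: difficulty;
    c: guessing floor (about 0.25 for a four-option item).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

# When ability equals difficulty, the curve sits halfway between the
# guessing floor and 1: 0.25 + 0.75 / 2.
print(p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.25))  # 0.625
```

Far below the item's difficulty the probability approaches c rather than zero, which is what makes the guessing parameter matter for multiple-choice items.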

After each response we update the student's ability estimate per content domain using expected a posteriori (EAP) scoring: a posterior mean over a quadrature grid, with a standard normal prior for new users that is quickly dominated by response data. EAP is well-behaved with short response strings, which matters because we re-estimate continuously during study, not just after full tests. The posterior standard deviation is retained and propagated; every predicted score we display carries its uncertainty rather than hiding it.
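
The EAP update described above can be sketched as follows, assuming an evenly spaced grid on [-4, 4] and 3PL items. The grid bounds, point count, and function names are illustrative assumptions, not the production configuration.

```python
import math

def p_correct_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def eap_estimate(responses, grid_min=-4.0, grid_max=4.0, n_points=81):
    """Posterior mean and SD of ability under a standard normal prior.

    responses: list of ((a, b, c), correct) pairs, where correct is a bool.
    """
    step = (grid_max - grid_min) / (n_points - 1)
    grid = [grid_min + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        # Log prior (standard normal, up to a constant) plus log likelihood.
        log_w = -0.5 * theta * theta
        for (a, b, c), correct in responses:
            p = p_correct_3pl(theta, a, b, c)
            log_w += math.log(p if correct else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)

# Two correct answers and one miss on illustrative items.
items = [((1.0, -0.5, 0.25), True), ((1.3, 0.2, 0.25), True), ((0.9, 1.0, 0.25), False)]
theta_hat, theta_sd = eap_estimate(items)
```

With an empty response list the function returns the prior itself (mean near 0, SD near 1), which is why EAP always yields a finite estimate, even after a single response.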

Ability estimates are mapped to the 400 to 1600 scale through section curves modeled on published digital SAT conversions, including the module-2 routing behavior of the adaptive test. These curves are estimates and are labeled as such in the product; they are tightened over time by equating against verified official score reports (see section 6).

2. Blueprint-weighted adaptive selection

A pure information-maximizing item selector tends to fixate on whatever a student is currently bad at and lets every other domain go unpracticed. The real exam, however, samples content domains in roughly fixed proportions, published in the test specifications. Percentile's selector therefore optimizes two objectives jointly: choose items near the student's current ability frontier (maximizing measurement information and learning efficiency), while keeping each session's domain mix anchored to the official test blueprint, with deliberate overweighting of domains where the expected score gain per minute of study is highest.

Expected gain is computed from three quantities per skill: the blueprint weight (how much the domain matters on the real test), the mastery gap (distance between current and target mastery), and item-level learning value estimated from historical response patterns. The result is a session that feels like focused tutoring but never silently abandons low-frequency domains, a failure mode common in naive adaptive systems.
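
To make the trade-off concrete, here is a hedged sketch of one way the three quantities could be combined. The text above does not specify the formula, so the product form, the `anchor` blend, the domain names, and all numbers below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    blueprint_weight: float  # share of the real test, from the published blueprint
    current_mastery: float   # 0..1
    target_mastery: float    # 0..1
    learning_value: float    # hypothetical per-item learning value, > 0

def expected_gain(skill: Skill) -> float:
    """Assumed form: blueprint weight x mastery gap x learning value."""
    gap = max(0.0, skill.target_mastery - skill.current_mastery)
    return skill.blueprint_weight * gap * skill.learning_value

def session_mix(skills: list[Skill], anchor: float = 0.5) -> dict[str, float]:
    """Blend the blueprint mix with a gain-proportional mix.

    `anchor` is the fraction of the session pinned to the blueprint, so no
    domain's share can fall below anchor * its blueprint share.
    """
    total_weight = sum(s.blueprint_weight for s in skills)
    gains = {s.name: expected_gain(s) for s in skills}
    total_gain = sum(gains.values())
    mix = {}
    for s in skills:
        blueprint_share = s.blueprint_weight / total_weight
        gain_share = gains[s.name] / total_gain if total_gain > 0 else blueprint_share
        mix[s.name] = anchor * blueprint_share + (1.0 - anchor) * gain_share
    return mix

skills = [
    Skill("domain_a", 0.35, 0.80, 0.90, 1.0),
    Skill("domain_b", 0.35, 0.50, 0.90, 1.0),
    Skill("domain_c", 0.15, 0.90, 0.90, 1.0),  # already at target
    Skill("domain_d", 0.15, 0.40, 0.90, 1.0),
]
mix = session_mix(skills)
```

In this sketch `domain_c` has no remaining mastery gap yet still receives a nonzero share of the session, which is the "never silently abandons" property in miniature.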

3. Retention scheduling: FSRS-6

Learning a concept once is cheap; still knowing it on test day is the hard part. Percentile schedules review using FSRS-6 (Free Spaced Repetition Scheduler), the current generation of the open-source scheduler that has consistently outperformed classic SM-2-family algorithms in published benchmarks on hundreds of millions of real reviews.

FSRS models each concept-student pair with three state variables: difficulty, stability (how long the memory lasts), and retrievability (the current probability of successful recall). After every review the state updates from the observed outcome, and the next review is scheduled to arrive when retrievability decays to a target threshold. In practice this means easy concepts recede to monthly check-ins while fragile ones return within days, and total review load stays bounded as the studied set grows. Retrievability also feeds the mastery model below: knowledge you have not refreshed in six weeks is discounted accordingly, rather than counted at face value.
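
The scheduling rule can be illustrated with a power-law forgetting curve of the general shape used by the FSRS family, in which stability is the interval at which retrievability falls to 90%. This is a sketch, not the FSRS-6 reference implementation: the `decay` value is an illustrative assumption rather than a fitted parameter, and the difficulty and stability update rules are omitted entirely.

```python
def retrievability(days_elapsed: float, stability: float, decay: float = 0.5) -> float:
    """Probability of recall after `days_elapsed` days.

    The factor is chosen so that retrievability equals 0.9 exactly when
    days_elapsed == stability.
    """
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return (1.0 + factor * days_elapsed / stability) ** (-decay)

def next_interval(stability: float, target_retention: float = 0.9, decay: float = 0.5) -> float:
    """Days until retrievability decays to `target_retention` (inverse of the curve)."""
    factor = 0.9 ** (-1.0 / decay) - 1.0
    return stability / factor * (target_retention ** (-1.0 / decay) - 1.0)

# A fragile memory (stability 3 days) versus a durable one (stability 30 days).
fragile_due = next_interval(stability=3.0)
durable_due = next_interval(stability=30.0)
```

At a 90% target the interval equals the stability itself, so a concept with 30-day stability comes back about monthly while one with 3-day stability returns within days. Lowering the target retention lengthens every interval.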

4. Mastery and uncertainty: Beta-Binomial with confidence intervals

For each tested skill we maintain a Beta-Binomial mastery model: a Beta posterior over the probability of answering items of that skill correctly, updated from response outcomes that are weighted by recency, item difficulty, and FSRS retrievability. A correct answer on a hard item moves the posterior more than one on an easy item; stale evidence decays.
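
A minimal sketch of a weighted Beta update follows. How recency, item difficulty, and retrievability combine into a single evidence weight is not specified above, so the weights and the `decay` rule here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class BetaMastery:
    alpha: float = 1.0  # prior pseudo-count of correct answers (uniform prior)
    beta: float = 1.0   # prior pseudo-count of incorrect answers

    def update(self, correct: bool, weight: float = 1.0) -> None:
        """Add one response as `weight` pseudo-observations."""
        if correct:
            self.alpha += weight
        else:
            self.beta += weight

    def decay(self, keep: float) -> None:
        """Shrink accumulated evidence toward the uniform prior (keep in 0..1)."""
        self.alpha = 1.0 + keep * (self.alpha - 1.0)
        self.beta = 1.0 + keep * (self.beta - 1.0)

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

    @property
    def variance(self) -> float:
        n = self.alpha + self.beta
        return self.alpha * self.beta / (n * n * (n + 1.0))

# A correct answer on a hard item (assumed weight 1.5) versus an easy one (0.5).
hard, easy = BetaMastery(), BetaMastery()
hard.update(correct=True, weight=1.5)
easy.update(correct=True, weight=0.5)
```

Here `hard.mean` is 2.5 / 3.5 and `easy.mean` is 1.5 / 2.5, so the same outcome moves the posterior further when the item carries more evidential weight.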

The Beta posterior gives us an uncertainty estimate directly, with no extra modeling. Skill masteries roll up through blueprint weights into section and total score predictions, and the posterior variance propagates into a credible interval around the predicted score plus a probability of reaching the student's target. We report the interval, not just the point estimate, because a 1380 ± 60 and a 1380 ± 15 call for very different decisions about test dates and retakes. Students also rate answer confidence periodically; comparing stated confidence against observed accuracy yields a per-student calibration curve, which we use to flag overconfident skills for extra verification reviews.
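
The roll-up and the target probability can be sketched under two loudly stated assumptions: skill posteriors are treated as independent, and the predicted score is approximated as normal. The function names are hypothetical, and the example reads the "± 60" and "± 15" above as standard deviations purely for illustration.

```python
import math
from statistics import NormalDist

def rollup(skills):
    """Blueprint-weighted mean and SD of mastery.

    skills: iterable of (blueprint_weight, mastery_mean, mastery_variance).
    Assumes independent skills, so variances add with squared weights.
    """
    skills = list(skills)
    total = sum(w for w, _, _ in skills)
    mean = sum(w / total * m for w, m, _ in skills)
    var = sum((w / total) ** 2 * v for w, _, v in skills)
    return mean, math.sqrt(var)

def interval_and_target_probability(mean, sd, target, level=0.90):
    """Central interval and P(score >= target) under a normal approximation."""
    dist = NormalDist(mean, sd)
    tail = (1.0 - level) / 2.0
    interval = (dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail))
    return interval, 1.0 - dist.cdf(target)

# Same point estimate, different uncertainty, same 1400 target.
wide_interval, wide_prob = interval_and_target_probability(1380, 60, target=1400)
narrow_interval, narrow_prob = interval_and_target_probability(1380, 15, target=1400)
```

Under these assumptions the wider prediction leaves a materially higher chance of reaching 1400 than the narrow one, even though both share the same point estimate. That difference is what a bare "1380" would hide.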

5. The item bank: AI-generated, adversarially verified, bias-audited

Our items are original works authored with frontier language models under a clean-room protocol: authors work from the public test specifications and our own content taxonomy, never from real exam items or third-party prep material, and every item carries primary-source grounding for its factual content. Items are then fact-checked and attacked by a separate, stronger verifier model whose job is adversarial: confirm the keyed answer is uniquely correct, confirm each distractor is genuinely wrong but plausibly chosen, and reject items with ambiguity, cultural-knowledge dependence, or flawed mathematics. Items failing any check are discarded, not patched.

Generated banks have characteristic statistical tells, so we audit and neutralize them explicitly. Correct answers are length-matched to distractors (no “longest answer wins”), answer-key positions are balanced and randomized per administration, distractor rationales are re-keyed so explanation quality cannot leak the answer, and surface phrasing that correlates with correctness is rewritten. A held-out set of complete practice forms is reserved exclusively for scored tests, so practice exposure never contaminates score prediction.

Finally, the bank is a living instrument. Live response data feeds ongoing psychometric recalibration: empirical difficulty and discrimination re-estimated from the response matrix, automatic flagging of items with poor point-biserial correlations or differential performance across student groups, and retirement of items that fail. Students can flag any item, and flagged items are human-reviewed.
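
For readers unfamiliar with the point-biserial statistic mentioned above, here is a minimal sketch of how it is computed for one item. The data are toy values, and any flagging threshold applied to the result would be a separate policy choice not specified here.

```python
import math

def point_biserial(item_correct, total_scores):
    """Correlation between a 0/1 item outcome and a continuous total score.

    Returns None when the statistic is undefined (everyone right, everyone
    wrong, or no variance in the totals).
    """
    n = len(item_correct)
    n1 = sum(item_correct)
    n0 = n - n1
    if n1 == 0 or n0 == 0:
        return None
    mean1 = sum(s for c, s in zip(item_correct, total_scores) if c) / n1
    mean0 = sum(s for c, s in zip(item_correct, total_scores) if not c) / n0
    mean_all = sum(total_scores) / n
    # Population standard deviation of the total scores.
    sd = math.sqrt(sum((s - mean_all) ** 2 for s in total_scores) / n)
    if sd == 0:
        return None
    return (mean1 - mean0) / sd * math.sqrt(n1 * n0 / (n * n))

# Toy data: students who answered the item correctly also scored higher overall.
r = point_biserial([1, 1, 0, 0], [10, 9, 3, 2])
```

A value near zero or below means the item does not separate stronger from weaker students, which is a common reason to review or retire it. In practice the total score is usually computed without the item itself, so the item does not correlate with its own contribution.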

6. The efficacy study: pre-registered, verified, published

Most prep companies advertise score gains measured on their own practice tests, from self-reported surveys, or from cherry-picked testimonials. We consider that practice close to meaningless, so we run a pre-registered efficacy study instead, with the design fixed and time-stamped before data collection.

  • Baseline: the entry diagnostic, taken before any study, plus any prior official score the student reports.
  • Exposure: study hours and activity mix tracked automatically by the platform, not self-reported.
  • Outcome: official College Board score reports, verified from the document rather than self-reported numbers.
  • Population: the primary analysis is completers-only, defined by pre-registered engagement thresholds, and we say so plainly: it measures what the program does for students who actually do the program. Attrition rates and an intention-to-treat sensitivity analysis are reported alongside.
  • Reporting: mean improvement with confidence intervals, the score-gain distribution, and subgroup breakdowns by baseline score, published openly whether the results flatter us or not.
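
As an illustration of the reporting format only, a mean gain with a confidence interval could be computed as below. The input values are made-up placeholders to exercise the function, not study data, and the interval uses a large-sample normal approximation; for a small cohort a t-quantile would give a wider interval.

```python
from statistics import NormalDist, mean, stdev

def mean_gain_with_ci(gains, level=0.95):
    """Mean score gain and a normal-approximation confidence interval."""
    m = mean(gains)
    se = stdev(gains) / len(gains) ** 0.5  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return m, (m - z * se, m + z * se)

# Placeholder inputs only; these are NOT results from any cohort.
placeholder_gains = [10, 20, 30, 40, 50]
avg, (low, high) = mean_gain_with_ci(placeholder_gains)
```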

Until the first cohort's results are published, we make no quantitative improvement claims, and any number you see from us will arrive with its interval attached.

7. Limitations, honestly

Our scaled-score curves are estimates until enough verified official reports accumulate for tight equating. IRT parameters for new items begin at model-estimated priors and converge to stable empirical values only after sufficient live responses. Predicted scores are predictions: they carry intervals because any prediction is off by some amount, in either direction. And no scheduling algorithm can substitute for time on task; the platform is designed to make study hours more efficient, but it cannot make them optional.

Questions, corrections, or methodological critiques are genuinely welcome at hello@percentileprep.com. For the plain-English version of this page, see how Percentile works.

The diagnostic is the first data point.

Take the free diagnostic and see the models above applied to your own responses: a predicted score with an honest interval and a concept-level mastery map.
