We measure synthesis quality. Here are the numbers.

Every read PokerReads writes goes through a deterministic poker-correctness gate, a paired cross-model LLM judge, and a held-out predictive lane that scores how well each read compresses behavior into action prediction. We publish what comes out the other side. Below is the live scorecard from our most recent arena run. Numbers update as challenger prompts iterate; the methodology is auditable.

Last arena run: 2026-04-29 19:45 ET · methodology v3 + self-check
Corpus size
75
Fixtures across six lanes: David's lived reads, Hand2Note stat-precision failures, online-club screenshot scanner failures, PLO + mixed-game rules, small-sample overclaiming, chat-import contamination. Floor target met.
Cross-model judge — challenger
23.46 / 25
Average across the corpus, scored by an independent gpt-4o judge. Up from the current production champion's 17.93 / 25. Win rate: 14/14 on Lane D.
Predictive lane top-3
100%
Held-out action prediction methodology (credit Caitlin Rollins). The challenger's top-three predicted actions include the actually-played action on every Lane D fixture.
Deterministic gate — catastrophic regressions
4 / 14 fixtures (Lane D)
The gate every prompt change has to clear before it touches production. Down from a 10/14 baseline after the self-check retry stage was added. Promotion is blocked while this number is non-zero on critical fixtures. Remaining failures: PLO hand-strength fixture coherence, PLO overbet language leakage, mixed-stakes stat collapse, Brian Rast mixed-game street terminology.
Calibration — Brier score
0.29 vs 0.40 (champion)
Lower is better. Measures how well the read's expressed confidence matches actual outcomes. The challenger's calibration is roughly 28% tighter than the live production prompt's.
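For reference, a minimal sketch of the metric, assuming each read exposes a confidence in [0, 1] for its headline claim and the held-out hands tell us whether the claim held. Names and data below are illustrative, not our pipeline code:

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between expressed confidence and what actually happened.
    0.0 is perfect; always guessing 50/50 earns 0.25."""
    return sum((p - o) ** 2 for p, o in zip(confidences, outcomes)) / len(outcomes)

# Toy example: a calibrated challenger vs an overconfident champion.
challenger = brier_score([0.8, 0.6, 0.9, 0.3], [1, 1, 1, 0])    # 0.075
champion   = brier_score([0.95, 0.9, 0.95, 0.9], [1, 1, 1, 0])  # ~0.206
```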

Every prompt change has to clear all four gates.

A challenger prompt that wins the LLM judge by a wide margin but breaks deterministic poker correctness does not promote. That block has fired twice, and both failures are documented openly in our synthesis-evals folder.
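In pseudo-Python, the promotion rule looks roughly like this. The `GateResult` shape and the thresholds are illustrative stand-ins, not our actual config:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    deterministic_failures: int       # catastrophic regressions on critical fixtures
    judge_margin: float               # challenger minus champion on the /25 rubric
    judge_critical_regressions: int
    predictive_lane_passed: bool      # held-out top-1 / top-3 floors met
    spot_check_frozen: bool           # human/judge divergence freeze in effect

def should_promote(r: GateResult) -> bool:
    return (
        r.deterministic_failures == 0          # Gate 1: hard block, no exceptions
        and r.judge_margin >= 2.0              # Gate 2: win by >= +2.0 / 25 ...
        and r.judge_critical_regressions == 0  # ... with zero critical regressions
        and r.predictive_lane_passed           # Gate 3
        and not r.spot_check_frozen            # Gate 4
    )
```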

Gate 1 · Deterministic

Poker-correctness graders run against the model's output text

  • PLO hand-strength evaluation (sets, trips, overpairs, front-door flush draws)
  • RFI semantics (cold-calls and iso-raises after limps are not RFIs)
  • Mixed-game isolation (no PLO four-card combos described as NLHE two-card)
  • No-showdown discipline (don't invent exact holdings for mucked hands)
  • Range consistency (every "X action = Y range" claim must fit every cited hand)

The self-check retry stage runs these graders against the synthesis OUTPUT, not just against fixtures, and feeds violations back as binding retry constraints. Capped at two retries before falling back to the champion prompt.
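A sketch of that retry loop, assuming a `run_graders(text)` call that returns a list of violation strings. The function names are stand-ins, not our actual API:

```python
MAX_RETRIES = 2

def synthesize_with_self_check(generate, run_graders, champion_fallback):
    constraints: list[str] = []
    for _ in range(MAX_RETRIES + 1):
        read = generate(constraints)        # challenger synthesis, constrained
        violations = run_graders(read)      # deterministic graders on the OUTPUT
        if not violations:
            return read
        constraints.extend(violations)      # violations become binding retry constraints
    return champion_fallback()              # retries exhausted: serve the champion's read
```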

Gate 2 · Cross-model judge

An independent model scores both candidate and champion on the same rubric

25-point rubric: evidence accuracy, citation safety, decision surface preserved, exploit specificity without overclaim, sample-size discipline, voice (does this read like a serious player wrote it). Judge is gpt-4o; challenger and champion are scored blind. Win threshold for promotion: ≥ +2.0 / 25 average with zero critical regressions.
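The blind part matters: the judge never learns which text is the challenger. A rough sketch, with `judge_score` standing in for the gpt-4o rubric call (its signature here is an assumption):

```python
import random

def score_blind(judge_score, challenger: str, champion: str) -> tuple[float, float]:
    pair = [("challenger", challenger), ("champion", champion)]
    random.shuffle(pair)                    # randomize presentation order
    scores = {label: judge_score(text)      # judge sees text only, never labels
              for label, text in pair}
    return scores["challenger"], scores["champion"]

def judge_gate_passes(margins: list[float], critical_regressions: int) -> bool:
    # Promote only on a >= +2.0 / 25 average win with zero critical regressions.
    return sum(margins) / len(margins) >= 2.0 and critical_regressions == 0
```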

Gate 3 · Predictive lane

Reads compete on held-out action prediction

For each fixture, the synthesis is generated from the first n hands. The remaining hands are held out. The challenger's exploit lines are scored on whether the actually-played action falls in the top-1 / top-3 of predicted actions, with calibrated probability scoring.
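Concretely, assuming each read yields a probability distribution over the villain's next action (the action names below are made up for illustration):

```python
def topk_hit(predicted: dict[str, float], actual: str, k: int) -> bool:
    """True if the actually-played action is among the k highest-probability predictions."""
    ranked = sorted(predicted, key=predicted.get, reverse=True)
    return actual in ranked[:k]

pred = {"raise": 0.45, "call": 0.30, "fold": 0.15, "check-raise": 0.10}
topk_hit(pred, "raise", 1)        # True  -> top-1 hit
topk_hit(pred, "fold", 3)         # True  -> top-3 hit
topk_hit(pred, "check-raise", 3)  # False -> miss
```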

Translation: a read that "sounds smart" but fails to predict what the villain actually does will not promote.

Gate 4 · Spot-check

Human review of random outputs against the product-quality rubric

Four dimensions: decision surface preserved, serious-poker-notebook feel, actionable exploit without overclaim, balance between user-supplied intuition and hand-history evidence. If the human spot-check diverges from the LLM judge by ≥1 point on any single dimension across multiple samples, the challenger pipeline freezes until we recalibrate the judge prompt.
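The freeze rule, sketched. The dimension keys mirror the rubric above; the score shapes and names are assumptions for illustration:

```python
DIMENSIONS = ["decision_surface", "notebook_feel", "exploit_no_overclaim", "balance"]

def should_freeze(human_scores: list[dict], judge_scores: list[dict],
                  min_samples: int = 2) -> bool:
    """Freeze the challenger pipeline if human and LLM-judge scores diverge
    by >= 1 point on the same dimension across multiple samples."""
    for dim in DIMENSIONS:
        diverging = sum(1 for h, j in zip(human_scores, judge_scores)
                        if abs(h[dim] - j[dim]) >= 1)
        if diverging >= min_samples:
            return True
    return False
```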

Honest framing

What this scorecard proves — and what it doesn't.

What it proves

  • We measure poker correctness, not just fluency. The deterministic graders are publicly defined and the test fixtures are tracked in source control.
  • We block prompt changes from production when they regress on poker correctness, even if they win on subjective quality. That block has fired twice.
  • Calibration is tighter than the previous best by a meaningful margin. The numbers above are real outputs from the same arena that Codex runs nightly.
  • We publish failure modes alongside successes. The 4/14 catastrophic-regression number isn't hidden — it's what's blocking the next promotion.

What it doesn't prove

  • That every read in your account is correct. Sample size matters. A read on 3 hands carries different confidence than a read on 30. We label that confidence on every read; the calibration score above is how well we do it.
  • That the synthesis layer replaces a tracker, a solver, or your own judgment. See where each tool wins →
  • That the methodology is the only correct way to measure synthesis quality. It's the most honest framework we've found. We rev it openly when we find a better one.

What changes when

The arena runs nightly in CI. The numbers above update when a new challenger prompt is iterated, evaluated, and either promoted or rejected. Failed challenger results are documented in our public synthesis-evals folder alongside successes.

Two prior prompt patches failed the deterministic gate even though they won on both the judge and the predictive lane. They did not ship. Read the failure docs →

Numbers without product are just spreadsheets. See a real read on a synthetic opponent — no signup required.

See a real read → Built for pros →