We measure synthesis quality. Here are the numbers.
Every read PokerReads writes goes through a deterministic poker-correctness gate, a paired cross-model LLM judge, and a held-out predictive lane that scores how well each read compresses behavior into action prediction. We publish what comes out the other side. Below is the live scorecard from our most recent arena run. Numbers update as challenger prompts iterate; the methodology is auditable.
- LLM judge: gpt-4o. The challenger scores above the current production champion at 17.93/25.
- Win rate on Lane D: 14/14, up from a 10/14 baseline once the self-check retry stage was added.
- Catastrophic regressions: 4/14 critical fixtures. Promotion is blocked while this is non-zero. The remaining failures: PLO hand-strength fixture coherence, PLO overbet language leakage, mixed-stakes stat collapse, and Brian Rast mixed-game street terminology. Every prompt change has to clear all four.
A challenger prompt that wins the LLM judge by a wide margin but breaks deterministic poker correctness does not promote. We've enforced that block twice, and both cases are documented openly in our synthesis-evals folder.
### Poker-correctness graders run against the model's output text
- PLO hand-strength evaluation (sets, trips, overpairs, front-door flush draws)
- RFI semantics (cold-calls and iso-raises after limps are not RFIs)
- Mixed-game isolation (no PLO four-card combos described as NLHE two-card)
- No-showdown discipline (don't invent exact holdings for mucked hands)
- Range consistency (every "X action = Y range" claim must fit every cited hand)
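To make the shape of these graders concrete, here is a minimal sketch of what the RFI-semantics check might look like. The function name, the fixture fields (`hand_id`, `pot_state`), and the text-matching pattern are all hypothetical; the real graders live in the synthesis-evals folder.

```python
import re

def grade_rfi_semantics(read_text: str, hand_actions: list[dict]) -> list[str]:
    """Flag 'RFI' claims in a read when the cited hand was not an
    unopened-pot raise (cold-calls and iso-raises after limps are not RFIs).
    Hypothetical fixture shape: each action dict carries 'hand_id' and
    'pot_state' ('unopened', 'limped', or 'raised')."""
    violations = []
    # Find hand citations that appear in the same sentence as an RFI claim.
    rfi_claims = re.findall(r"hand #(\d+)[^.]*\bRFI\b", read_text)
    for hand_id in rfi_claims:
        for act in hand_actions:
            if str(act["hand_id"]) == hand_id and act["pot_state"] != "unopened":
                violations.append(
                    f"hand #{hand_id}: labeled RFI but pot was {act['pot_state']}"
                )
    return violations
```

A grader like this returns a list of violation strings rather than a pass/fail bit, which is what lets the retry stage feed the specific failures back as constraints.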
The self-check retry stage runs these graders against the synthesis OUTPUT — not just against fixtures — and feeds violations back as binding retry constraints. Capped at two retries before falling back to champion.
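The retry loop described above can be sketched as follows. The callable names and signatures are illustrative, not the production API; only the control flow (grade the output, bind violations into the retry, cap at two retries, fall back to champion) comes from the text.

```python
MAX_RETRIES = 2

def synthesize_with_self_check(fixture, challenger, champion, graders):
    """Run the challenger, grade its OUTPUT text, and feed violations
    back as binding retry constraints. After two failed retries, fall
    back to the champion prompt. All callables are hypothetical."""
    constraints = []
    for attempt in range(1 + MAX_RETRIES):  # initial attempt + 2 retries
        read = challenger(fixture, constraints=constraints)
        violations = [v for g in graders for v in g(read, fixture)]
        if not violations:
            return read, attempt
        constraints.extend(violations)  # binding constraints for the retry
    return champion(fixture, constraints=[]), None  # champion fallback
```

Returning `None` for the attempt index is one way to signal downstream that the challenger never cleared the gate on this fixture.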
### An independent model scores both candidate and champion on the same rubric
25-point rubric: evidence accuracy, citation safety, decision surface preserved, exploit specificity without overclaim, sample-size discipline, voice (does this read like a serious player wrote it). Judge is gpt-4o; challenger and champion are scored blind. Win threshold for promotion: ≥ +2.0 / 25 average with zero critical regressions.
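The promotion rule reduces to a small decision function. This is a sketch under the stated thresholds (≥ +2.0/25 average margin, zero critical regressions); the function and argument names are ours, not the pipeline's.

```python
def promotes(challenger_scores, champion_scores, critical_regressions,
             margin=2.0):
    """Promotion rule as stated: the challenger must beat the champion
    by >= +2.0 / 25 on average across paired, blind-scored fixtures,
    with zero critical regressions on the deterministic lane."""
    if critical_regressions > 0:
        return False  # deterministic gate overrides the judge
    deltas = [c - p for c, p in zip(challenger_scores, champion_scores)]
    return sum(deltas) / len(deltas) >= margin
```

Note the ordering: the critical-regression check runs first, so a judge win can never outvote the deterministic gate.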
### Reads compete on held-out action prediction
For each fixture, the synthesis is generated from the first n hands. The remaining hands are held out. The challenger's exploit lines are scored on whether the actually-played action falls in the top-1 / top-3 of predicted actions, with calibrated probability scoring.
Translation: a read that "sounds smart" but fails to predict what the villain actually does will not promote.
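The predictive lane's scoring can be sketched like this. Top-1/top-3 hit rates come straight from the text; the Brier score is one reasonable choice for the calibrated-probability component, not necessarily the rule the lane actually uses.

```python
def score_predictions(predicted_dists, actual_actions):
    """Score a read's exploit lines against held-out hands.
    predicted_dists: one {action: probability} dict per held-out hand.
    Returns top-1 / top-3 hit rates plus a mean Brier score
    (lower Brier = better calibrated)."""
    top1 = top3 = brier = 0.0
    for dist, actual in zip(predicted_dists, actual_actions):
        ranked = sorted(dist, key=dist.get, reverse=True)
        top1 += ranked[0] == actual
        top3 += actual in ranked[:3]
        # Brier: squared error of each predicted probability vs outcome.
        brier += sum((p - (a == actual)) ** 2 for a, p in dist.items())
    n = len(actual_actions)
    return {"top1": top1 / n, "top3": top3 / n, "brier": brier / n}
```

A confident wrong prediction is punished harder by the Brier term than a hedged one, which is exactly the property that separates "sounds smart" from "predicts the villain."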
### Human review of random outputs against the product-quality rubric
Four dimensions: decision surface preserved, serious-poker-notebook feel, actionable exploit without overclaim, balance between user-supplied intuition and hand-history evidence. If the human spot-check diverges from the LLM judge by ≥1 on any single dimension across multiple samples, the challenger pipeline freezes until we recalibrate the judge prompt.
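The freeze condition is mechanical enough to sketch. The dimension identifiers and the "multiple samples" cutoff (assumed here to mean at least two) are our labels for what the text describes, not the pipeline's actual names.

```python
DIMENSIONS = ["decision_surface", "notebook_feel",
              "exploit_no_overclaim", "intuition_evidence_balance"]

def should_freeze(human_scores, judge_scores, threshold=1, min_samples=2):
    """Freeze the challenger pipeline if the human spot-check diverges
    from the LLM judge by >= 1 on any single dimension across multiple
    samples. Each score is a dict mapping dimension name -> rating."""
    for dim in DIMENSIONS:
        diverging = sum(
            abs(h[dim] - j[dim]) >= threshold
            for h, j in zip(human_scores, judge_scores)
        )
        if diverging >= min_samples:
            return True  # recalibrate the judge prompt before resuming
    return False
```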
## What this scorecard proves, and what it doesn't
### What it proves
- We measure poker correctness, not just fluency. The deterministic graders are publicly defined and the test fixtures are tracked in source control.
- We block prompt changes from production when they regress on poker correctness, even if they win on subjective quality. We've shipped that block twice.
- Calibration is tighter than the previous best by a meaningful margin. The numbers above are real outputs from the same arena that Codex runs nightly.
- We publish failure modes alongside successes. The 4/14 catastrophic-regression number isn't hidden — it's what's blocking the next promotion.
### What it doesn't prove
- That every read in your account is correct. Sample size matters. A read on 3 hands carries different confidence than a read on 30. We label that confidence on every read; the calibration score above is how well we do it.
- That the synthesis layer replaces a tracker, a solver, or your own judgment. See where each tool wins →
- That the methodology is the only correct way to measure synthesis quality. It's the most honest framework we've found. We rev it openly when we find a better one.
## What changes when
The arena runs nightly in CI. The numbers above update when a new challenger prompt is iterated, evaluated, and either promoted or rejected. Failed challenger results are documented in our public synthesis-evals folder alongside successes.
Two prior prompt patches failed the deterministic gate even though they won on judge and predictive lane. They did not ship. Read the failure docs →
Numbers without product are just spreadsheets. See a real read on a synthetic opponent — no signup required.