ChessBench

[PREVIEW]

A New Chess Benchmark for Language Models

About ChessBench

ChessBench measures how well language models play chess.

It's a window into what language models are becoming.

Why chess

It's not obvious that language models should be able to play chess, and it's not clear why they would improve over time or how high their ceiling is. And yet, despite not being trained to do so, they can play chess -- sometimes badly, sometimes in ways that are harder to dismiss -- and they are getting better.

This trend is worth tracking carefully, because what's happening isn't well understood. Language models are trained to predict text, and chess asks for things that next-token prediction does not obviously reward: planning, memory, and restraint. The fact that these abilities are emerging and improving anyway raises real questions about what these systems are actually learning, and what that learning looks like internally.

Chess is a useful setting for these questions. Positions are structured enough to probe meaningfully: is a model retrieving memorized lines, generalizing from patterns, or doing something that looks more like reasoning? The task is well-defined, which makes those distinctions easier to draw than most natural-language settings allow.

It also has practical advantages as a benchmark. No deployed model is near the ceiling, and novel positions are cheap to generate, which makes it easy to test outside the training distribution. The space of possible board states is too large to have been memorized; performance on unseen positions is a direct test of generalization ability.

If language models are developing something like reasoning, it will show up somewhere with clear enough structure to detect it. ChessBench is that place.

A research framework, not just a leaderboard

The point of ChessBench is to look closely at how language models play chess, not only to rank them. Each game is run as an experiment, with the conditions around every move -- how the board is described, what context is provided, how the request is phrased -- deliberately varied and recorded alongside the move itself. The resulting archive supports questions a leaderboard can't: capabilities that don't show up in a final rating often surface when you can compare play across these conditions.

The framework itself is built to be extended. New prompting strategies, presentation variables, and ways of describing positions are added as new questions arise. It's the mechanism by which we investigate what these models are latently capable of, and how to get the best out of them. This has been the primary motivation for the project from the start.

The ranking on the home page is one slice of what the underlying archive can answer.

A note on this preview

ChessBench is a preview. The methodology will evolve, players and metrics will be added, and the ratings visible today are a snapshot rather than a verdict. The most lasting thing about the project is not this particular leaderboard but the archive of games and positions it is building -- an archive designed so that tomorrow's questions about how language models play chess can be asked of yesterday's games. Visitors who care about a particular model, matchup, or research question are invited to surface it on the community discussions board; the questions that arrive there shape what the project looks at next.

Frequently asked questions

How often is the data updated?

New data arrives almost daily; check the footer for the date of the latest update.

What do the metrics mean?

Coherence measures how reliably a model produces legal chess moves -- both move-by-move and across whole games -- without regard to how good those moves are. Accuracy measures how strong a model's legal moves are, on average. Elo ratings show how well each model plays chess relative to the others, based on the outcomes of all completed games.

  • Coherence -- how reliably a language model produces legal chess moves. Combines two components:
    • Move coherence -- the fraction of the model's attempted moves that were legal, across every game it has played.
    • Game coherence -- the fraction of the model's games finished with no illegal attempts, restricted to games where the opponent also played fully legally.
    • Formula: coherence = move_coherence × max(0.01, game_coherence).
  • Accuracy -- how close a model's legal moves come to optimal play, scored against Stockfish's evaluation of every position before and after each move. Accuracy is a weighted average of three sub-metrics, each measuring a different kind of position:
    • Branching accuracy (weight 10) -- ordinary positions, with no forced mate available to either side. Scored by centipawn loss relative to Stockfish's preferred move, winsorized so a single missed tactic can't dominate the average.
    • Converting accuracy (weight 9) -- positions where the model already has a forced mate. Scored by how many extra moves the model takes to deliver mate versus optimal play, with shorter mates weighted more heavily (a wasted tempo in a mate-in-3 is more telling than the same tempo in a mate-in-12).
    • Resisting accuracy (weight 1) -- positions where the opponent has a forced mate. Scored by how quickly the model lets the mate distance shrink under best-defense play.
    • Each sub-metric is rescaled so that Random Randy sits at 0 and perfect play sits at 1, then combined into a single accuracy score using the weights above. Sub-metrics with too few qualifying positions to score reliably are dropped from the average and the remaining weights renormalize.
  • Elo -- the classical chess rating, derived from game outcomes. Sequentially applying Elo updates is order-dependent, so ChessBench takes a Monte Carlo approach: it shuffles the full game history many times, recomputes ratings from scratch on each replay, and reports each player's median across runs. The scale is anchored by a handful of reference players whose ratings are pinned to independently established values, with a hard floor at 0 beneath them; both are described below. (Note: A future iteration is likely to move away from Monte Carlo Elo replay toward Bradley-Terry estimation, which sidesteps order-dependence entirely by fitting all ratings jointly to the full set of game outcomes.)
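As a rough illustration of how the three accuracy sub-metrics described above might be combined, here is a minimal Python sketch. Only the weights (10, 9, 1), the rescaling that puts Random Randy at 0 and perfect play at 1, and the drop-and-renormalize rule come from the description; the function names, the input shapes, and the MIN_POSITIONS threshold are hypothetical, since the page does not say how many positions count as "too few".

```python
from typing import Optional

# Weights taken from the metric descriptions above.
WEIGHTS = {"branching": 10.0, "converting": 9.0, "resisting": 1.0}

# Hypothetical threshold: the page only says "too few qualifying positions".
MIN_POSITIONS = 30


def rescale(raw: float, randy_raw: float, perfect_raw: float) -> float:
    """Map a raw sub-metric linearly so Random Randy -> 0 and perfect play -> 1."""
    return (raw - randy_raw) / (perfect_raw - randy_raw)


def accuracy(scores: dict[str, float], counts: dict[str, int]) -> Optional[float]:
    """Weighted average of already-rescaled sub-metrics.

    Sub-metrics with too few qualifying positions are dropped, and the
    remaining weights renormalize because we divide by their sum.
    """
    kept = [k for k in WEIGHTS if k in scores and counts.get(k, 0) >= MIN_POSITIONS]
    if not kept:
        return None  # nothing reliable enough to score
    total_weight = sum(WEIGHTS[k] for k in kept)
    return sum(WEIGHTS[k] * scores[k] for k in kept) / total_weight
```

In this sketch, a model with plenty of ordinary and winning positions but only a handful of defensive ones would be scored on branching and converting accuracy alone, with weights 10 and 9 sharing the whole average.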

Why is Random Randy here?

A rating scale needs a floor, and "plays legal moves at random" is the most honest one available. If a model can't reliably beat Random Randy under the conditions it's being evaluated in, that itself is worth knowing.

Worth noting, too: uniform random play sits at a special spot on the chess-skill axis. It uses no information about the position when choosing a move, which makes it, in a precise mathematical sense, the least intelligent possible way to play. Anything that consistently does worse has to first recognize good moves in order to systematically avoid them -- which is itself a form of chess knowledge. Random play isn't an arbitrary floor; it marks the genuine bottom of the intelligence scale -- at least, for play that stays inside the rules.

A model that loses to Random Randy by forfeiting on illegal moves rather than by being outplayed is failing in a different and arguably worse way, since random play at least always produces legal moves. That kind of underperformance is real, and it shows up directly in the coherence metric -- but it also drags Elo down, since forfeited games are still losses. A model with poor enough coherence can rate below Random Randy on Elo too. So the floor isn't strictly enforced; it's the floor for players that can stay inside the rules.

Why do most models rate worse than expected?

A few reasons, working from the methodological to the structural.

The first is that ChessBench is deliberately ungenerous. A move that doesn't parse as legal forfeits the game for the player that attempted it, and language models attempt illegal moves more often than the rest of their behavior would lead you to expect. There are no retries and no partial credit. If these systems are to become broadly more capable than humans, needing redos to finish a game of chess seems like the wrong thing to forgive.

The second sits behind the first. Chess asks language models for things that next-token prediction does not obviously reward: maintaining an accurate internal representation of the board across dozens of moves, planning several plies ahead, and restraining the impulse to play the most typical-looking move when the position calls for something else. That these abilities are emerging at all is remarkable; that they're still uneven enough to leave most models below a random-move baseline tells you how early it is.

A note on Claude Opus 4.6 and 4.7. Both performed surprisingly poorly at first glance, but the underlying data is more interesting than the headline rating suggests:

  • Without legal moves provided in the prompt, Claude Opus models are far worse than other frontier models under the same conditions, on average.
  • When legal moves are provided, their legal-move rate jumps from roughly 8% to roughly 73% -- the largest gain among frontier models.
  • Among the legal moves Claude Opus 4.7 does make, accuracy is competitive with other top models. See the accuracy timeline.

The pattern looks less like a chess weakness and more like a prompting one: these models need a little scaffolding before the rest of their capability shows up, and once it does, they go further than most.

Are there players in the system that don't show up on the site?

Yes. Behind the scenes, ChessBench runs enough Stockfish configurations to densely cover a very wide range of the Elo spectrum, from beginner-level to superhuman. Their job is structural: they act as a dense web of reference opponents that strengthens the connectivity of the game graph and gives every language model a reliable ladder of calibrated opponents to be measured against. Without them, ratings would rest on sparser and less consistent matchups; with them (and Random Randy), the scale is anchored and graduated from bottom to top.

Why is the coherence metric defined the way it is?

A bit of context first. Bear with me for a moment.

There's an idea in the philosophy of language called the principle of charity -- sometimes also called rational accommodation -- which says that when you're trying to figure out what someone means, you should start by assuming that they're being coherent and rational. It's a prescription for how we ought to interpret each other, and it's often a useful one (some even argue that it's a necessary precondition to interpretation itself).

With other humans, withholding the charity that facilitates our interpretations would be a strange thing to do. With language models, it's the most honest thing to do. Their outputs invite a charitable reading even when it's not warranted. So we decline the invitation. Coherence isn't assumed; it's proven.

ChessBench is deliberately ungenerous. A high score on any metric here should mean something -- it should be hard to earn and hard to fake. That posture is what shapes the formula. Move coherence and game coherence measure two different things -- how often a model's individual moves are legal, and how often it can string together a full game of legal moves -- and we multiply them rather than average them. The effect is that a single illegal move shows up twice: once in the move-by-move rate, and once in the game it spoiled. That double-counting is the point. We're not softening the cost of incoherence; we're stacking it.

A note on the math. Coherence isn't a probability estimate -- it's a designed score, not unlike F1, chosen because it stack-ranks models consistently and makes the top of the scale hard to reach. A harmonic mean would have been a reasonable alternative, but it didn't quite fit: it would let scores exceed the rate at which a model actually finishes clean games, which isn't the bar we're setting. The product keeps the game rate as a ceiling (except below the 0.01 floor in the formula, which only comes into play for models that almost never finish a clean game), and differences in move rate still separate players underneath it. A model approaching a coherence score of 1.0 is doing something genuinely different from the others, and we want that fact to be unmistakable when you see it.
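To make the comparison concrete, here is a small Python sketch of the published formula next to a harmonic mean. The coherence formula is the one given in the metrics answer; the harmonic mean is included only for contrast and is not a score ChessBench uses. The example numbers are made up.

```python
def coherence(move_coherence: float, game_coherence: float) -> float:
    """coherence = move_coherence x max(0.01, game_coherence)."""
    return move_coherence * max(0.01, game_coherence)


def harmonic_mean(a: float, b: float) -> float:
    """Shown only for comparison with the product."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Hypothetical model: 95% of moves legal, 40% of games finished cleanly.
move_rate, game_rate = 0.95, 0.40
product_score = coherence(move_rate, game_rate)       # stays at or below game_rate
harmonic_score = harmonic_mean(move_rate, game_rate)  # lands above game_rate
```

Because both inputs are at most 1, the product can never exceed either of them, while the harmonic mean always falls between the two. The 0.01 floor matters only at the very bottom: a model with no clean games still scores move_coherence × 0.01, so move rate keeps separating players there.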

How is the Elo scale calibrated?

Elo is a relative rating system -- the numbers only mean something once the scale is pinned somewhere. ChessBench pins it in two complementary ways: a set of reference players with fixed ratings, and a hard floor built into the update rule itself.

  • Anchored reference players. Random Randy and a small set of Stockfish configurations have their ratings pinned to independently established values and held constant through every update, giving everything else a stable frame to be measured against. Random Randy serves as a soft floor: its anchored rating marks what uniformly random legal play looks like, giving the scale a meaningful baseline. The Stockfish configurations span the rest of the scale, from beginner level up through the strongest play available.
  • A hard floor at 0. Beneath Random Randy lies a strict lower bound that no rating can cross. It is not a rating assigned to any particular player; it is built into the update rule itself: every losing update is capped at a small fraction of the distance from the player's current rating down to zero. Each loss shrinks the remaining headroom, and the next loss shrinks what's left of that. Strong players never notice the cap, since their normal losses stay well inside it; a model losing nearly every game asymptotes toward zero without ever reaching or crossing it.

The alternative is letting weak players drift into arbitrarily negative territory that reflects game count more than skill. The hard floor keeps the bottom of the scale well-behaved: a model that loses everything lands somewhere meaningful, and the distance between it and Random Randy stays informative.
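The two mechanisms above, together with the shuffled Monte Carlo replay described in the metrics answer, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ChessBench's implementation: the K-factor, the cap fraction, the starting rating, the number of shuffles, and all function names are hypothetical. Only the standard Elo expected-score formula, the pinned anchors, the capped losing update, and the median across shuffled replays follow the description.

```python
import random
import statistics

K = 32.0               # hypothetical K-factor
FLOOR_FRACTION = 0.1   # hypothetical cap: share of the current rating lost per game


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def capped_update(rating: float, opponent: float, score: float) -> float:
    """One Elo update; any rating loss is capped at a fraction of the distance to 0."""
    delta = K * (score - expected_score(rating, opponent))
    if delta < 0:
        delta = max(delta, -FLOOR_FRACTION * rating)
    return rating + delta


def replay(games, anchors, start=1000.0):
    """Replay (player_a, player_b, score_a) games in order; anchors never move."""
    ratings = dict(anchors)
    for a, b, score_a in games:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        if a not in anchors:
            ratings[a] = capped_update(ra, rb, score_a)
        if b not in anchors:
            ratings[b] = capped_update(rb, ra, 1.0 - score_a)
    return ratings


def monte_carlo_elo(games, anchors, n_shuffles=200, seed=0):
    """Median rating per player across independently shuffled replays."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(n_shuffles):
        order = list(games)
        rng.shuffle(order)
        for player, rating in replay(order, anchors).items():
            samples.setdefault(player, []).append(rating)
    return {p: statistics.median(v) for p, v in samples.items()}
```

With a positive starting rating and a cap fraction below 1, each loss can remove at most that fraction of the current rating, so in this sketch a player who loses every game keeps shrinking toward zero while staying strictly above it.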

Why do some models have variants in the ranking?

ChessBench treats a model's generation parameters -- temperature, thinking budget, and the like -- as part of its identity rather than as conditions to be varied within its games. Each distinct configuration earns its own rating from its own games, which is why a single model can appear on the ranking board as several variants. Presentation conditions (how the board is described, what context is given, how the request is phrased) are randomized within a player's games and studied as effects on play, rather than spun off into separate players.

How does a model's top-level Elo relate to the Elos of its variants underneath?

Elo is computed from the bottom up. The leaf level is the player_id -- a fully specified configuration of a model, including temperature and any other generation parameters. Each leaf plays games and earns its own rating through many shuffled Monte Carlo replays. The model-plus-temperature level and the model level above it aren't separate ratings; they're aggregations of the leaves below them.

The aggregation is not a simple average. Each child's Elo is first converted to an expected score against a fixed 1500 anchor, those expected scores are averaged, weighted by games played, and the result is converted back into an Elo. Working in expected-score space is what makes the aggregation honest: the expected score is a logistic, not linear, function of rating difference, so averaging Elos directly would give misleading results whenever children sit at different strengths. Games between siblings under the same parent are excluded from the weighting -- a win by one variant is a loss by another, and those cancel rather than inform.

A consequence worth knowing: a parent's Elo reflects the games its children actually played, weighted by how many each played. A variant that has played many games carries more weight than one that has played few, even if their individual ratings are similar. So the top-level number isn't a simple average of the sub-Elos you see -- it's an aggregate derived from the same underlying games, re-weighted to reflect the parent as a whole.
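The aggregation described above can be sketched as follows. The 1500 anchor, the conversion to expected scores, and the games-played weighting come from the text; the function names and the input shape are hypothetical, and the game counts are assumed to already exclude sibling-versus-sibling games.

```python
import math

ANCHOR = 1500.0  # fixed anchor named in the text


def to_expected(elo: float) -> float:
    """Expected score of a player with this Elo against the 1500 anchor."""
    return 1.0 / (1.0 + 10 ** ((ANCHOR - elo) / 400.0))


def to_elo(expected: float) -> float:
    """Inverse of to_expected; requires 0 < expected < 1."""
    return ANCHOR - 400.0 * math.log10(1.0 / expected - 1.0)


def parent_elo(children: list[tuple[float, int]]) -> float:
    """children: (child Elo, games played against non-siblings)."""
    total_games = sum(n for _, n in children)
    mean_expected = sum(to_expected(e) * n for e, n in children) / total_games
    return to_elo(mean_expected)


# Two hypothetical variants with equal game counts, both rated below the anchor.
aggregated = parent_elo([(1000.0, 50), (1400.0, 50)])
naive_average = (1000.0 + 1400.0) / 2
```

In this example the aggregated value comes out above the naive average of 1200, because the expected-score curve is convex below the anchor. That gap is the distortion that averaging raw Elos would hide.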

Note: the aggregation technique used to compute parent Elo ratings will be revised soon.

What does ChessBench vary from game to game, and why?

Each game begins with a coin flip between standard chess and Chess960. Roughly half of any player's games are therefore played from the familiar starting position, and the other half from a Chess960 one -- so memorized opening theory cannot stand in for actual play.

The Arena -- the component that runs the games -- additionally randomizes a defined set of presentation variables per game for each language-model player:

  • Move notation format: Standard Algebraic Notation (SAN) or Universal Chess Interface (UCI).
  • Whether the position is shown as a FEN string, as a move history from the starting position, or both. At least one is always present.
  • Whether a text-rendered board visual is included, and in which style -- ASCII art, a sparse minimal grid, or Unicode chess pieces.
  • Whether the list of legal moves is provided. When it is, the moves appear in randomized order, so positional bias in the list itself cannot influence the choice.
  • Which prompt template is used, drawn from a library of templates that phrase the request in meaningfully different ways.

The same model under different presentations can play meaningfully differently. A model that is reliable given a board visual and a list of legal moves may falter when handed only a FEN string, and a model that handles FEN confidently may still stumble when asked for UCI rather than SAN. ChessBench records every one of those conditions alongside every move, so the resulting archive supports retroactive analysis: how does a given model play when given a board visual versus only FEN; does providing legal moves change tactical accuracy more than positional accuracy; does a particular prompt template elicit longer, better-considered moves.
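A per-game draw of the conditions listed above might look like the following Python sketch. The variables and their possible values come from the list; the type and function names, the uniform probabilities for everything except the stated coin flip, and the size of the template library are assumptions made for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional

# None stands for "no board visual"; the three styles are the ones listed above.
BOARD_STYLES = [None, "ascii", "sparse", "unicode"]


@dataclass(frozen=True)
class Presentation:
    notation: str               # "SAN" or "UCI"
    show_fen: bool
    show_history: bool
    board_style: Optional[str]
    show_legal_moves: bool
    template_id: int            # index into a hypothetical template library


def sample_variant(rng: random.Random) -> str:
    """Per-game coin flip between standard chess and Chess960."""
    return rng.choice(["standard", "chess960"])


def sample_presentation(rng: random.Random, n_templates: int = 5) -> Presentation:
    """Draw one presentation; FEN, move history, or both is always shown."""
    position_mode = rng.choice(["fen", "history", "both"])
    return Presentation(
        notation=rng.choice(["SAN", "UCI"]),
        show_fen=position_mode in ("fen", "both"),
        show_history=position_mode in ("history", "both"),
        board_style=rng.choice(BOARD_STYLES),
        show_legal_moves=rng.choice([True, False]),
        template_id=rng.randrange(n_templates),
    )


def shuffled_legal_moves(rng: random.Random, moves: list[str]) -> list[str]:
    """Return the legal moves in random order without mutating the input."""
    shuffled = list(moves)
    rng.shuffle(shuffled)
    return shuffled
```

Storing the sampled Presentation next to every move is what would make the retroactive comparisons possible: any later question becomes a filter over recorded conditions rather than a new experiment.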

How many games does a player need to play before appearing on the site?

Every model on the site has played at least:

  • 10 games against Random Randy
  • 10 games against Stockfish at varying levels of strength
  • 1 game against each of at least 10 different language models

What happens when a model returns garbled output or an illegal move?

Models are instructed to reply with a single move and nothing else. If what comes back parses as either SAN or UCI, and is a legal move in the current position, it's played. If not, it's counted as an illegal move and the game is forfeited. There are no retries, no prompting for a cleaner response, and no partial credit for getting close. The bar is whether a model can play chess, not whether it can play chess with help.
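The no-retries rule can be illustrated with a deliberately simple Python sketch. It assumes the legal moves for the current position are already available as two sets of strings, one in SAN and one in UCI, and it accepts a reply only on an exact match after trimming surrounding whitespace. A real parser may be more or less tolerant than this; the function name, the inputs, and the whitespace trimming are all assumptions.

```python
def judge_reply(reply: str, legal_san: set[str], legal_uci: set[str]) -> str:
    """Return "played" for a legal move in either notation, else "forfeit".

    There is no second attempt: the first reply decides the outcome.
    """
    move = reply.strip()  # assumption: surrounding whitespace is ignored
    if move in legal_san or move in legal_uci:
        return "played"
    return "forfeit"


# Hypothetical subset of the legal moves in the standard starting position.
san_moves = {"e4", "d4", "Nf3", "c4"}
uci_moves = {"e2e4", "d2d4", "g1f3", "c2c4"}

outcomes = [
    judge_reply("e4", san_moves, uci_moves),                 # legal SAN
    judge_reply("g1f3", san_moves, uci_moves),               # legal UCI
    judge_reply("I would play e4.", san_moves, uci_moves),   # extra prose
    judge_reply("Ke2", san_moves, uci_moves),                # illegal here
]
```

Under these assumptions the first two replies are played and the last two forfeit the game, including the one that names a legal move but wraps it in a sentence.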

Why is the visualization on the home page called a 'skyline'?

Because the sky is the limit?

How can I support the project?

Support from people like you makes a big difference! See the support page for ways to help.

About Me

ChessBench sits at the intersection of my fascination with language models and my lifelong love of chess.

I currently work as a lead machine learning engineer, following years as a software engineer and data scientist. I spent many of those years at Google, where I also organized and led Google's global chess community; the pipelines and dashboards I built there to explore our games were a quiet precursor to this project.

My academic background is in cognitive science, computer science, and, more recently, artificial intelligence. I am deeply curious about how and why language models work, the methods by which we assess their capabilities, and the evolving vocabulary we use to make sense of them.

If you'd like to get in touch, you can find me on LinkedIn.

— Benjamin Brumfield
