CivBench Pits AI Against Civilization VI - The Agent Built a Nuke and Still Lost

I spent a weekend turning Civilization VI's debug port into an MCP server, then watched an AI agent build two nuclear devices and level Toulouse because French tourism had won the culture war before the agent knew what hit it.

France won anyway. Not in the way the agent was trying to stop it. The agent chased a military solution to a cultural problem and lost on a dimension it hadn't even been tracking.

The Wrong Benchmark

A year ago I built GovBench: 3,497 multiple-choice questions about UK legislation and parliamentary procedure. Gemma 3 27B scored 94% out of the box. I fine-tuned for three weeks and gained 1.37 points. GPT-5 hit 99.26%.

I'd built a glorified quiz bot. Measuring recall and calling it reasoning is a category error. A model that picks the right option about parliamentary procedure cannot help you navigate parliamentary procedure. That failure is what sent me looking for a keyhole into a game engine on a Saturday night.

Why a Hex Grid

Government decisions compound. A health policy looks brilliant today, then cascades into a housing crisis fifteen years later. Civilization VI simulates exactly that kind of emergent complexity. The decision space hits $10^{166}$ possible actions per turn by the late game. Six victory conditions (science, culture, domination, religion, diplomacy, score) mean no single objective dominates. You have to read the board and decide what game you're playing.

I found a debug port buried in Civ VI's engine. Over a weekend I turned it into an MCP server with 76 tools. Claude Code was both co-developer and playtester. The agent plays through text: no map, no minimap, no music cues. A single get_game_overview call returns the entire state as four lines of raw tags. get_units lists units with coordinates and HP, but nearby threats appear only if the agent explicitly asks.

Playing Through Text

A human sees a hex grid. The agent sees nothing until it calls a tool. It has no peripheral vision. That Man-at-Arms two tiles from a city exists only because the agent asked for nearby threats.
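A minimal sketch of what "no peripheral vision" means in practice. Only the tool name get_units comes from the post; the stub game state, the get_nearby_threats helper, the return shapes, and the distance rule are all assumptions made for illustration, not the actual CivBench server.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    owner: str
    x: int
    y: int
    hp: int


@dataclass
class StubGame:
    """Hypothetical stand-in for the game state behind the MCP server."""

    player: str
    units: list[Unit] = field(default_factory=list)

    def get_units(self) -> list[str]:
        # Lists only the player's own units: coordinates and HP, no threats.
        return [
            f"{u.name} @({u.x},{u.y}) HP {u.hp}"
            for u in self.units
            if u.owner == self.player
        ]

    def get_nearby_threats(self, x: int, y: int, radius: int = 2) -> list[str]:
        # Hypothetical tool: foreign units within `radius` tiles of (x, y).
        # Chebyshev distance is a simplification; real hex distance differs.
        return [
            f"{u.owner} {u.name} @({u.x},{u.y})"
            for u in self.units
            if u.owner != self.player
            and max(abs(u.x - x), abs(u.y - y)) <= radius
        ]


game = StubGame(
    player="Agent",
    units=[
        Unit("Warrior", "Agent", 10, 10, 100),
        Unit("Man-at-Arms", "France", 12, 10, 100),
    ],
)

observed = game.get_units()                # the Man-at-Arms is absent here
threats = game.get_nearby_threats(10, 10)  # it appears only on explicit request
print(observed)
print(threats)
```

In this toy version the enemy unit is in the game state the whole time, but nothing the agent reads mentions it until the second call is made.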

In the test run, the agent built a trade network, dominated alliances, and set up a diplomatic victory. It didn't notice France's quiet cultural invasion. By the time the agent recognised the threat, tourism was embedded in every city. Every peaceful counter failed. It built two nuclear devices and levelled Toulouse. France still won, and not culturally either: it took a diplomatic victory while the agent was busy nuking.

CivBench isn't a toy. It's a direct test of whether an AI can sustain a goal across hundreds of decisions, notice when the world has changed, and adapt. The answer so far: it can build a powerful strategy but miss the existential threat sitting inside its own cities. That's exactly the kind of failure mode we need to surface before we trust these systems with actual governance.

Source: I Gave an AI a Civilization to Run. It Built a Nuke - Launching CivBench
Domain: lwilko.com
