Source linked

Les petits modèles hétérogènes alimentent un drame financier sans divulguer de secrets

Quatre petits modèles de laboratoires différents exécutent un jeu de finance avec un pare-feu de vérité qui empêche les fuites zéro sur toutes les prompts, et un 0.5B fine-tune bat son professeur 3B sur la validité de l'offre.

small modelsmulti agent systemsvllmopenbmbnvidiaopenai

A 0.5B fine-tuned model achieves 100% valid offers and 0 self-buys, outperforming its 3B teacher in a multi-agent finance drama. That's the headline from Thousand Token Wood v2, a build-hackathon project where four small models from different labs run an emergent economy game—and the player is a shadow financier pulling the strings.

Four Labs, One Serving Layer, Zero Refactors

The council runs four distinct models: gpt-oss-20b (OpenAI), MiniCPM3-4B (OpenBMB), Nemotron-Mini-4B (NVIDIA), and a fine-tuned Qwen 0.5B (my own). The point is not novelty—a market is interesting when participants genuinely differ. The owl hoards differently than the fox speculates. But the real lesson: the friction is almost entirely at the serving layer, not the modeling layer. Current vLLM (0.22.1) JIT-compiles kernels at load and needs nvcc present. A lean base image doesn't ship it, so all four models failed identically until I based them on a CUDA devel image. One image fix unblocked everything. gpt-oss-20b runs in native MXFP4 quantization and fits a 24GB L4; MiniCPM3 needed trust_remote_code; Nemotron loaded clean. A tolerant JSON parse-and-repair layer handles each model's malformed outputs. Build that once and adding a model is a config entry, not a refactor.

The Truth Firewall: Why Leaks Are a Security Property, Not a UI Nicety

The dramatic core is the insider tip: you whisper a tip (true or false) to a creature. Acting on a true tip and profiting raises your heat. The magistrate investigates, fines, freezes assets, or exiles you. For that to be a real game, the truth flag must never reach the model—it would leak. The fix: the hidden flag lives off-prompt entirely (on the player's ledger), stripped from the public event record at construction. The narrator only summarizes public events. A single test scans every creature's full prompt every turn for banned tokens. That test is the most important one. Results: 0 leaks of a tip's hidden flag across every prompt scanned.

Bound Memory, Emergent Behavior, and the 0.5B That Punches Above Its Weight

Creatures carry persistent sentiment (toward the Patron and each other), nudged by events. The trap is prompt inflation. Raw history drowns a small model. The fix: never put history in the prompt. The model sees a one-line bucketed summary ("you feel warmly toward Oona"), capped to the few strongest feelings, derived from integer sentiment. Notes are kept but bounded and never shown. The behavioral bias is part emergent and part mechanical—testable, not a hope. The fine-tuned 0.5B Qwen hit 0% self-buys and 100% valid offers, beating its 3B teacher on reliability. A representative run showed a true-tip pre-position settling a positive P&L, a false tip not, and a margin call triggering a creature's banishment.

Small models are reliable format generators and unreliable reasoners. This project closes the gap with structure, prompting, and a small fine-tune—no scale required. The next step? Heterogeneous councils where the models' differences become the product, not the constraint.


Source: Five labs, five minds: building a multi-model finance drama on small models
Domain: huggingface.co

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.