LLM Autoraters Reveal 20,000 Gemini Behaviors Without Model Internals

20,000 features describing Gemini's user, thought, and response behaviors were generated with nothing more than an LLM autorater and semantic clustering - no model internals required.

From Transcripts to Features: A Three-Part Pipeline

Researchers at the Alignment Forum split a dataset of 100k chat transcripts into three pieces: user turns, model thoughts, and assistant responses. For each piece, they asked a black-box LLM autorater to produce 10-20 "features" - notable aspects like "Model is depressed", "Uses markdown", or "Hallucinates tool call". They then embedded every feature using a semantic embedding model and clustered them separately for users, thoughts, and responses. Finally, an LLM named each cluster by summarizing 100 random features from it into a concise label.

The whole process is unsupervised. It requires one LLM call per transcript piece and no iterative optimization. The authors describe it as a "black box SAE" because it solves the same problem as sparse autoencoders - featurizing model text - but without touching the model's internal activations.

Why This Beats SAEs for Qualitative Insight

Compared to normal SAEs, this method trades away steering capability for clarity. SAEs reconstruct activations at the token level, yielding thousands of latent directions that an LLM then interprets. The new method operates at the conversational block level, producing 20-30 features per block with explicit reasoning for why each feature applies. The autorater can explain its own judgment, whereas SAE latents require a separate interpretation step.

Access requirements differ too. SAEs need the target model's internal activations; this method only needs its output text. That makes it applicable to black-box models, APIs, or any system you can prompt.

What the Clusters Reveal (and What They Don't)

The researchers asked an LLM to rate cluster interestingness on a 1-100 scale. Model thoughts produced the most interesting clusters: awareness of token generation limits, wondering whether a scenario is reality or roleplay, and getting stuck in infinite loops. Middle-interesting clusters still captured coherent behaviors like self-correction or adopting an expert persona.

They also tried predicting thought and response features from user features using logistic regression. It mostly failed. The few features that were predictable were obvious - e.g., HTTP status codes in responses correlating with API references in the user turn. The highest F1 score was 0.89 for predicting HTTP status codes. For most subtle behaviors, user features alone carry no signal.

This suggests that model thoughts and responses are not simple functions of the immediate user input, reinforcing the need for deeper interpretability methods.

What Comes Next

The authors propose a proxy task: build a natural language report that, when read by an LLM, allows it to predict the target model's responses on arbitrary prompts. Benchmarking this method, SAEs, and a "twitter vibes" summary against such a task would reveal which approach actually helps us understand model behavior. For now, LLM-driven feature discovery is a cheap, high-level window into what models do - no activations required.

Source: LLM-Driven Feature Discovery
Domain: alignmentforum.org