Linear Probes on Pangram 3.3.2 Reach 100% Accuracy by Layer 24 - Here's How Its Internal Representations Evolve

A linear probe on layer 2 of Pangram 3.3.2 already reaches 0.83 binary accuracy at distinguishing human from AI text.

That's about where a bag-of-words model lands, but Pangram's model is an LLM fine-tuned for sequence classification - no perplexity, no burstiness, no manual feature extraction. The real story is how its internal representations evolve across 24 layers. The team built an interactive explorer that exposes PCA, UMAP, and t-SNE projections of hidden states at every even layer.

Probing the Network Layer by Layer

Pangram trained linear probes on 500 balanced samples at each even layer, using an 80/20 train/test split (400 samples for training, 100 held out). Probe accuracy climbs steadily with depth, reaching 1.0 at layer 24. That's perfect separation on a held-out evaluation set spanning 20 model families and 12 source domains. Early layers already encode strong signals - the model isn't waiting until the final readout to figure out what's AI.

Each probe asks a simple question: can a linear classifier recover the human/AI label from the activations at that layer? High probe accuracy means that distinction is already present in a linearly accessible direction of the representation space. By layer 24, the answer is a clean yes.
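A probe of this kind can be sketched in a few lines. The example below is a minimal illustration, not Pangram's code: it assumes synthetic Gaussian activations in place of real hidden states, and the function names, the hidden size of 64, the class separation, and the gradient-descent settings are all hypothetical choices. It fits a logistic-regression probe on 80% of 500 balanced samples and reports accuracy on the remaining 20%.

```python
import numpy as np


def make_synthetic_activations(n_samples=500, hidden_dim=64, separation=6.0, seed=0):
    """Stand-in for one layer's hidden states (assumption: not real Pangram data)."""
    rng = np.random.default_rng(seed)
    labels = np.arange(n_samples) % 2           # balanced: 0 = human, 1 = AI
    direction = rng.normal(size=hidden_dim)
    direction /= np.linalg.norm(direction)      # unit vector that separates the classes
    acts = rng.normal(size=(n_samples, hidden_dim))
    acts += separation * (labels[:, None] - 0.5) * direction
    return acts, labels


def train_linear_probe(x, y, lr=0.1, steps=500):
    """Logistic regression fitted with plain gradient descent."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        error = probs - y                       # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b


def probe_accuracy(acts, labels, train_frac=0.8, seed=0):
    """Train on a random train_frac of the samples, score on the rest."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(labels))
    n_train = int(train_frac * len(labels))
    train_idx, test_idx = order[:n_train], order[n_train:]
    # Standardize using training statistics only, so nothing leaks from the test set.
    mean = acts[train_idx].mean(axis=0)
    std = acts[train_idx].std(axis=0) + 1e-8
    x_train = (acts[train_idx] - mean) / std
    x_test = (acts[test_idx] - mean) / std
    w, b = train_linear_probe(x_train, labels[train_idx])
    preds = (x_test @ w + b > 0).astype(int)
    return float((preds == labels[test_idx]).mean())


if __name__ == "__main__":
    # Larger separation loosely mimics a deeper layer where the classes are easier to split.
    for separation in (1.0, 3.0, 6.0):
        acts, labels = make_synthetic_activations(separation=separation)
        print(f"separation={separation}: held-out accuracy={probe_accuracy(acts, labels):.2f}")
```

With real activations, `acts` would be one layer's hidden states pooled to a single vector per document; how Pangram pools them is not stated in the source, so that step is left out here.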

What the Activations Reveal About AI vs. Human Text

PCA projections late in the network show most of the variance concentrated in principal components 1 and 2, with human and AI samples forming distinct clusters. UMAP preserves neighborhood structure: AI texts from the same model family - say, Claude Opus 4.5 or GPT-5.2 - cluster together. Human texts from Creative Writing cluster separately from those drawn from Wikipedia or Product Reviews.
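To make "variance concentrated in principal components 1 and 2" concrete, here is a rough sketch of a PCA projection computed with an SVD. It is an illustration under stated assumptions, not the explorer's implementation: the activations are synthetic, and `synthetic_layer`, `pca_project`, and the sizes used are hypothetical names and values.

```python
import numpy as np


def synthetic_layer(n_samples=200, hidden_dim=32, seed=0):
    """Fake activations driven by two strong latent factors plus small noise (assumption)."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, 2)) * np.array([5.0, 3.0])
    mixing = rng.normal(size=(2, hidden_dim))
    noise = 0.1 * rng.normal(size=(n_samples, hidden_dim))
    return latent @ mixing + noise


def pca_project(acts, n_components=2):
    """Return 2-D coordinates and the share of variance each kept component explains."""
    centered = acts - acts.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    coords = centered @ vt[:n_components].T
    return coords, explained_ratio[:n_components]


if __name__ == "__main__":
    coords, ratio = pca_project(synthetic_layer())
    print("projected shape:", coords.shape)
    print("variance explained by PC1 + PC2:", float(ratio.sum()))
```

When the two ratios sum to nearly 1, the 2-D scatter plot loses little information; when they do not, clusters that look merged in the plot may still be separable in the full space.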

This is not a single binary axis. The model learns rich semantic structure that correlates with both the source domain and the generating model. For researchers, seeing that the model picks up on model-family-specific patterns is useful for spotting shortcutting or unintended correlations.

The Dataset Behind the Dots

The interpretability dataset uses a balanced 5,000-document subset (half human, half AI) drawn from Pangram's production training set. AI samples include Claude 3.7 Sonnet, Claude Sonnet 4, Claude Opus 4.1, GPT-4o, GPT-5.2, Gemini 2.5 Pro, DeepSeek R1, Qwen 3 235B, Llama 3.1 70B, and a dozen others. Source domains range from News and Scientific Abstracts to Reddit ELI5 and self-published books. Human text comes from Project Gutenberg, Wikipedia, ESL corrections, and more.
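Drawing a half-human, half-AI subset from a larger pool can be sketched as follows. This is only an illustration of balanced sampling: the record layout (a dict with a `label` field holding `"human"` or `"ai"`) and the function name `balanced_subset` are assumptions, and the source does not describe how Pangram actually selected its documents.

```python
import random


def balanced_subset(docs, per_class, seed=0):
    """Sample per_class human and per_class AI documents, then shuffle them together."""
    rng = random.Random(seed)
    human = [d for d in docs if d["label"] == "human"]
    ai = [d for d in docs if d["label"] == "ai"]
    if len(human) < per_class or len(ai) < per_class:
        raise ValueError("not enough documents in one class for a balanced subset")
    subset = rng.sample(human, per_class) + rng.sample(ai, per_class)
    rng.shuffle(subset)
    return subset


if __name__ == "__main__":
    # Hypothetical pool: 30 human and 70 AI documents.
    pool = [{"id": i, "label": "human" if i < 30 else "ai"} for i in range(100)]
    subset = balanced_subset(pool, per_class=10)
    print(len(subset), sum(d["label"] == "ai" for d in subset))
```

For a 5,000-document subset, `per_class` would be 2,500.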

That breadth matters. The model has to generalize across wildly different writing styles and model lineages. The interactive explorer lets you pick a layer and a projection method, then inspect which points belong to which model family or domain.

Pangram's interpretability work, applied retroactively to versions 3.1 and 3.2, gives us a window into what an AI detector actually sees - and it's a lot more than just the word "delve" or the overuse of em-dashes. The next step is to use these probes to prevent shortcutting and fix unintended model behavior, layer by layer.

Source: Exploring the internal representations of Pangram 3.3.2
Domain: pangram.com
