
Sophon PFG-1: 330 GB of DRAM, Zero HBM, 2,100 TFLOPS BF16

PhantaField's PFG-1 Sophon ASIC uses monolithic 3D TMD DRAM to pack 330 GB on-die, eliminating HBM and delivering 191x the weight bandwidth of HBM4, which enables FP8 inference at 14,438 tokens/s on 80B models at 3.72 TFLOPS/W.

Tags: phantafield, sophon pfg 1, monolithic 3d, tmd dram, hbm, ai accelerator

PhantaField’s PFG-1 Sophon ASIC packs 330 GB of on-die DRAM into a 750 mm² monolithic 3D die — and that single chip delivers 2,100 TFLOPS BF16 training while providing 191x the weight-fetch bandwidth of an NVIDIA Rubin with HBM4.

How Sophon Kills the HBM Bottleneck

HBM is the bottleneck. Every modern GPU at low batch is bandwidth-bound, serializing weight fetches through a ~22 TB/s (Rubin) or ~19.6 TB/s (MI455X) HBM4 path. Sophon replaces that with on-die 2T0C gain-cell DRAM built from 2D transition-metal dichalcogenide (TMD) transistors. The result: 191x (versus Rubin) to 214x (versus MI455X) the weight bandwidth of an HBM4 package, a gap no HBM roadmap closes.
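To make the bandwidth argument concrete, here is a rough sketch of the arithmetic behind those ratios. It assumes the 191x figure pairs with Rubin (as the intro states) and the 214x figure with MI455X (an inference from the quoted range), and it uses the common roofline approximation that batch-1 decode streams every weight once per token. The function names are illustrative, not from PhantaField.

```python
# HBM4 bandwidths quoted in the article, in TB/s.
HBM4_BW_TBPS = {"Rubin": 22.0, "MI455X": 19.6}
# ASSUMPTION: 191x pairs with Rubin (per the intro), 214x with MI455X.
QUOTED_RATIO = {"Rubin": 191, "MI455X": 214}


def implied_on_die_bw_tbps(part: str) -> float:
    """On-die weight bandwidth implied by the quoted ratio for one part."""
    return HBM4_BW_TBPS[part] * QUOTED_RATIO[part]


def batch1_decode_ceiling(bw_tbps: float, params_b: float, bytes_per_param: float) -> float:
    """Upper bound on tokens/s if each token must stream all weights once."""
    model_bytes = params_b * 1e9 * bytes_per_param
    return bw_tbps * 1e12 / model_bytes


for part, bw in HBM4_BW_TBPS.items():
    on_die = implied_on_die_bw_tbps(part)
    ceiling = batch1_decode_ceiling(bw, params_b=80, bytes_per_param=1.0)  # FP8 weights
    print(f"{part}: implied on-die BW ~{on_die:.0f} TB/s, "
          f"HBM4 batch-1 FP8 ceiling ~{ceiling:.0f} tok/s")
```

Under these assumptions both ratios point at roughly the same on-die figure, about 4.2 PB/s, while an 80B FP8 model served from HBM4 at batch 1 tops out at a few hundred tokens per second per stream.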

That bandwidth comes from digital compute-in-memory: each 256×256 DRAM subarray tile pairs a sense amp with an 8-level adder tree, driven by a 500 MHz bit-serial activation broadcast. 131,072 tiles per die yield 4,200 TFLOPS FP8 and 2,100 TFLOPS BF16. The die uses a 28 nm Si CMOS base tier with a 32-tier TMD MAC stack above it, all connected by monolithic inter-tier vias (MIVs). No HBM stacks, no interposer, no $2M rack memory line item.
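One way the quoted tile count, clock, and peak FLOPS line up is sketched below. The key assumption, which the article does not spell out, is that each tile completes one 256-wide dot product per bit-serial pass and that a pass takes as many cycles as the activation has bits (8 for FP8, 16 for BF16). An 8-level binary adder tree sums exactly 256 inputs, which matches the subarray width.

```python
TILES = 131_072      # tiles per die, from the article
ROW_WIDTH = 256      # subarray dimension, from the article
CLOCK_HZ = 500e6     # bit-serial broadcast clock, from the article


def adder_tree_inputs(levels: int) -> int:
    """Number of operands a binary adder tree with `levels` levels can sum."""
    return 2 ** levels


def peak_tflops(bits_per_activation: int) -> float:
    """Peak TFLOPS under the ASSUMED model: 256 MACs per tile per pass,
    one pass lasting `bits_per_activation` clock cycles."""
    macs_per_second = TILES * ROW_WIDTH * CLOCK_HZ / bits_per_activation
    return 2 * macs_per_second / 1e12  # 1 MAC = 2 FLOPs


print(adder_tree_inputs(8))       # operands per tile reduction
print(round(peak_tflops(8)))      # FP8 estimate
print(round(peak_tflops(16)))     # BF16 estimate
```

Under this model the estimates come out at about 4,194 and 2,097 TFLOPS, which round to the 4,200 and 2,100 figures quoted above. That agreement is suggestive, not proof that the hardware is organized this way.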

Training and Inference on One Die

Training an 80B model? Sophon fits weights, gradients, and optimizer state entirely on-die with ~10 GB of headroom for gradient-checkpointed micro-batches. That’s a single die that trains at 2,406 tokens/s BF16 (0.23 J/tok) and then serves the same model at 7,219 tokens/s native BF16 or 14,438 tokens/s FP8 — without swapping hardware.
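The capacity claim implies a fairly tight per-parameter budget, which the short calculation below makes explicit. It only rearranges numbers quoted above; how the budget is split between weights, gradients, and optimizer state is not given in the article, so no split is assumed here.

```python
CAPACITY_GB = 330   # on-die DRAM, from the article
HEADROOM_GB = 10    # approximate headroom quoted for micro-batches
PARAMS_B = 80       # model size, billions of parameters

TRAIN_TOK_S, BF16_TOK_S, FP8_TOK_S = 2406, 7219, 14438  # quoted throughputs


def bytes_per_param_budget(capacity_gb: float, headroom_gb: float, params_b: float) -> float:
    """Bytes per parameter left for weights + gradients + optimizer state."""
    return (capacity_gb - headroom_gb) / params_b


budget = bytes_per_param_budget(CAPACITY_GB, HEADROOM_GB, PARAMS_B)
inference_vs_training = BF16_TOK_S / TRAIN_TOK_S
fp8_vs_bf16 = FP8_TOK_S / BF16_TOK_S

print(f"training-state budget: {budget:.1f} bytes/param")
print(f"BF16 inference vs training: {inference_vs_training:.2f}x")
print(f"FP8 vs BF16 inference: {fp8_vs_bf16:.2f}x")
```

The budget works out to 4 bytes per parameter. The 3x gap between inference and training throughput is consistent with the usual rule of thumb that a backward pass costs about twice a forward pass, and the 2x FP8 gain is what halving the operand width would give on a bit-serial datapath.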

Energy per MAC is 0.620 pJ for BF16 forward, 0.940 pJ for forward+backward. Peak efficiency hits 3.72 TFLOPS/W on BF16 training average. Idle power collapses to ~3 W because the TMD DRAM retains data for seconds without refresh; refresh overhead is only 0.08 W. Compare that to a 288–432 GB HBM4 subsystem that draws 10–15 W just to keep the model resident.
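As a sanity check, the energy-per-MAC figure can be tied back to the 0.23 J/tok training number quoted earlier. The sketch assumes about 3 MACs per parameter per training token (one forward, two backward), which is a standard approximation and not something the article states. The power values it derives are estimates, not published specs.

```python
PARAMS_B = 80                 # billions of parameters
PJ_PER_MAC_TRAIN = 0.940      # forward+backward energy per MAC, from the article
TRAIN_TOK_S = 2406            # training throughput, from the article
PEAK_BF16_TFLOPS = 2100       # from the article
TFLOPS_PER_WATT = 3.72        # from the article


def joules_per_token(pj_per_mac: float, params_b: float, macs_per_param: float) -> float:
    """Energy per token given energy per MAC and MACs per parameter per token."""
    return pj_per_mac * 1e-12 * params_b * 1e9 * macs_per_param


# ASSUMPTION: 3 MACs per parameter per training token (1 forward + 2 backward).
train_j_per_tok = joules_per_token(PJ_PER_MAC_TRAIN, PARAMS_B, 3)
power_from_energy_w = train_j_per_tok * TRAIN_TOK_S
power_from_efficiency_w = PEAK_BF16_TFLOPS / TFLOPS_PER_WATT

print(f"{train_j_per_tok:.4f} J/tok")
print(f"{power_from_energy_w:.0f} W from energy x throughput")
print(f"{power_from_efficiency_w:.0f} W from peak TFLOPS / efficiency")
```

Both routes land in the same range, roughly 540 to 565 W under load, and the per-token energy rounds to the quoted 0.23 J/tok.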

The Economics Are Brutal for NVIDIA and AMD

Morgan Stanley estimates a single NVIDIA VR200 (Rubin) NVL72 rack at $7.8M — with HBM alone costing $2.0M (25.7% of the rack). Sophon’s BOM is $8,358 per die. That’s a 9.9x reduction in hardware cost versus Rubin for equivalent 80B model throughput. Against an AMD MI455X, it’s 11.6x cheaper.

Sophon delivers ~2.7–3.1x higher 80B batch-1 training throughput per die and ~48–53x higher single-stream FP8 decode throughput than those 2026 HBM4 parts. The peak dense FLOPS of the GPUs are higher, but at low batch — where real serving lives — weight-memory bandwidth is the dictator. Sophon owns that dictator.

This is the first chip I’ve seen that treats the memory hierarchy as a physics problem rather than a packaging problem. If PhantaField scales to larger die stacks or higher tier counts, the HBM era ends.


Source: Sophon PFG-1: a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM
Domain: phantafield.com
