
Sophon PFG-1: 330 GB On-Die DRAM, Zero HBM, 2,100 TFLOPS BF16

PhantaField's PFG-1 Sophon ASIC uses monolithic 3D TMD DRAM to pack 330 GB on-die, eliminate HBM, and deliver 191x the weight bandwidth of HBM4, enabling 14,438 tokens/s FP8 inference on 80B models at 3.72 TFLOPS/W.

phantafield, sophon pfg 1, monolithic 3d, tmd dram, hbm, ai accelerator

PhantaField’s PFG-1 Sophon ASIC packs 330 GB of on-die DRAM into a 750 mm² monolithic 3D die — and that single chip delivers 2,100 TFLOPS BF16 training while providing 191x the weight-fetch bandwidth of an NVIDIA Rubin with HBM4.

How Sophon Kills the HBM Bottleneck

HBM is the bottleneck. Every modern GPU at low batch is bandwidth-bound, serializing weight fetches through a ~22 TB/s (Rubin) or ~19.6 TB/s (MI455X) HBM4 path. Sophon replaces that with on-die 2T0C gain-cell DRAM built from 2D transition-metal dichalcogenide (TMD) transistors. The result: 191–214x the weight bandwidth of an HBM4 package — a gap no HBM roadmap closes.
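To make the bandwidth-bound claim concrete, here is a minimal back-of-envelope sketch. It assumes the usual roofline simplification that batch-1 decode streams every weight once per token, an 80B-parameter model, and FP8 weights at 1 byte each. The HBM4 bandwidths and the 191x/214x ratios are the article's figures; pairing 214x with the MI455X is an inference from the numbers, and the function names are illustrative only.

```python
# Figures quoted in the article.
HBM4_BANDWIDTH_TB_S = {"Rubin": 22.0, "MI455X": 19.6}
# 191x is quoted against Rubin; pairing 214x with MI455X is an ASSUMPTION.
QUOTED_BANDWIDTH_RATIO = {"Rubin": 191, "MI455X": 214}

# Assumptions for illustration.
PARAMS = 80e9           # 80B-parameter model
BYTES_PER_PARAM = 1.0   # FP8 weights


def bandwidth_bound_tokens_per_s(bandwidth_tb_s, params=PARAMS,
                                 bytes_per_param=BYTES_PER_PARAM):
    """Upper bound on batch-1 decode rate if each token streams all weights once."""
    return bandwidth_tb_s * 1e12 / (params * bytes_per_param)


def implied_on_die_bandwidth_tb_s(gpu):
    """On-die weight bandwidth implied by the quoted ratio for one GPU."""
    return HBM4_BANDWIDTH_TB_S[gpu] * QUOTED_BANDWIDTH_RATIO[gpu]


if __name__ == "__main__":
    for gpu, bw in HBM4_BANDWIDTH_TB_S.items():
        print(f"{gpu}: at most {bandwidth_bound_tokens_per_s(bw):.0f} tokens/s; "
              f"implied on-die bandwidth ~{implied_on_die_bandwidth_tb_s(gpu):,.0f} TB/s")
```

Under these assumptions the HBM4 ceiling is 275 tokens/s for Rubin and 245 tokens/s for the MI455X, and both quoted ratios point to roughly 4,200 TB/s of on-die weight bandwidth.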

That bandwidth comes from digital compute-in-memory: each 256×256 DRAM subarray tile pairs a sense amp with an 8-level adder tree, driven by a 500 MHz bit-serial activation broadcast. The 131,072 tiles per die yield 4,200 TFLOPS FP8 and 2,100 TFLOPS BF16. The die uses a 28 nm Si CMOS base tier with a 32-tier TMD MAC stack above it, all connected by monolithic inter-tier vias (MIVs). No HBM stacks, no interposer, no $2M rack memory line item.
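One way the tile count, clock, and peak FLOPS fit together can be sketched as follows. This is an assumed reading, not PhantaField's published microarchitecture: each tile is taken to finish one 256-input dot product per bit-serial pass (an 8-level binary adder tree reduces 2^8 = 256 inputs), and a pass is taken to cost one clock cycle per activation bit, so 8 cycles for FP8 and 16 for BF16.

```python
TILES_PER_DIE = 131_072   # from the article
CLOCK_HZ = 500e6          # from the article
MACS_PER_PASS = 256       # ASSUMPTION: one 256-input dot product per tile per pass


def peak_tflops(bits_per_activation):
    """Peak TFLOPS if a bit-serial pass takes one cycle per activation bit (assumed)."""
    passes_per_s = CLOCK_HZ / bits_per_activation
    macs_per_s = TILES_PER_DIE * MACS_PER_PASS * passes_per_s
    return 2 * macs_per_s / 1e12  # 1 MAC = 2 FLOPs (multiply + add)


if __name__ == "__main__":
    print(f"FP8  (8 cycles/pass):  {peak_tflops(8):,.1f} TFLOPS")
    print(f"BF16 (16 cycles/pass): {peak_tflops(16):,.1f} TFLOPS")
```

Under that reading the arithmetic gives about 4,194 TFLOPS for FP8 and 2,097 TFLOPS for BF16, which round to the quoted 4,200 and 2,100.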

Training and Inference on One Die

Training an 80B model? Sophon fits weights, gradients, and optimizer state entirely on-die with ~10 GB of headroom for gradient-checkpointed micro-batches. That’s a single die that trains at 2,406 tokens/s BF16 (0.23 J/tok) and then serves the same model at 7,219 tokens/s native BF16 or 14,438 tokens/s FP8 — without swapping hardware.
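The capacity and throughput figures above can be cross-checked with a short sketch. It assumes the common rule of thumb of about 6 FLOPs per parameter per token for training and 2 for a forward pass; that rule is not from the article, and neither is the breakdown of on-die state, which is only inferred here from the 330 GB capacity and ~10 GB headroom.

```python
PARAMS = 80e9               # 80B-parameter model (article)
ON_DIE_GB = 330.0           # article
HEADROOM_GB = 10.0          # article (approximate)
PEAK_BF16_TFLOPS = 2100.0   # article
TRAIN_TOK_S = 2406.0        # article
INFER_BF16_TOK_S = 7219.0   # article


def implied_bytes_per_param():
    """Bytes of on-die state per parameter implied by capacity minus headroom."""
    return (ON_DIE_GB - HEADROOM_GB) * 1e9 / PARAMS


def sustained_tflops(tokens_per_s, flops_per_param_per_token):
    """Sustained TFLOPS under the 6N (train) / 2N (forward) ASSUMPTION."""
    return tokens_per_s * flops_per_param_per_token * PARAMS / 1e12


if __name__ == "__main__":
    train = sustained_tflops(TRAIN_TOK_S, 6)
    infer = sustained_tflops(INFER_BF16_TOK_S, 2)
    print(f"state per parameter: {implied_bytes_per_param():.1f} bytes")
    print(f"training:  {train:,.0f} TFLOPS ({train / PEAK_BF16_TFLOPS:.0%} of peak)")
    print(f"inference: {infer:,.0f} TFLOPS ({infer / PEAK_BF16_TFLOPS:.0%} of peak)")
```

Read this way, the budget is about 4 bytes of state per parameter, and both the training and BF16 inference rates correspond to roughly 1,155 sustained TFLOPS, about 55% of peak. That would explain why the inference rate is almost exactly three times the training rate.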

Energy per MAC is 0.620 pJ for the BF16 forward pass and 0.940 pJ for forward plus backward. Efficiency reaches 3.72 TFLOPS/W, averaged over BF16 training. Idle power collapses to ~3 W because the TMD DRAM retains data for seconds without refresh; refresh overhead is only 0.08 W. Compare that to a 288–432 GB HBM4 subsystem that draws 10–15 W just to keep the model resident.
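The per-MAC energies can be tied back to the 0.23 J/tok training figure with a short sketch. It assumes about 3 MACs per parameter per token for forward plus backward and 1 MAC per parameter for forward only, which is the same 6N/2N FLOPs rule of thumb expressed in MACs and is not stated in the article. The result counts compute energy only.

```python
PARAMS = 80e9                # 80B-parameter model (article)
PJ_PER_MAC_FORWARD = 0.620   # article
PJ_PER_MAC_FWD_BWD = 0.940   # article
TRAIN_TOK_S = 2406.0         # article


def joules_per_token(pj_per_mac, macs_per_param):
    """Compute energy per token; macs_per_param is an ASSUMPTION (3 train, 1 forward)."""
    return pj_per_mac * 1e-12 * macs_per_param * PARAMS


def compute_power_w(j_per_token, tokens_per_s):
    """Average compute power implied by energy per token and throughput."""
    return j_per_token * tokens_per_s


if __name__ == "__main__":
    train_j = joules_per_token(PJ_PER_MAC_FWD_BWD, 3)
    fwd_j = joules_per_token(PJ_PER_MAC_FORWARD, 1)
    print(f"training: {train_j:.4f} J/token")
    print(f"forward:  {fwd_j:.4f} J/token")
    print(f"training compute power: {compute_power_w(train_j, TRAIN_TOK_S):.0f} W")
```

With those assumptions, training comes to about 0.226 J per token, matching the quoted 0.23 J/tok, and a forward pass to about 0.05 J per token. At 2,406 tokens/s, the training figure implies roughly 540 W of compute power.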

The Economics Are Brutal for NVIDIA and AMD

Morgan Stanley estimates a single NVIDIA VR200 (Rubin) NVL72 rack at $7.8M — with HBM alone costing $2.0M (25.7% of the rack). Sophon’s BOM is $8,358 per die. That’s a 9.9x reduction in hardware cost versus Rubin for equivalent 80B model throughput. Against an AMD MI455X, it’s 11.6x cheaper.

Sophon delivers ~2.7–3.1x higher 80B batch-1 training throughput per die and ~48–53x higher single-stream FP8 decode throughput than those 2026 HBM4 parts. The peak dense FLOPS of the GPUs are higher, but at low batch — where real serving lives — weight-memory bandwidth is the dictator. Sophon owns that dictator.

This is the first chip I’ve seen that treats the memory hierarchy as a physics problem rather than a packaging problem. If PhantaField scales to larger die stacks or higher tier counts, the HBM era ends.


Source: Sophon PFG-1: a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM
Domain: phantafield.com
