PhantaField’s PFG-1 Sophon ASIC packs 330 GB of on-die DRAM into a 750 mm² monolithic 3D die — and that single chip delivers 2,100 TFLOPS BF16 training while providing 191x the weight-fetch bandwidth of an NVIDIA Rubin with HBM4.
How Sophon Kills the HBM Bottleneck
HBM is the bottleneck. Every modern GPU at low batch is bandwidth-bound, serializing weight fetches through a ~22 TB/s (Rubin) or ~19.6 TB/s (MI455X) HBM4 path. Sophon replaces that with on-die 2T0C gain-cell DRAM built from 2D transition-metal dichalcogenide (TMD) transistors. The result: 191–214x the weight bandwidth of an HBM4 package — a gap no HBM roadmap closes.
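To make the bandwidth argument concrete, here is a rough sketch of the arithmetic behind it. It assumes a simple roofline-style bound in which batch-1 decode reads every weight once per token, and an 80B-parameter model stored at one byte per weight (FP8). Neither assumption is spelled out in PhantaField's material, and the function names are purely illustrative.

```python
# Back-of-envelope sketch. Bandwidths and ratios are the figures quoted above;
# the one-weight-read-per-token model is an assumption, not a vendor claim.
HBM4_TBPS = {"Rubin": 22.0, "MI455X": 19.6}
CLAIMED_RATIO = {"Rubin": 191, "MI455X": 214}


def implied_sophon_tbps(gpu):
    """Aggregate Sophon weight bandwidth implied by the quoted ratio."""
    return HBM4_TBPS[gpu] * CLAIMED_RATIO[gpu]


def decode_ceiling_tok_s(bandwidth_tbps, params=80e9, bytes_per_weight=1):
    """Upper bound on batch-1 decode rate if each token streams all weights."""
    return bandwidth_tbps * 1e12 / (params * bytes_per_weight)


for gpu, tbps in HBM4_TBPS.items():
    print(gpu,
          f"implied Sophon bandwidth ~{implied_sophon_tbps(gpu):.0f} TB/s,",
          f"HBM4 decode ceiling ~{decode_ceiling_tok_s(tbps):.0f} tok/s")
```

Both ratios point to roughly 4,200 TB/s of aggregate on-die weight bandwidth, while the HBM4 parts top out at a few hundred tokens per second per stream under this simple model.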
That bandwidth comes from digital compute-in-memory: each 256×256 DRAM subarray tile pairs a sense amp with an 8-level adder tree, driven by a 500 MHz bit-serial activation broadcast. The 131,072 tiles per die yield 4,200 TFLOPS FP8 and 2,100 TFLOPS BF16. The die places a 32-tier TMD MAC stack above a 28 nm Si CMOS base tier, with monolithic inter-tier vias (MIVs) connecting the tiers. No HBM stacks, no interposer, no $2M rack memory line item.
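The bit-serial scheme is easier to picture in code. The following is a minimal behavioral sketch of one 256-input column, not PhantaField's actual circuit: it assumes unsigned 8-bit activations broadcast LSB first and small signed integer weights, since the number formats inside the tile are not described. The names adder_tree and bit_serial_dot are hypothetical.

```python
def adder_tree(values):
    """Pairwise-reduce a power-of-two-length list; return (sum, tree depth)."""
    levels = 0
    while len(values) > 1:
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels


def bit_serial_dot(weights, activations, act_bits=8):
    """Dot product computed one activation bit-plane per broadcast cycle."""
    acc = 0
    levels = 0
    for bit in range(act_bits):  # assumed LSB-first, one cycle per bit
        # Each cell gates its stored weight with the broadcast activation bit.
        gated = [w if (a >> bit) & 1 else 0
                 for w, a in zip(weights, activations)]
        plane_sum, levels = adder_tree(gated)
        acc += plane_sum << bit  # shift-and-accumulate across bit-planes
    return acc, levels


if __name__ == "__main__":
    import random
    random.seed(0)
    w = [random.randint(-8, 7) for _ in range(256)]
    a = [random.randint(0, 255) for _ in range(256)]
    result, depth = bit_serial_dot(w, a)
    print(result == sum(x * y for x, y in zip(w, a)), depth)
```

Reducing 256 inputs pairwise takes eight levels (2^8 = 256), which is consistent with the 8-level adder tree on a 256-wide subarray.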
Training and Inference on One Die
Training an 80B model? Sophon fits weights, gradients, and optimizer state entirely on-die with ~10 GB of headroom for gradient-checkpointed micro-batches. That’s a single die that trains at 2,406 tokens/s BF16 (0.23 J/tok) and then serves the same model at 7,219 tokens/s native BF16 or 14,438 tokens/s FP8 — without swapping hardware.
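A quick sanity check on those figures, as a sketch under stated assumptions: capacities are taken as decimal gigabytes, and the per-parameter split between weights, gradients, and optimizer state is not given in the source, so only the combined budget is computed.

```python
# Figures quoted in the text above.
PARAMS = 80e9
ON_DIE_GB = 330
HEADROOM_GB = 10
TRAIN_TOK_S = 2406
SERVE_BF16_TOK_S = 7219
SERVE_FP8_TOK_S = 14438


def implied_bytes_per_param(params=PARAMS, capacity_gb=ON_DIE_GB,
                            headroom_gb=HEADROOM_GB):
    """Combined training-state budget per parameter (decimal GB assumed)."""
    return (capacity_gb - headroom_gb) * 1e9 / params


print(implied_bytes_per_param())               # bytes of training state per parameter
print(SERVE_BF16_TOK_S / TRAIN_TOK_S)          # serving vs. training rate
print(SERVE_FP8_TOK_S / SERVE_BF16_TOK_S)      # FP8 vs. BF16 serving rate
```

The budget works out to about 4 bytes per parameter for all training state combined, which is tight. The roughly 3:1 serving-to-training ratio matches the usual rule of thumb that a training step costs about three forward passes of compute, and FP8 serving is exactly double the BF16 rate.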
Energy per MAC is 0.620 pJ for BF16 forward, 0.940 pJ for forward+backward. Efficiency averages 3.72 TFLOPS/W over BF16 training. Idle power collapses to ~3 W because the TMD DRAM retains data for seconds without refresh; refresh overhead is only 0.08 W. Compare that to a 288–432 GB HBM4 subsystem that draws 10–15 W just to keep the model resident.
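The per-MAC energy can be tied back to the per-token figure with a short calculation. This sketch assumes the 0.940 pJ value is an average over every MAC in a training step and that a step costs three MACs per parameter per token (one forward, two backward); both are assumptions rather than anything the source states.

```python
PARAMS = 80e9                  # model size quoted above
E_MAC_TRAIN_PJ = 0.940         # quoted forward+backward energy per MAC
TRAIN_TOK_S = 2406             # quoted BF16 training rate
MACS_PER_PARAM_PER_TOKEN = 3   # assumption: 1 forward + 2 backward


def joules_per_token(params=PARAMS, e_mac_pj=E_MAC_TRAIN_PJ,
                     macs_per_param=MACS_PER_PARAM_PER_TOKEN):
    """MAC energy spent per trained token, in joules."""
    return params * macs_per_param * e_mac_pj * 1e-12


def mac_power_watts(tok_s=TRAIN_TOK_S):
    """Average MAC power implied by the per-token energy and token rate."""
    return joules_per_token() * tok_s


print(f"{joules_per_token():.4f} J/tok")   # 0.2256 J/tok
print(f"{mac_power_watts():.0f} W")        # 543 W
```

Under these assumptions the result rounds to the 0.23 J/tok quoted above and implies roughly 540 W of MAC power while training, excluding anything the per-MAC figure does not cover.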
The Economics Are Brutal for NVIDIA and AMD
Morgan Stanley estimates a single NVIDIA VR200 (Rubin) NVL72 rack at $7.8M — with HBM alone costing $2.0M (25.7% of the rack). Sophon’s BOM is $8,358 per die. That’s a 9.9x reduction in hardware cost versus Rubin for equivalent 80B model throughput. Against an AMD MI455X, it’s 11.6x cheaper.
Sophon delivers ~2.7–3.1x higher 80B batch-1 training throughput per die and ~48–53x higher single-stream FP8 decode throughput than those 2026 HBM4 parts. The GPUs post higher peak dense FLOPS, but at low batch, where real serving lives, weight-memory bandwidth is the dictator, and Sophon owns it.
This is the first chip I’ve seen that treats the memory hierarchy as a physics problem rather than a packaging problem. If PhantaField scales to larger die stacks or higher tier counts, the HBM era ends.
Source: Sophon PFG-1: a monolithic-3D AI ASIC with 330 GB of on-die DRAM and no HBM
Domain: phantafield.com