Unified Redundant Arithmetic Eliminates Conditional Corrections in NTT Accelerators

Hardware for polynomial arithmetic in lattice-based post-quantum cryptography (PQC), specifically ML-KEM (CRYSTALS-Kyber) and ML-DSA (CRYSTALS-Dilithium), spends considerable time and area on modular-reduction correction steps and dedicated inverse-transform scaling units. Most hardware accelerators for the Number Theoretic Transform (NTT) and its inverse (INTT) rely on pipelined butterfly units that are slowed by conditional corrections in Montgomery multiplication and that need extra hardware for post-transform scaling.

Redundant representation kills the correction step

Standard Montgomery modular multiplication requires a conditional subtraction (or addition) to bring the result back into range before the next operation. The paper (arXiv 2607.00621) introduces a redundant number representation that makes that conditional correction unnecessary. The key insight: by allowing intermediate values to live in a range slightly larger than the canonical residues, both the Montgomery multiplication and a combined subtract-multiply operation can proceed without any comparison or branch. The same representation also absorbs the inverse-transform scaling, normally handled by a dedicated scaling unit, directly into the existing arithmetic datapath. No extra multipliers, no extra cycles.
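To make the idea concrete, here is a minimal Python sketch of correction-free Montgomery arithmetic. It uses the well-known lazy-reduction bound (choose R > 4q, keep operands in [0, 2q)), which is not necessarily the exact representation the paper uses. The ML-KEM modulus q = 3329 is standard; R = 2^16 and the function names are assumptions made for illustration.

```python
# Sketch only: lazy (redundant) Montgomery arithmetic with operands in [0, 2Q).
# Assumption: R = 2^16 > 4*Q, so the reduction output is < 2Q without any
# conditional subtraction. Not taken from the paper.

Q = 3329                             # ML-KEM modulus
R_BITS = 16                          # assumed Montgomery radix width
R = 1 << R_BITS
Q_NEG_INV = (-pow(Q, -1, R)) % R     # -Q^{-1} mod R (Python 3.8+)


def to_mont(x):
    """Map x to Montgomery form x*R mod Q (fully reduced, in [0, Q))."""
    return (x * R) % Q


def mont_reduce(t):
    """Return a value congruent to t * R^{-1} mod Q, with no correction step.

    For 0 <= t < 4*Q*Q the result is below 2Q.
    """
    m = (t * Q_NEG_INV) & (R - 1)    # makes t + m*Q divisible by R
    return (t + m * Q) >> R_BITS     # exact division by R


def mont_mul_redundant(a, b):
    """Multiply two redundant operands in [0, 2Q); output stays in [0, 2Q)."""
    return mont_reduce(a * b)


def mont_sub_mul(a, b, w_mont):
    """Fused (a - b) * w for a, b in [0, 2Q) and a twiddle w_mont in [0, Q).

    Adding 2Q keeps the difference positive without a comparison; the
    product is below 4*Q*Q, so the output is again in [0, 2Q).
    """
    return mont_reduce((a - b + 2 * Q) * w_mont)
```

Because every output lands back in [0, 2Q), results can feed the next butterfly directly; a single full reduction (for example `x % Q`) is only needed when leaving the transform.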

Hierarchical Montgomery multipliers that actually fit DSP slices

FPGA DSP resources are fixed-width and scarce. The authors design hierarchical Montgomery multipliers that decompose the modular multiplication into chunks that map efficiently onto vendor DSP primitives. This avoids the LUT-heavy wide multipliers that plague alternative implementations. The microarchitecture is a parallel iterative NTT/INTT datapath built from these unified butterfly units. Each unit fuses multiplication, redundant arithmetic, and scaling into a single pipeline stage.
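The decomposition idea can be sketched in Python as a wide multiplication built from narrow partial products, each small enough for one fixed-width multiplier. This summary does not describe the paper's actual hierarchy, so the 17-bit limb width and the function name `chunked_mul` are hypothetical choices for illustration only.

```python
# Sketch only: build a wide product from narrow partial products, the way a
# multiplier might be split across fixed-width DSP blocks.
# Assumption: 17-bit limbs; the paper's real decomposition may differ.

def chunked_mul(a, b, limb_bits=17):
    """Return a * b for non-negative ints, splitting b into limb_bits-wide limbs."""
    mask = (1 << limb_bits) - 1
    acc = 0
    shift = 0
    while b:
        limb = b & mask                  # next narrow operand slice
        acc += (a * limb) << shift       # one partial product, shifted into place
        b >>= limb_bits
        shift += limb_bits
    return acc
```

With the 23-bit ML-DSA modulus q = 8380417, for instance, `chunked_mul(x, y)` for x, y below q needs two partial products under this assumed limb width.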

FPGA results: higher frequencies, fewer cycles

The accelerator achieves higher clock frequencies than prior art (exact numbers are in the paper) with competitive LUT and DSP counts. Execution times drop accordingly. For RISC-V or embedded applications that need to run ML-KEM decapsulation or ML-DSA signing in microseconds on an FPGA, this is the difference between hitting the timing budget and failing it.

What comes next: expect this redundant-arithmetic approach to spread beyond the NTT. The same trick plausibly applies to other ring operations in lattice-based cryptography, and it could make its way into hardened SoC blocks for the post-quantum migration.

Source: High-Performance NTT Accelerators for PQC leveraging Unified Redundant Arithmetic and Fine-Tuned Microarchitecture
Domain: arxiv.org
