
Manticore Search 27.1.5 Ships 14x Faster Embeddings via ONNX Runtime Backend

manticoresearch.com · @systems_wire · 2 hours ago · Machine Learning · 1 comment

By replacing SentenceTransformers/Candle with ONNX Runtime, Manticore boosts single-insert embedding speed from 5-11 docs/sec to 70-230 docs/sec, making auto-embeddings viable for production bulk ingest.

manticore search, onnx runtime, candle, sentence transformers, all minilm l12 v2, embedding models

On a modest 16-core server running all-MiniLM-L12-v2, the new ONNX Runtime backend in Manticore Search 27.1.5 raises embedding throughput from a ceiling of 11 docs/sec to a floor of 70 docs/sec, a 14× improvement when averaged across the concurrency and batch configurations we tested.

14× Faster on the Same Hardware

The average covers 1–32 client threads and batch sizes from 1 to 128. The old SentenceTransformers/Candle path never escaped 5–11 docs/sec; the new ONNX path stays in the 70–230 docs/sec band. Peak throughput hit 233 docs/sec with one thread and batch size 64. Single-insert latency dropped to ~14 ms for one client and ~56 ms under eight concurrent threads, well below the 200+ ms Candle was delivering. There are no user-facing API changes: any table pointing at an ONNX-capable model picks up the new path automatically.
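The latency and throughput figures above can be cross-checked with a little arithmetic. The sketch below assumes a closed-loop client model (each client sends one single-document insert and waits for it to finish before sending the next), which is our assumption and not something the source spells out. The function name is hypothetical.

```python
def throughput_docs_per_sec(latency_ms: float, clients: int = 1) -> float:
    """Estimate docs/sec when each client keeps exactly one
    single-document insert in flight (closed-loop assumption)."""
    return clients * 1000.0 / latency_ms


# Latency figures quoted in the article.
one_client = throughput_docs_per_sec(14)               # ~14 ms, one client
eight_clients = throughput_docs_per_sec(56, clients=8)  # ~56 ms, eight threads

print(f"1 client:  {one_client:.0f} docs/sec")
print(f"8 clients: {eight_clients:.0f} docs/sec")
```

Under that assumption, ~14 ms per insert works out to roughly 71 docs/sec for one client, which lines up with the reported 70 docs/sec floor, and ~56 ms across eight threads works out to roughly 143 docs/sec, inside the reported 70–230 band.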

What Made the Difference

We turned off intra_op_spinning and stopped batching documents inside the worker. Spinning makes sense when a single session runs inference continuously; for a database that embeds on every INSERT, busy-waiting between calls burns CPU for no gain. Disabling spinning was the single biggest win. ONNX Runtime itself contributes graph fusion, constant folding, and kernel autotuning: it is Microsoft’s hand-tuned C++ engine, already optimised for the encoder models (MiniLM, BGE, E5) that ship .onnx files on HuggingFace. We also enabled with_flush_to_zero to eliminate denormals in the attention softmax, and with_approximate_gelu for a ~10% activation speedup with no quality impact.
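For readers who want to try comparable settings themselves, here is a rough Python sketch. It is not Manticore's code. The flag names in the paragraph above look like session-builder options from a language binding; the sketch assumes they correspond to the ONNX Runtime session configuration keys shown below, and that mapping is our assumption. The helper names and the recording stand-in class are hypothetical and exist only so the example runs without onnxruntime installed.

```python
# ONNX Runtime session config keys (string key/value pairs).
# ASSUMPTION: these correspond to the flags named in the article.
ORT_CONFIG_ENTRIES = {
    "session.intra_op.allow_spinning": "0",         # no busy-waiting between calls
    "session.set_denormal_as_zero": "1",            # flush denormal floats to zero
    "optimization.enable_gelu_approximation": "1",  # approximate GELU activation
}


def apply_entries(session_options, entries=ORT_CONFIG_ENTRIES):
    """Apply each key/value pair via add_session_config_entry."""
    for key, value in entries.items():
        session_options.add_session_config_entry(key, value)
    return session_options


def build_session_options():
    """Build real SessionOptions; requires the onnxruntime package."""
    import onnxruntime as ort  # imported lazily so the sketch runs without it
    return apply_entries(ort.SessionOptions())


class _RecordingOptions:
    """Hypothetical stand-in that records entries instead of configuring a session."""

    def __init__(self):
        self.entries = {}

    def add_session_config_entry(self, key, value):
        self.entries[key] = value


recorded = apply_entries(_RecordingOptions()).entries
for key, value in recorded.items():
    print(f"{key} = {value}")
```

With onnxruntime installed, the options returned by build_session_options() could be passed to onnxruntime.InferenceSession(model_path, sess_options=...). Whether these settings reproduce the reported gains will depend on the model, the hardware, and the call pattern.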

Why This Matters for Real-Time Search

Auto-embeddings mean the database runs the model on every INSERT, so embedding speed is ingest speed. The old path left CPU idle no matter how you fed it; the new path lets you actually push the hardware. For teams that need real-time vector search on fresh data without a separate model service, 200+ docs/sec per node changes the deployment calculus. We’re not done: the same engineering pattern should apply to larger models and GPU paths. But for CPU-bound encoder inference, this is the floor, not the ceiling.
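To put "embedding speed is ingest speed" in concrete terms, the sketch below converts the reported rates into wall-clock time for a bulk load. The one-million-document corpus is a hypothetical size chosen for illustration, and the calculation assumes embedding is the only bottleneck on a single node.

```python
def ingest_hours(num_docs: int, docs_per_sec: float) -> float:
    """Hours needed to embed and insert num_docs at a steady rate."""
    return num_docs / docs_per_sec / 3600.0


CORPUS = 1_000_000  # hypothetical corpus size, not from the source

# Rates taken from the article's reported bands.
for label, rate in [("old ceiling", 11), ("new floor", 70), ("new band top", 230)]:
    print(f"{label:>12} ({rate:>3} docs/sec): {ingest_hours(CORPUS, rate):5.1f} h")
```

At these rates the same load takes about 25.3 hours on the old ceiling, about 4.0 hours on the new floor, and about 1.2 hours at the top of the new band.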


Source: 14× faster embeddings: how we rebuilt the ONNX path in Manticore
Domain: manticoresearch.com
