Zoho Labs Drops Model Training for Inference Engineering

Open-weight models supporting 90 language pairs, released for free in 2023, vaporized five years of Zoho Labs' work building just 15 language pairs.

Ramprakash Ramamoorthy, Director of AI Research at Zoho Corp, laid this out bluntly at DevSparks 2026 in Bengaluru. The lab's translation project ran from 2018 to 2023. Then Llama, Mistral, and friends arrived. The team's central question shifted overnight: what does an in-house AI lab do when anyone can download a better model for free?

Five Years of Work Overtaken in Months

Zoho Labs started in 2011 as a centralized fix for recurring engineering problems across Zoho's 100-plus products. Its AI work expanded into machine learning, computer vision, document processing, and language tools. By 2023, open-weight models had made most of that custom model building uneconomical.

Zoho ran three parallel experiments: Zoho AI Bridge (connecting customers to third-party or self-hosted open-weight models), a small in-house model for basic email and document summaries, and what became the lab's primary direction — inference engineering. The third won.

The 101% Project: Squeezing Transformers

Before committing, the team explored alternatives to the transformer architecture — RWKV, Mamba, Zamba — each promising better efficiency per dollar. But the transformer ecosystem improved faster than any challenger could catch up. So Zoho went all-in on making the transformers already in production run as efficiently as possible.

Ramamoorthy called it the 101% project. With roughly six billion API calls per month hitting Zoho's AI systems and a constrained GPU budget, the math was simple: every efficiency gain multiplied across billions of requests.
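To make that multiplication concrete, here is a rough back-of-the-envelope sketch. Only the six billion calls per month comes from the talk; the GPU time per call and the size of the gain are hypothetical placeholders, not Zoho figures.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumes a 30-day month


def avg_calls_per_second(calls_per_month: float) -> float:
    """Average request rate implied by a monthly call volume."""
    return calls_per_month / SECONDS_PER_MONTH


def gpu_hours_saved(calls_per_month: float,
                    gpu_seconds_per_call: float,
                    efficiency_gain: float) -> float:
    """GPU-hours freed per month by a fractional efficiency gain.

    gpu_seconds_per_call and efficiency_gain are illustrative inputs,
    not numbers reported by Zoho.
    """
    total_gpu_seconds = calls_per_month * gpu_seconds_per_call
    return total_gpu_seconds * efficiency_gain / 3600


calls = 6e9  # figure quoted in the talk
rate = avg_calls_per_second(calls)
# Hypothetical: 0.05 GPU-seconds per call, 1% efficiency gain.
saved = gpu_hours_saved(calls, gpu_seconds_per_call=0.05, efficiency_gain=0.01)
print(f"{rate:.0f} calls/s on average, {saved:.0f} GPU-hours saved per month")
```

Even under these made-up inputs, a one-percent gain frees hundreds of GPU-hours a month, which is the logic behind chasing small wins.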

Techniques That Work at Scale

Quantization came first. Zoho's version compresses model weights selectively: identify the critical weights, leave them untouched, compress the rest. Speed goes up, accuracy barely budges.
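A minimal sketch of the idea, not Zoho's implementation: the talk did not say how critical weights are identified, so this example assumes the largest-magnitude weights are the critical ones. The function name, `keep_fraction`, and `bits` are all hypothetical.

```python
def quantize_selective(weights, keep_fraction=0.1, bits=8):
    """Keep the largest-magnitude weights exact; round the rest to a
    symmetric integer grid and map them back to floats.

    Assumption: "critical" means largest by magnitude. Real systems
    may use activation statistics or other criteria.
    """
    if not weights:
        return []
    n_keep = max(1, int(len(weights) * keep_fraction))
    by_magnitude = sorted(range(len(weights)),
                          key=lambda i: abs(weights[i]), reverse=True)
    critical = set(by_magnitude[:n_keep])

    # Scale is set by the non-critical weights only, so outliers
    # no longer stretch the quantization grid.
    rest_max = max((abs(w) for i, w in enumerate(weights)
                    if i not in critical), default=0.0)
    levels = 2 ** (bits - 1) - 1
    scale = rest_max / levels if rest_max > 0 else 1.0

    out = []
    for i, w in enumerate(weights):
        if i in critical:
            out.append(w)                     # untouched
        else:
            out.append(round(w / scale) * scale)  # quantize, dequantize
    return out


weights = [0.01, -0.02, 5.0, 0.03, -0.015, 0.005, 0.02, -0.01, 0.0, 0.025]
compressed = quantize_selective(weights, keep_fraction=0.1, bits=4)
worst = max(abs(a - b) for a, b in zip(weights, compressed))
print(f"outlier preserved: {compressed[2]}, worst-case error: {worst:.5f}")
```

Leaving the outlier alone keeps the grid fine enough for the small weights, which is why accuracy can hold up even at low bit widths.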

KV cache management works like a smart short-term memory: keep frequently accessed context, flush what's rarely touched. Continuous batching groups multiple incoming requests together rather than processing them one at a time, improving throughput on the same hardware.
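Both ideas can be sketched in a few lines. This is a toy model under stated assumptions, not Zoho's code: the cache uses least-recently-used eviction as a stand-in for whatever policy the lab runs, and the batching simulation only counts decode steps, treating each request as a number of tokens still to generate. All names are hypothetical.

```python
from collections import OrderedDict, deque


class LRUPrefixCache:
    """Toy cache: keeps recently used entries, evicts the stalest one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key]

    def put(self, key, value):
        self._entries[key] = value
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # drop least recently used


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))


def continuous_batching_steps(lengths, batch_size):
    """Freed slots are refilled from the queue after every decode step.

    Assumes every request needs at least one token.
    """
    queue, active, steps = deque(lengths), [], 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1  # one decode step emits one token per active request
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [8, 1, 1, 1, 1, 1]  # tokens to generate per request
print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

In the static case the short requests wait behind the long one; with continuous batching they slip into the freed slot, so the same hardware finishes the queue in fewer steps.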

Speculative decoding uses a small, cheap model to draft several tokens ahead, then passes the draft to a larger model, which verifies it in a single pass and keeps the tokens it agrees with. You get the quality of the big model while spending far fewer of its slow, sequential decoding steps. "Even my engineers do it," Ramamoorthy said, "they write the code using Sonnet and then use Opus to debug it."
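The draft-then-verify loop can be sketched with toy models. This is an illustrative greedy variant, not Zoho's system: both "models" are hypothetical functions that map a token list to the next token, and a real engine would score all drafted positions in one batched forward pass rather than calling the target in a loop.

```python
def greedy_decode(next_token, prompt, max_new):
    """Baseline: one target-model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(next_token(out))
    return out


def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding; returns (tokens, verification rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        # 1. The cheap model drafts k tokens ahead.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2. The target checks the draft (one parallel pass in practice).
        rounds += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace first wrong token, stop
                break
        else:
            # Whole draft accepted: the same pass yields one extra token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[:len(prompt) + max_new], rounds


# Hypothetical toy models: the target counts upward mod 10;
# the draft agrees except that it stumbles after a 4.
def target_next(seq):
    return (seq[-1] + 1) % 10


def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


tokens, rounds = speculative_decode(target_next, draft_next, [0], max_new=8)
print(tokens, rounds)
```

Because every accepted token is one the target would have produced anyway, the output is identical to plain greedy decoding with the large model; the saving comes from needing fewer sequential target rounds.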

Zoho's approach is pragmatic, not theoretical. For a bootstrapped company that can't splurge on GPUs, the opportunity in AI is no longer about building new models. It's about running existing ones so efficiently that open-weight models become a cost advantage rather than a threat.

Inference engineering is the new training.

Source: How Zoho Labs pivoted to inference engineering
Domain: yourstory.com
