Modal Auto Endpoints Let You Own the Inference Stack, Not Just the API

Modal's new Auto Endpoints expose the full serving stack - code, metrics, and engine tuning - while matching proprietary inference providers on performance using speculative decoding and DFlash drafters.

Tags: modal, auto endpoints, inference, sglang, dflash, speculative decoding

Most managed inference providers hand you a black-box API and ask you to trust them. Modal's new Auto Endpoints do the opposite: they hand you the entire serving stack as a Modal App you can see, fork, and tune. Every flag, every metric, and every engine patch is yours.

Code You Can See, Metrics You Can Actually Debug

Three decisions set Auto Endpoints apart from most inference-as-a-service offerings. First, Modal does not hide the code: GPU selection, regionalization, inference engine flags, even a "cutty engine patch" are all shared. Second, the endpoints surface the metrics that matter for debugging: speculative decoding acceptance length and per-replica, engine-side token latency quantiles. Third, there is no "talk to sales" barrier. You can deploy a frontier open model like GLM-5.2-FP8 with a single CLI command: modal endpoint create --name agent --model zai-org/GLM-5.2-FP8.
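To make the two debugging metrics concrete, here is a minimal Python sketch of how they could be computed from raw samples. The sample values, the function names, and the convention that acceptance length counts every token emitted per verification step (accepted draft tokens plus the one token the target model contributes) are assumptions for illustration. They are not taken from Modal's dashboards or code.

```python
# Hypothetical engine-side per-token latencies in milliseconds
# (illustrative numbers, not Modal measurements).
token_latencies_ms = [11.8, 12.1, 12.4, 12.9, 13.3, 14.0, 15.2, 18.7, 24.5, 41.0]

# Hypothetical tokens emitted per verification step. Assumed convention:
# accepted draft tokens plus the one token the target model always adds.
tokens_per_step = [4, 1, 3, 6, 2, 5, 1, 4]


def latency_quantiles(samples, points=(0.5, 0.9, 0.99)):
    """Return {quantile: value}, interpolating linearly between sorted samples."""
    ordered = sorted(samples)
    result = {}
    for q in points:
        pos = q * (len(ordered) - 1)        # fractional index into the sorted list
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        frac = pos - lo
        result[q] = ordered[lo] + (ordered[hi] - ordered[lo]) * frac
    return result


def mean_acceptance_length(steps):
    """Average number of tokens emitted per draft-and-verify step."""
    return sum(steps) / len(steps)


if __name__ == "__main__":
    for q, value in latency_quantiles(token_latencies_ms).items():
        print(f"p{int(q * 100)} token latency: {value:.2f} ms")
    print(f"mean acceptance length: {mean_acceptance_length(tokens_per_step):.2f}")
```

A falling mean acceptance length suggests the drafter is diverging from the target model, while a tail quantile that grows much faster than the median usually points at a single overloaded replica rather than a fleet-wide slowdown.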

The Infrastructure Behind Low-Latency Ownership

Modal built this on its existing AI infrastructure platform - the same one used for folding proteins, driving robots, and making music. The new piece is Modal Servers, a core component now out of beta that removes queueing and regionalizes routing by default. Modal cites 5 ms of overhead on HTTP requests without compromising autoscaling or reliability. There is no reserved GPU capacity and no capacity management to handle: you pay per use and scale to demand.

Winning on Performance Without Proprietary Lock-In

Inference engines are the new PostgreSQL: complex, mission-critical, full of knobs. Modal developed its deployment recipes in direct competition with proprietary providers, and it says it won by betting on open source. The team patches SGLang and FlashAttention-4 as needed and upstreams the improvements. The secret sauce is speculative decoding, in which a small drafter model proposes several tokens at once and the large target model verifies them, using the DFlash block-diffusion drafter architecture from Z Lab. Modal worked closely with Z Lab and the SGLang team to make DFlash fast and reliable in production, and even trained and released its own drafter models.
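The draft-and-verify idea behind speculative decoding can be sketched in a few lines of Python. This is a generic greedy toy, not DFlash and not SGLang's implementation: the "models" are plain functions over integer tokens, every name here is hypothetical, and a real engine verifies the whole block in one batched forward pass instead of calling the target once per position. The drafter is modeled as a function that returns a block of proposed tokens, however it produces them.

```python
from typing import Callable, List, Tuple

TargetFn = Callable[[List[int]], int]             # context -> next token (greedy)
DraftFn = Callable[[List[int], int], List[int]]   # context, k -> k proposed tokens


def speculative_decode(target: TargetFn, draft_block: DraftFn, prompt: List[int],
                       max_new_tokens: int, block_size: int = 4
                       ) -> Tuple[List[int], List[int]]:
    """Greedy draft-and-verify. Returns (tokens, accepted draft tokens per step)."""
    tokens = list(prompt)
    accepted_per_step = []
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        proposal = draft_block(tokens, block_size)
        accepted = 0
        for tok in proposal:
            expected = target(tokens)
            if tok != expected:
                tokens.append(expected)   # first mismatch: keep the target's token
                break
            tokens.append(tok)            # draft token matches the target: accept
            accepted += 1
        else:
            tokens.append(target(tokens))  # whole block accepted: one bonus token
        accepted_per_step.append(accepted)
    return tokens[:limit], accepted_per_step


def greedy_decode(target: TargetFn, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Reference decoding that calls only the target model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def toy_target(context: List[int]) -> int:
    """Toy target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: List[int], k: int) -> List[int]:
    """Toy drafter: agrees with the target except that it never predicts 5."""
    out, last = [], context[-1]
    for _ in range(k):
        nxt = (last + 1) % 10
        if nxt == 5:
            nxt = 0                       # deliberate error to force a rejection
        out.append(nxt)
        last = nxt
    return out


if __name__ == "__main__":
    out, accepted = speculative_decode(toy_target, toy_draft, [2], max_new_tokens=12)
    print(out, accepted)
```

The property to notice is that the output is identical to decoding with the target alone. The drafter only changes how many tokens each verification step yields, which is why acceptance length is the metric to watch when tuning a drafter.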

What This Unlocks Next

Auto Endpoints give you a starting deployment informed by Modal's work with teams like Cognition, Decagon, Fathom, and DoorDash. When you deploy, you see latency and throughput tradeoffs under load with a single click. The line between "hand-rolled inference" and "managed service" just got a lot thinner - and you keep the keys to the engine room.


Source: Modal Auto Endpoints: Optimized inference you own
Domain: modal.com
