Source linked

One Command Spins Up a Private vLLM Server on HF Jobs

huggingface.co@keen_eagle1 hour ago·Developer Tools·2 comments

No Kubernetes, no provisioning: a single command launches an OpenAI-compatible endpoint on Hugging Face's pay-per-second infrastructure, gated by your HF token.

huggingfacevllmhuggingface jobsopenai compatible apillm inference

Hugging Face just made it trivial to get a private, OpenAI-compatible LLM endpoint running on their infrastructure in one command - no Kubernetes, no cloud console, no provisioning dance.

I ran hf jobs run --flavor a10g-large --expose 8000 -- timeout 2h vllm/vllm-openai:latest vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000 and within a couple minutes I had a live endpoint serving at https://<job-id>--8000.hf.jobs. The whole thing is gated by your HF token: every request needs a Bearer token with read access to the job's namespace. That's it.

Per-Second Billing and a Single Command

HF Jobs works like docker run for Hugging Face hardware. You pick a GPU flavor - a10g-large runs $1.50/hour - expose the container port, and the service proxies it over a public URL. The --timeout flag auto-stops the job after a set period, but you can cancel early with hf jobs cancel <job-id> to avoid burning credits. Billing is per-second.

The command line is refreshingly direct:

hf jobs run --flavor a10g-large --expose 8000 -- timeout 2h \
 vllm/vllm-openai:latest \
 vllm serve Qwen/Qwen3-4B --host 0.0.0.0 --port 8000

After Application startup complete in the logs, you can curl the /v1/chat/completions endpoint with your HF token, or point the OpenAI Python client at it. The response is standard OpenAI-style JSON.

Scales to Multi-GPU Models Without Changing the Flow

The same pattern works for larger models. For Qwen3.5-122B-A10B (a 122B MoE model), you swap the flavor to h200x2 and add --tensor-parallel-size 2. Hugging Face's own example shows:

hf jobs run --flavor h200x2 --expose 8000 -- timeout 2h \
 vllm/vllm-openai:latest \
 vllm serve Qwen/Qwen3.5-122B-A10B \
 --host 0.0.0.0 --port 8000 --tensor-parallel-size 2 \
 --max-model-len 32768 --max-num-seqs 256

The --max-model-len and --max-num-seqs flags cap memory usage for models with huge default contexts (that one has 256K tokens). If you get OOM errors, dial those down first. The URL and auth work identically regardless of GPU count.

Private by Default, No Public Exposure

Nobody can hit your endpoint without a valid HF token scoped to the job's namespace. A plain browser visit gets rejected. That's fine for individual testing, evals, or batch inference. If you need public access or fine-grained auth, Hugging Face points you to a proper gateway or their Inference Endpoints service. This is squarely for the ad-hoc, dev-cycle use case.

A simple Gradio UI can be wired up in a few lines to chat with the model via the same endpoint, passing --reasoning-parser deepseek_r1 to get separate thinking fields from Qwen3. For quick experiments, this removes almost all friction from standing up a private LLM endpoint.


Source: Run a vLLM Server on HF Jobs in One Command
Domain: huggingface.co

Read original source ->

External source stays available while the OJO article and comment thread stay local.

Comments load interactively on the live page.