⚡

vLLM

Fastest open-source LLM inference engine — production-grade serving

Inference Apache 2.0 Self-hosted Advanced

What is vLLM?

vLLM is the industry-standard inference engine for serving open-source LLMs at scale. It uses PagedAttention to dramatically improve GPU memory efficiency, enabling higher throughput and lower latency than naive PyTorch serving. Used in production by dozens of companies to serve Llama, Mistral, and other open models.

Quick start

pip install vllm

python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.3-70B

Your local server now has an OpenAI-compatible API

Point any OpenAI SDK to http://localhost:8000

Use cases

→Production LLM serving

→High-throughput inference

→OpenAI-compatible API endpoint

→Multi-GPU deployment

Compatible models

Llama 3.3MistralDeepSeekQwenAny HuggingFace model

Why this matters for India

// india context

If you have a GPU server (even a single A100 from AWS), vLLM turns it into a powerful private LLM API.