vLLM

Open-source LLM inference engine with PagedAttention and continuous batching.


Deployment, Open Source


Description

Short Intro: vLLM is an open-source LLM inference engine created in 2023 at UC Berkeley's Sky Computing Lab by Woosuk Kwon, Ion Stoica, and collaborators. The work was published at SOSP 2023, and the project is now maintained under the PyTorch Foundation, with 77,000+ GitHub stars and 2,000+ contributors. vLLM introduced PagedAttention, an algorithm that applies operating-system virtual memory concepts to KV cache management: the cache is stored in fixed-size blocks that do not need to be contiguous, which cuts the GPU memory lost to fragmentation and leaves room to serve more concurrent requests at high throughput. Meta, Google, Character.AI, Mistral AI, Cohere, and Roblox run vLLM in production. In January 2026 the founding team launched Inferact, a commercial entity backed by $150M from a16z, Lightspeed, and Sequoia at an $800M valuation, to build managed services on top of the Apache 2.0 project.
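To make the virtual memory analogy concrete, here is a toy sketch of paged KV cache bookkeeping. It is not vLLM's implementation: the names BlockAllocator, Sequence, and BLOCK_SIZE are hypothetical, the block size of 16 tokens is an arbitrary illustrative value, and no real tensors are stored. It only shows how a per-sequence block table can map logical token positions to non-contiguous physical blocks that are allocated on demand.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 16  # tokens per KV block (illustrative value, not a vLLM default claim)


class BlockAllocator:
    """Hands out physical KV block ids from a fixed pool (hypothetical helper)."""

    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("no free KV blocks")
        return self.free.pop()

    def release(self, blocks):
        # Return a finished sequence's blocks to the pool for reuse.
        self.free.extend(blocks)


@dataclass
class Sequence:
    """Tracks one request's logical-to-physical block mapping."""

    block_table: list = field(default_factory=list)
    num_tokens: int = 0

    def append_token(self, allocator):
        # A new physical block is needed only when the last one is full,
        # so at most BLOCK_SIZE - 1 slots per sequence sit unused.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, token_index):
        # Translate a logical token position into (physical block, offset),
        # the same way a page table translates a virtual address.
        block = self.block_table[token_index // BLOCK_SIZE]
        return block, token_index % BLOCK_SIZE
```

In this sketch, a sequence that has generated 20 tokens holds exactly two blocks, and releasing them makes the same physical blocks available to any other request, regardless of where they sit in the pool.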

Key Capabilities:

PagedAttention KV cache management eliminating GPU memory fragmentation
Continuous batching for packing concurrent requests into single GPU passes
OpenAI-compatible API server deployable with a single command
Tensor parallelism and pipeline parallelism for multi-GPU serving
Speculative decoding using draft models to predict multiple tokens per pass
Chunked prefill for balancing long-prompt latency against decode throughput
Prefix caching for shared system prompts and few-shot examples
Disaggregated serving separating prefill and decode onto different hardware
Quantization support for reduced GPU memory footprint
Multi-hardware support across NVIDIA, AMD, Google TPU, Intel, and AWS Neuron
vLLM-Omni multimodal extension for image, video, audio, and text-to-speech workloads
Community integrations for Kubernetes deployment, semantic routing, model quantization, and speculative decoding
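The continuous batching item above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not vLLM's scheduler: the function name continuous_batching and the (request_id, tokens_to_generate) input format are hypothetical, every decode step is assumed to produce one token per running request, and prefill cost and memory limits are ignored. It shows the core idea that a finished request frees its slot immediately, so a waiting request can join the very next step instead of waiting for the whole batch to drain.

```python
from collections import deque


def continuous_batching(requests, max_batch_size):
    """Simulate step-level scheduling.

    requests: list of (request_id, tokens_to_generate) pairs.
    Returns a list with one entry per decode step: the sorted ids
    of the requests that shared that step.
    """
    waiting = deque(requests)
    running = {}  # request_id -> tokens still to generate
    steps = []
    while waiting or running:
        # Admit waiting requests into any slots freed by the previous step.
        while waiting and len(running) < max_batch_size:
            request_id, remaining = waiting.popleft()
            running[request_id] = remaining
        steps.append(sorted(running))
        # One decode step: each running request emits one token.
        for request_id in list(running):
            running[request_id] -= 1
            if running[request_id] <= 0:
                del running[request_id]
    return steps
```

For example, with requests a (3 tokens), b (1 token), and c (2 tokens) and a batch size of 2, c takes over b's slot on the second step and all three requests finish in 3 steps. A static batch of a and b followed by c alone would need 5 steps under the same assumptions.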


Alternative tools

Dokku: Self-hosted platform-as-a-service on your own server
Heroku: Managed platform for deploying apps with git push
Porter: Platform-as-a-service that runs in your own cloud account
Kamal: Deploy containerized apps to your own servers
Coolify: Self-hosted deployment platform for any server
Netlify: Git-driven platform for deploying modern web frontends
