How to Run Qwen 3.5 with vLLM: Setup Guide

A complete guide to running Qwen 3.5 models with vLLM for high-throughput inference. Covers installation, serving, model variants, and performance tuning for vllm qwen3.5 deployments.

How to Run Qwen 3.5 with vLLM

If you are looking for vllm qwen3.5, you probably already have a deployment scenario in mind. Maybe you need to serve Qwen 3.5 to multiple users at once, or you want the fastest possible token throughput on your GPU cluster. vLLM is one of the most practical ways to get there.

This guide covers what vLLM is, why it pairs well with Qwen 3.5, and how to get a working setup from zero to serving. If you just want to try Qwen 3.5 without any setup at all, you can try Qwen 3.5 free in the browser first and come back when you are ready to self-host.

What is vLLM

vLLM is an open-source library for fast LLM inference and serving. It was originally developed at UC Berkeley and has become one of the go-to engines for production-grade model serving. The key innovation behind vLLM is PagedAttention, a memory management technique that significantly reduces GPU memory waste during inference.

In practical terms, vLLM lets you:

  • Serve LLMs with high throughput and low latency
  • Handle multiple concurrent requests efficiently
  • Expose an OpenAI-compatible API endpoint out of the box
  • Support a wide range of model architectures, including the Qwen family

For Qwen 3.5 specifically, vLLM supports both the dense models (like Qwen3.5-7B and Qwen3.5-32B) and the MoE variants, making it a versatile choice across the entire model lineup.
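
Beyond the HTTP server covered below, vLLM also exposes an offline Python API that loads a checkpoint directly. A minimal sketch, assuming a CUDA GPU with enough VRAM and that the Qwen/Qwen3.5-7B-Instruct model ID is available on Hugging Face:

```
# Offline (non-server) inference with vLLM's Python API.
# Requires a CUDA GPU and enough VRAM for the 7B model in fp16.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-7B-Instruct")  # downloads on first run
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain PagedAttention in one sentence."], params)
print(outputs[0].outputs[0].text)
```

This path is handy for batch jobs and notebooks; for multi-user serving, the API server below is the better fit.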

Why use vLLM with Qwen 3.5

There are several local inference options for Qwen 3.5: Ollama, llama.cpp, Hugging Face Transformers, and others. vLLM stands out when your requirements lean toward:

  • High concurrency: If you need to serve multiple users or batch many requests, vLLM handles this far better than naive single-request inference.
  • Production serving: vLLM provides a proper HTTP server with an OpenAI-compatible API, so your existing tooling can point at it with minimal changes.
  • GPU efficiency: PagedAttention means you waste less VRAM on KV cache, which either lets you serve longer contexts or fit larger models on the same hardware.
  • Throughput over simplicity: If maximum tokens per second matters more than a five-minute setup, vLLM is the right tool.

If your use case is more about quick local experimentation, you might prefer running Qwen 3.5 with Ollama instead.

Installation

vLLM requires a recent Python (3.9 or newer on current releases) and a CUDA-capable NVIDIA GPU. The installation is straightforward with pip:

pip install vllm

For the latest features and Qwen 3.5 support, make sure you are on the newest release from PyPI:

pip install vllm --upgrade

Make sure your CUDA drivers are up to date. vLLM works best with CUDA 12.x and recent NVIDIA drivers. You can verify your setup with:

python -c "import vllm; print(vllm.__version__)"

Serving Qwen 3.5 with vLLM

The quickest way to get Qwen 3.5 running is through the vLLM OpenAI-compatible server:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-7B \
  --host 0.0.0.0 \
  --port 8000

This downloads the model from Hugging Face and starts an API server. Recent vLLM releases also provide the shorter vllm serve command as an equivalent entry point. You can then query the server like any OpenAI-compatible endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-7B",
    "messages": [{"role": "user", "content": "Explain PagedAttention in one paragraph."}],
    "max_tokens": 256
  }'
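
If you prefer to drive the endpoint from Python rather than curl, the request body is plain JSON. A small sketch using only the standard library, mirroring the model ID and prompt from the curl example (POST the result to http://localhost:8000/v1/chat/completions):

```python
import json

def chat_body(model: str, user_message: str, max_tokens: int = 256) -> str:
    """Build an OpenAI-style chat-completions request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    })

body = chat_body("Qwen/Qwen3.5-7B", "Explain PagedAttention in one paragraph.")
print(body)
```

Any OpenAI-compatible client library pointed at base_url http://localhost:8000/v1 will produce equivalent requests, so existing tooling usually needs no changes beyond the base URL.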

For the instruction-tuned chat variants, use the corresponding model IDs:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-7B-Instruct \
  --host 0.0.0.0 \
  --port 8000

Supported Qwen 3.5 model variants

vLLM supports the full range of Qwen 3.5 models. Here are the most commonly used ones:

Model                   Parameters   Use case
Qwen3.5-7B              7B           Lightweight, fast inference
Qwen3.5-7B-Instruct     7B           Chat and instruction following
Qwen3.5-32B             32B          Stronger reasoning, more VRAM
Qwen3.5-32B-Instruct    32B          Production chat deployments
Qwen3.5-MoE-A3B         MoE          Efficient large-scale serving

For MoE models, vLLM handles the expert routing automatically. You do not need to configure anything special beyond the model name.

Performance tuning tips

Once you have a basic setup running, there are several ways to squeeze more performance out of vLLM with Qwen 3.5:

Tensor parallelism for multi-GPU setups:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-32B-Instruct \
  --tensor-parallel-size 2
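
A quick back-of-envelope check of why two GPUs help here: in fp16, weights cost about 2 bytes per parameter, and tensor parallelism splits them roughly evenly across GPUs. The numbers below are illustrative only; KV cache, activations, and runtime overhead come on top:

```python
# Approximate per-GPU weight memory for a 32B-parameter model in fp16
# under tensor parallelism. Weights only; ignores KV cache and overhead.
params = 32e9          # 32B parameters
bytes_per_param = 2    # fp16
tp_size = 2            # --tensor-parallel-size 2

weights_gb_per_gpu = params * bytes_per_param / tp_size / 1e9
print(weights_gb_per_gpu)  # 32.0 GB of weights per GPU
```

So a 32B model that cannot fit on a single 48 GB card becomes feasible across two of them, with headroom left for the KV cache.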

Adjust max model length if you do not need the full context window:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-7B-Instruct \
  --max-model-len 4096
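
Lowering --max-model-len helps because the KV cache grows linearly with context length. A rough sketch of the per-token cost, using hypothetical architecture numbers (32 layers, 8 KV heads, head dim 128 are placeholders, not the published Qwen 3.5 config):

```python
# Worst-case KV cache memory per sequence, linear in context length.
# The layer/head/dim values below are hypothetical placeholders.
layers, kv_heads, head_dim = 32, 8, 128
bytes_per_elem = 2  # fp16

# K and V each store layers * kv_heads * head_dim values per token.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem

per_seq_mib = kv_bytes_per_token * 4096 / 2**20  # at --max-model-len 4096
print(kv_bytes_per_token, per_seq_mib)  # 131072 bytes/token, 512.0 MiB
```

With these placeholder numbers, capping the context at 4096 tokens bounds each sequence's KV cache at half a GiB, which is VRAM that PagedAttention can instead spend on more concurrent requests.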

Enable quantization to reduce memory usage:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-7B-Instruct \
  --quantization awq

Increase GPU memory utilization if you have headroom:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3.5-7B-Instruct \
  --gpu-memory-utilization 0.95

When to use vLLM vs. the hosted chat

vLLM shines when you have specific infrastructure requirements: multi-user serving, API integration into existing systems, or the need to keep all data on your own machines. But it comes with real overhead in setup and maintenance.

If you are still evaluating which Qwen 3.5 model fits your needs, start with the browser. You can try Qwen 3.5 free to test different model sizes and behaviors before committing to a self-hosted deployment. Once you know the right model and the traffic patterns justify it, vLLM becomes one of the best ways to serve it.

Quick FAQ

Does vLLM support Qwen 3.5 MoE models?

Yes. vLLM supports MoE architectures including Qwen 3.5 MoE variants. The expert routing is handled internally.

How much VRAM do I need?

For the 7B model in fp16, expect around 14-16 GB. The 32B model needs roughly 64 GB or more. Quantized variants reduce this significantly.
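
Those estimates follow from simple arithmetic on weight precision. A sketch covering weights only (the KV cache and runtime overhead account for the gap up to the quoted figures):

```python
# Weight memory for a 7B model at different precisions (weights only).
params = 7e9

fp16_gb = params * 2 / 1e9     # 2 bytes per parameter
int4_gb = params * 0.5 / 1e9   # ~0.5 bytes per parameter (e.g. AWQ 4-bit)

print(fp16_gb, int4_gb)  # 14.0 and 3.5
```

The same arithmetic scales linearly: a 32B model lands around 64 GB in fp16, matching the figure above.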

Can I use vLLM with the GGUF format?

vLLM primarily works with Hugging Face model formats. For GGUF files, consider using llama.cpp instead.

Is vLLM faster than Hugging Face Transformers?

For serving workloads with concurrent requests, yes, often significantly. For single-request inference in a notebook, the difference is smaller.

Q-Chat Team
