Run Qwen 3.5 Locally: Complete Setup Guide

Everything you need to run Qwen 3.5 locally: hardware requirements by model size, setup with Ollama, vLLM, llama.cpp, and Transformers, plus performance optimization tips.

Running a language model on your own hardware gives you full control over privacy, latency, and cost. If you have been searching for "qwen 3.5 local" or "qwen 3.5 download", this guide covers everything you need to get Qwen 3.5 running on your machine.

We will cover hardware requirements, the main setup options, the easiest path to get started, and tips for getting the best performance. If you want to test Qwen 3.5 before setting up locally, you can try Qwen 3.5 for free in the browser first.

Hardware Requirements by Model Size

The single biggest factor in local deployment is whether the model fits in your available memory. Here is a practical guide:

Qwen3.5-1.5B

  • RAM/VRAM needed: ~2GB quantized, ~3GB full precision
  • Runs on: Almost any modern machine, including laptops
  • Good for: Quick experiments, simple tasks, testing pipelines

Qwen3.5-4B

  • RAM/VRAM needed: ~3GB quantized, ~8GB full precision
  • Runs on: Most machines with a dedicated GPU or 16GB+ system RAM
  • Good for: Everyday tasks, coding assistance, chat

Qwen3.5-7B

  • RAM/VRAM needed: ~5GB quantized (Q4), ~14GB full precision
  • Runs on: Machines with a mid-range GPU (RTX 3060 12GB+) or 16GB+ system RAM for CPU inference
  • Good for: General-purpose use, strong coding, reasoning tasks

Qwen3.5-14B

  • RAM/VRAM needed: ~9GB quantized, ~28GB full precision
  • Runs on: Machines with a higher-end GPU (RTX 3090, 4080+) or 32GB+ system RAM
  • Good for: More complex tasks, better instruction following, longer outputs

Qwen3.5-27B and Larger

  • RAM/VRAM needed: ~16GB+ quantized, ~54GB+ full precision
  • Runs on: High-end workstations, multi-GPU setups, or machines with 64GB+ system RAM
  • Good for: Best quality, complex reasoning, professional use cases

The quantized sizes assume Q4_K_M quantization, which is the most common balance of quality and size. CPU inference is always an option but will be significantly slower than GPU inference.
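The figures above follow from a simple rule of thumb: memory for the weights is roughly parameter count times bits per weight. A minimal sketch (the ~4.5 bits/weight average for Q4_K_M is an approximation, and real deployments add KV cache and runtime overhead on top of the weights):

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Q4_K_M averages roughly 4.5 bits per weight; full precision (FP16) is 16
print(f"7B @ Q4_K_M: ~{estimate_weights_gb(7, 4.5):.1f} GB")  # ~3.9 GB, before overhead
print(f"7B @ FP16:   ~{estimate_weights_gb(7, 16):.1f} GB")   # 14.0 GB, matching the table
```

This is why the quantized figures above are a bit larger than the raw weight size: the extra gigabyte or so covers the KV cache and runtime buffers.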

Setup Option 1: Ollama (Easiest)

Ollama is the simplest way to run Qwen 3.5 locally. It handles model downloading, quantization, and serving in one tool.

Install Ollama

# macOS or Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or download from https://ollama.com for macOS/Windows

Download and Run Qwen 3.5

# Pull and start the 7B model
ollama run qwen3.5:7b

# Or a smaller variant
ollama run qwen3.5:1.5b

# Or a larger one
ollama run qwen3.5:14b

That is it. Ollama downloads the model, sets up the runtime, and drops you into an interactive chat. You can also use the API:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:7b",
  "messages": [{"role": "user", "content": "Hello"}]
}'

Ollama is the best starting point if you want to get running in under five minutes.
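If you would rather call that endpoint from Python than curl, here is a minimal sketch using only the standard library, assuming Ollama is running on its default port (11434):

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str = "qwen3.5:7b") -> dict:
    """Build the JSON payload for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # one JSON object back instead of a token stream
    }


def ollama_chat(prompt: str, model: str = "qwen3.5:7b",
                host: str = "http://localhost:11434") -> str:
    """Send a single chat turn to a local Ollama server and return the reply."""
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=json.dumps(build_chat_request(prompt, model)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]["content"]


# print(ollama_chat("Hello"))  # requires a running Ollama server
```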

Setup Option 2: llama.cpp (Most Flexible)

llama.cpp gives you more control over quantization, context length, and inference parameters. It runs on CPU and GPU, and supports GGUF model files.

Install llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j
# binaries are placed in build/bin/

(llama.cpp has moved from Makefiles to CMake; if you have an older checkout, update it first.) For GPU acceleration, configure with CUDA or Metal support before building:

# CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON

# Metal (Apple Silicon, enabled by default in recent builds)
cmake -B build -DGGML_METAL=ON

Download a GGUF Model

Find Qwen 3.5 GGUF files on Hugging Face. Look for repositories with "GGUF" in the name. Download the quantization level that fits your hardware:

  • Q4_K_M: Good balance of quality and size (recommended default)
  • Q5_K_M: Slightly better quality, slightly larger
  • Q8_0: Near full precision quality, requires more memory
  • Q2_K: Smallest size, noticeable quality reduction
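One convenient way to fetch a single file is the huggingface-cli tool. The repository and file names below are hypothetical placeholders; substitute the actual GGUF repository you found:

```shell
pip install -U "huggingface_hub[cli]"

# Download just one quantization, not the whole repository
huggingface-cli download some-org/Qwen3.5-7B-Instruct-GGUF \
  qwen3.5-7b-instruct-q4_k_m.gguf --local-dir .
```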

Run Inference

./build/bin/llama-cli -m qwen3.5-7b-q4_k_m.gguf \
  -p "Write a Python function to sort a list" \
  -n 512 \
  --ctx-size 4096

For an interactive chat session:

./build/bin/llama-cli -m qwen3.5-7b-q4_k_m.gguf \
  --interactive \
  --ctx-size 4096

Setup Option 3: Hugging Face Transformers (Best for Python)

If you prefer working in Python and want full control over the model, Transformers is the standard choice.

Install Dependencies

pip install transformers torch accelerate

Load and Run

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

messages = [{"role": "user", "content": "Explain quicksort in simple terms"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This approach gives you the most flexibility for integration into Python applications, pipelines, and notebooks.

Setup Option 4: vLLM (Best for Serving)

If you need to serve Qwen 3.5 as an API for multiple users or applications, vLLM is the most efficient option.

Install and Run

pip install vllm

# Start the server
vllm serve Qwen/Qwen3.5-7B-Instruct

vLLM provides an OpenAI-compatible API out of the box, so you can use it as a drop-in replacement in applications that support the OpenAI API format.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
  "model": "Qwen/Qwen3.5-7B-Instruct",
  "messages": [{"role": "user", "content": "Hello"}]
}'

Performance Tips

Use the Right Quantization

For most local use cases, Q4_K_M quantization provides the best balance. The quality difference compared to full precision is small for most tasks, and the memory savings are substantial.

Set Context Length Appropriately

Do not set the context window larger than you need. Larger context windows use more memory. If your typical conversations are under 4K tokens, there is no reason to allocate 32K.
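For example, Ollama accepts a per-request num_ctx option and vLLM takes a cap at server start (option names as of current releases; check your version's documentation):

```shell
# Ollama: cap the context for one request via the num_ctx option
curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:7b",
  "messages": [{"role": "user", "content": "Hello"}],
  "options": {"num_ctx": 4096}
}'

# vLLM: cap the maximum sequence length when starting the server
vllm serve Qwen/Qwen3.5-7B-Instruct --max-model-len 4096
```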

GPU Offloading

If your model does not fully fit in GPU memory, most tools support partial GPU offloading. Load as many layers as possible on the GPU and let the rest run on CPU. This is slower than full GPU inference but much faster than pure CPU.
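With llama.cpp this is the --n-gpu-layers (-ngl) flag, while Ollama splits layers automatically. A sketch, assuming the llama-cli binary is on your PATH; tune the layer count to your VRAM:

```shell
# Offload 24 transformer layers to the GPU; the remaining layers run on CPU
llama-cli -m qwen3.5-7b-q4_k_m.gguf -ngl 24 \
  -p "Write a Python function to sort a list"
```

Watch the startup log: if the GPU runs out of memory, reduce the layer count and try again.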

Batch Size

If you are running inference for multiple requests, tools like vLLM can batch them together for much better throughput than processing one at a time.

Apple Silicon Considerations

On Macs with M-series chips, the unified memory architecture means system RAM and GPU memory are shared. An M1 Max with 64GB can run models that would normally require a dedicated GPU with similar VRAM. Use llama.cpp with Metal support or Ollama for the best experience on Apple Silicon.

Which Path Should You Choose?

  • Just want to chat locally: Use Ollama
  • Want maximum control: Use llama.cpp
  • Building a Python application: Use Transformers
  • Serving an API: Use vLLM

If you are not sure yet which Qwen 3.5 model size is right for your tasks, the fastest way to find out is to try Qwen 3.5 for free in the browser. Test your actual prompts, find the right model size, then bring it local.

FAQ

Can I run Qwen 3.5 without a GPU?

Yes. All of the tools above support CPU-only inference. It will be slower, but it works. For the 7B model with Q4 quantization, expect around 5-10 tokens per second on a modern CPU with 16GB+ RAM.

Which model size gives the best balance?

The 7B variant is the sweet spot for most people. It fits on a single consumer GPU, runs at reasonable speed, and handles most tasks well. Go smaller if you need speed, go larger if quality is the priority.

How much disk space do I need?

A Q4_K_M quantized 7B model is roughly 4-5GB. The 14B model is roughly 8-9GB. Plan for download time and temporary storage during model loading as well.

Can I run multiple models at once?

Yes, if you have enough memory. Ollama makes it particularly easy to switch between models. Each loaded model consumes memory proportional to its size.

Q-Chat Team
