Fine-Tune Qwen 3.5 with Unsloth: Step-by-Step Guide

A practical guide to fine-tuning Qwen 3.5 with Unsloth, covering installation, LoRA and QLoRA setup, training configuration, and exporting your fine-tuned model.

Fine-tuning is the fastest way to make a general-purpose model work better on your specific tasks. If you have been searching for how to fine-tune Qwen 3.5 with Unsloth, you are in the right place. Unsloth is one of the most efficient tools for fine-tuning open-weight models, and it works well with Qwen 3.5.

This guide walks through the entire process: what Unsloth is, why it pairs well with Qwen 3.5, and a step-by-step setup from installation to exporting your trained model. If you want to test Qwen 3.5 before investing time in fine-tuning, you can Try Qwen 3.5 free first.

What Is Unsloth

Unsloth is an open-source library that makes fine-tuning large language models significantly faster and more memory-efficient. It achieves this through custom CUDA kernels and optimized implementations of key training operations.

The practical benefits are straightforward:

  • 2x faster training compared to standard Hugging Face training loops
  • Up to 60% less memory usage, which means you can fine-tune larger models on the same GPU
  • No accuracy loss from the optimizations
  • Easy integration with the Hugging Face ecosystem, including Transformers, TRL, and PEFT

For Qwen 3.5 specifically, Unsloth supports the model architecture out of the box, so you do not need custom patches or workarounds.

Why Use Unsloth for Qwen 3.5

The main reason is efficiency. Qwen 3.5 models range from 1.5B parameters up to 14B and larger, and fine-tuning even the smaller variants benefits from reduced memory usage and faster iteration.

If you are working with a single consumer GPU (like an RTX 3090 or 4090), Unsloth can be the difference between being able to fine-tune a 7B-class model or not. Without it, you might need to rent cloud GPUs or settle for a smaller model size than you wanted.

Unsloth also simplifies the LoRA and QLoRA workflow, which is the most common approach for fine-tuning Qwen 3.5 without retraining all the model weights.

Prerequisites

Before starting, you need:

  • A machine with an NVIDIA GPU (16GB+ VRAM recommended for 7B-class models)
  • Python 3.10 or later
  • CUDA 11.8 or later
  • Basic familiarity with Python and the command line

For smaller Qwen 3.5 variants (1.5B, 4B), you can get started with as little as 8GB VRAM when using QLoRA.

Step 1: Install Unsloth

The cleanest installation method uses pip. Create a fresh virtual environment first:

python -m venv unsloth-env
source unsloth-env/bin/activate
pip install unsloth

Unsloth will pull in its dependencies including the correct versions of Transformers, PEFT, and TRL. If you run into CUDA version issues, check the Unsloth GitHub repository for version-specific installation instructions.

Step 2: Load the Qwen 3.5 Model

Unsloth provides a FastLanguageModel class that handles model loading with its optimizations applied automatically:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-7B",
    max_seq_length=4096,
    load_in_4bit=True,  # Use QLoRA for memory efficiency
)

Setting load_in_4bit=True enables QLoRA, which quantizes the base model to 4-bit precision while keeping the trainable LoRA weights in higher precision. This dramatically reduces memory usage with minimal impact on quality.
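To see why 4-bit loading matters, here is a rough rule-of-thumb calculation: fp16 weights take 2 bytes per parameter, while 4-bit weights take about 0.5 bytes. This only counts the weights themselves; real training also needs memory for activations, optimizer state, and quantization overhead.

```python
# Rough rule-of-thumb memory for base model weights only.
# Illustrative: actual usage adds activations, optimizer state,
# and quantization overhead on top of this.

def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

params_7b = 7e9
fp16_gb = weight_memory_gb(params_7b, 2.0)   # 16-bit: 2 bytes per parameter
int4_gb = weight_memory_gb(params_7b, 0.5)   # 4-bit: 0.5 bytes per parameter

print(f"7B weights in fp16:  {fp16_gb:.1f} GB")  # 14.0 GB
print(f"7B weights in 4-bit: {int4_gb:.1f} GB")  # 3.5 GB
```

This is why a 7B model that would not fit on a 16GB consumer GPU in fp16 becomes comfortably trainable with QLoRA.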

For the model name, use the appropriate Hugging Face identifier for the Qwen 3.5 variant you want to fine-tune. Common options include:

  • Qwen/Qwen3.5-1.5B for the smallest variant
  • Qwen/Qwen3.5-7B for a good balance of capability and efficiency
  • Qwen/Qwen3.5-14B if you have more GPU memory available

Step 3: Configure LoRA

Next, add LoRA adapters to the model. These are the small trainable layers that get fine-tuned while the base model weights stay frozen:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank
    lora_alpha=16,     # LoRA scaling factor
    lora_dropout=0,    # Dropout (0 is recommended by Unsloth)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    bias="none",
    use_gradient_checkpointing="unsloth",  # Memory-efficient checkpointing
)

Key parameters to understand:

  • r (rank): Controls the size of the LoRA adapters. Higher values capture more complexity but use more memory. 16 is a good starting point.
  • target_modules: Which layers to apply LoRA to. The list above covers the main attention and feed-forward layers.
  • use_gradient_checkpointing: Set to "unsloth" for the optimized version that uses less memory than standard gradient checkpointing.
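To build intuition for what r controls: each adapted weight matrix W (shape d_out x d_in) gains two small matrices, A (r x d_in) and B (d_out x r), so LoRA adds r * (d_in + d_out) trainable weights per module. The dimensions below are illustrative placeholders, not actual Qwen 3.5 layer sizes.

```python
# Trainable parameters LoRA adds to one weight matrix:
# A is (r x d_in), B is (d_out x r), so r * (d_in + d_out) in total.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

hidden = 4096  # hypothetical hidden size, for illustration only
r = 16

per_square_proj = lora_params(hidden, hidden, r)
print(per_square_proj)  # 131072 trainable weights for one q_proj-style layer
```

Doubling r doubles the adapter size, which is why r=16 is a sensible starting point before experimenting with larger ranks.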

Step 4: Prepare Your Training Data

Your training data needs to be formatted as conversations or instruction-response pairs. Here is a simple example using the Alpaca format:

from datasets import load_dataset

dataset = load_dataset("your-dataset-name", split="train")

# Format as chat template
def format_prompt(example):
    return {
        "text": tokenizer.apply_chat_template(
            [
                {"role": "user", "content": example["instruction"]},
                {"role": "assistant", "content": example["output"]},
            ],
            tokenize=False,
        )
    }

dataset = dataset.map(format_prompt)

The key is to use the tokenizer's apply_chat_template method so your training data matches the format the model expects during inference.
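For intuition about what apply_chat_template produces, Qwen tokenizers use a ChatML-style layout with <|im_start|> and <|im_end|> markers. The sketch below hand-builds that shape purely for illustration; in practice always call the tokenizer's method, since the real template may differ in detail (system prompts, generation prompts, and so on).

```python
# Illustrative ChatML-style formatting, NOT a replacement for
# tokenizer.apply_chat_template -- shown only so you can recognize
# the structure in your formatted training data.

def chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

text = chatml([
    {"role": "user", "content": "What is LoRA?"},
    {"role": "assistant", "content": "A parameter-efficient fine-tuning method."},
])
print(text)
```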

Step 5: Train

Set up the trainer using TRL's SFTTrainer:

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="outputs",
        seed=42,
    ),
)

trainer.train()

Adjust per_device_train_batch_size and gradient_accumulation_steps based on your available memory. If you run out of GPU memory, reduce the batch size first.
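The reason this trade-off works: the optimizer sees per_device_train_batch_size * gradient_accumulation_steps examples per weight update, so you can shrink the per-device batch (less memory) while raising accumulation (same effective batch size).

```python
# Effective batch size = per-device batch x gradient accumulation steps.
# Halving one and doubling the other keeps training dynamics similar
# while cutting peak GPU memory.

per_device_train_batch_size = 2
gradient_accumulation_steps = 4

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch)  # 8

low_memory_config = 1 * 8  # batch 1, accumulation 8
assert low_memory_config == effective_batch
```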

Step 6: Export the Model

After training, you have several export options:

Save LoRA adapters only

model.save_pretrained("qwen35-lora")
tokenizer.save_pretrained("qwen35-lora")

This saves just the small LoRA adapter weights. To use the model later, you load the base model and apply the adapters.

Merge and save full model

model.save_pretrained_merged("qwen35-merged", tokenizer)

This merges the LoRA weights into the base model and saves the full result. The file is larger but simpler to deploy.

Export to GGUF for Ollama

model.save_pretrained_gguf("qwen35-gguf", tokenizer, quantization_method="q4_k_m")

This exports directly to GGUF format, which you can load into Ollama or llama.cpp for local inference. The q4_k_m quantization provides a good balance of size and quality.
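To serve the exported file with Ollama, you point a Modelfile at the GGUF. The sketch below assumes the export produced a single .gguf file inside qwen35-gguf; the exact filename depends on your Unsloth version, so check the output directory first.

```
FROM ./qwen35-gguf/model-Q4_K_M.gguf
PARAMETER temperature 0.7
```

Save this as Modelfile, then run `ollama create my-qwen35 -f Modelfile` followed by `ollama run my-qwen35` to chat with your fine-tuned model locally.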

Common Issues and Tips

  • Out of memory: Reduce batch size, use QLoRA (load_in_4bit=True), or switch to a smaller Qwen 3.5 variant.
  • Slow training: Make sure Unsloth's optimizations are active. You should see a message during model loading confirming this.
  • Poor results: Check your data quality first. Fine-tuning amplifies the patterns in your data, good and bad.
  • Chat format issues: Always use apply_chat_template to format training data. Mismatched formats are one of the most common sources of poor fine-tuning results.

What to Fine-Tune On

The most successful fine-tuning projects start with a clear, narrow task. Broad improvements are hard to achieve; specific improvements are much more tractable. Good candidates include:

  • Consistent output formatting for your application
  • Domain-specific knowledge or terminology
  • Particular writing style or tone
  • Task-specific behavior (classification, extraction, summarization)

Before fine-tuning, make sure the base model cannot already do what you need. Try Qwen 3.5 free with careful prompting first. Sometimes prompt engineering is enough, and it is much cheaper than fine-tuning.

FAQ

Which Qwen 3.5 size should I fine-tune?

Start with the smallest model that handles your task reasonably well when prompted. Fine-tuning a 7B model is much faster and cheaper than fine-tuning a 14B model, and for many tasks the smaller model is sufficient after fine-tuning.

How much training data do I need?

For LoRA fine-tuning, you can see meaningful results with as few as 100-500 high-quality examples. More data helps, but quality matters more than quantity.

Can I fine-tune on a Mac?

Unsloth currently requires NVIDIA GPUs. For Mac-based fine-tuning, look into MLX-based alternatives, though they are generally less mature.

How long does training take?

With Unsloth on a single RTX 4090, fine-tuning a 7B model on 1000 examples typically takes 15-30 minutes for one epoch. Larger models and datasets scale proportionally.
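As a sanity check on that figure, you can do the arithmetic with the training arguments from this guide. The seconds-per-step number below is a purely illustrative assumption; it varies with hardware, sequence length, and model size.

```python
# Back-of-envelope training time for one epoch on 1000 examples,
# using the guide's batch size 2 and accumulation 4.

steps = 1000 // (2 * 4)          # effective batch of 8 -> 125 optimizer steps
minutes = steps * 10 / 60        # assumed ~10 s/step, illustrative only
print(steps, round(minutes, 1))  # 125 steps, roughly 21 minutes
```

That lands inside the quoted 15-30 minute range, which is a useful cross-check before kicking off a longer run.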

Q-Chat Team
