Fine-tuning adapts a general-purpose model to your specific domain: your writing style, your data format, your terminology. With QLoRA (Quantized Low-Rank Adaptation), you can fine-tune models in the 30B-parameter range on a single consumer GPU with 24GB of VRAM; 65B-class models fit on a single 48GB card.
Prerequisites
- Linux system with an NVIDIA GPU (16GB+ VRAM recommended)
- Python 3.10+
- CUDA toolkit installed
- A dataset in JSONL format (at least 100 examples, ideally 1000+)
Step 1: Install Dependencies
pip install torch transformers datasets peft bitsandbytes accelerate trl
Step 2: Prepare Your Dataset
Create a JSONL file where each line contains an instruction-response pair:
{"instruction": "Summarize this code review", "input": "The PR adds a new caching layer...", "output": "This PR implements Redis caching..."}
{"instruction": "Write a commit message", "input": "Changed auth timeout from 30s to 60s", "output": "fix: increase auth timeout to 60s to reduce session drops"}
Save as training_data.jsonl. Quality matters more than quantity — 500 high-quality examples beat 10,000 noisy ones.
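Malformed lines or missing keys will surface as confusing errors mid-training, so it's worth validating the file up front. A minimal stdlib-only checker (the required keys mirror the format above; `validate_jsonl` is my own helper name):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional

def validate_jsonl(path):
    """Return a list of (line_number, error) problems; empty means the file is clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append((lineno, f"missing keys: {sorted(missing)}"))
    return problems
```

Run `validate_jsonl("training_data.jsonl")` and fix anything it reports before starting a training run.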
Step 3: Write the Training Script
Create train.py:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
import torch
# Configuration
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct" # or any HF model
OUTPUT_DIR = "./fine-tuned-model"
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Load dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
def format_example(example):
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example.get('input', '')}\n\n"
        f"### Response:\n{example['output']}"
    )
# Training config
training_config = SFTConfig(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_steps=100,
max_seq_length=2048,
)
# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_example,  # called once per example; returns a string
    args=training_config,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")
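The LoRA config above trains only a small fraction of the model. As a back-of-the-envelope check, each adapted projection W of shape (d_out, d_in) gains two low-rank factors, A (r, d_in) and B (d_out, r), i.e. r * (d_in + d_out) extra parameters. The layer shapes below are my reading of the Llama 3.2 3B config (hidden size 3072, grouped-query attention with 8 KV heads of dim 128, 28 layers); treat them as assumptions if you swap base models:

```python
def lora_params(r, shapes, num_layers):
    """Trainable parameters added by LoRA adapters on the given projection shapes."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# Assumed Llama 3.2 3B shapes (d_in, d_out) for the four target_modules:
shapes = [
    (3072, 3072),  # q_proj
    (3072, 1024),  # k_proj (8 KV heads * head dim 128)
    (3072, 1024),  # v_proj
    (3072, 3072),  # o_proj
]
print(lora_params(r=16, shapes=shapes, num_layers=28))  # → 9175040 (~9.2M)
```

Roughly 9M trainable parameters versus ~3B total, which is why the optimizer state fits comfortably alongside the 4-bit base weights.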
Step 4: Run Training
python train.py
Training time depends on dataset size and hardware. With 1000 examples on a 24GB GPU, expect 30-60 minutes for 3 epochs.
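You can also estimate how many optimizer steps a run will take from the config above: the effective batch size is per_device_train_batch_size times gradient_accumulation_steps. A quick sketch using the values from train.py:

```python
import math

def total_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps for a full run, given the effective batch size."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# 1000 examples, batch 4, accumulation 4 (effective batch 16), 3 epochs:
print(total_steps(1000, per_device_batch=4, grad_accum=4, epochs=3))  # → 189
```

With logging_steps=10 you should therefore see around nineteen log lines for a run like this; far more (or fewer) usually means a dataset or config mistake.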
Step 5: Convert for Ollama
To use your fine-tuned model with Ollama, merge the LoRA weights and convert to GGUF:
# Merge LoRA weights back into the base model
python -c "
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
model = AutoPeftModelForCausalLM.from_pretrained('./fine-tuned-model', torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
model.save_pretrained('./merged-model')
AutoTokenizer.from_pretrained('./fine-tuned-model').save_pretrained('./merged-model')
"
# Convert to GGUF (convert_hf_to_gguf.py cannot emit q4_k_m directly)
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16
# Quantize to 4-bit with llama.cpp's quantize tool
llama.cpp/llama-quantize model-f16.gguf model.gguf Q4_K_M
Testing It
Create a Modelfile for Ollama:
echo 'FROM ./model.gguf' > Modelfile
ollama create my-model -f Modelfile
ollama run my-model "Your test prompt here"
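Because the model was trained on the `### Instruction:` prompt format, results improve noticeably if the Modelfile's template matches it. A sketch of a fuller Modelfile (the TEMPLATE mirrors format_example from train.py; adjust it if you changed the prompt format):

```
FROM ./model.gguf
TEMPLATE """### Instruction:
{{ .Prompt }}

### Response:
"""
PARAMETER stop "### Instruction:"
```

Without a matching template, Ollama wraps your prompt in the base model's chat format, and the fine-tuned behavior may not show up at all.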
What's Next
Experiment with different LoRA ranks (r=8 to r=64), try different base models, and build evaluation scripts to measure your fine-tuned model against the base model on your specific tasks.
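For the evaluation scripts, even a simple string-overlap metric is enough to compare base and fine-tuned outputs on held-out examples. A minimal token-level F1 scorer, stdlib only (`token_f1` is my own helper name, not part of any library):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Score every held-out example with both models and compare the means; if the fine-tuned model doesn't beat the base on your own data, you need more or better examples, not more epochs.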
Frequently Asked Questions
How much data do I need?
Minimum viable: 100 high-quality examples for noticeable style adaptation. For reliable task performance: 500-1000 examples. For production: 2000+ examples with evaluation sets.
Can I fine-tune without a GPU?
QLoRA relies on bitsandbytes, which requires a CUDA GPU. For CPU-only fine-tuning, look into smaller models (under 3B parameters) with standard LoRA, though training will be significantly slower.