Fine-tuning adapts a general-purpose model to your specific domain: your writing style, your data format, your terminology. With QLoRA (Quantized Low-Rank Adaptation), you can fine-tune models in the 30B-parameter range on a single consumer GPU with 24GB of VRAM; 65B-class models fit on a single 48GB card.
Prerequisites
- Linux system with an NVIDIA GPU (16GB+ VRAM recommended)
- Python 3.10+
- CUDA toolkit installed
- A dataset in JSONL format (at least 100 examples, ideally 1000+)
Step 1: Install Dependencies
pip install torch transformers datasets peft bitsandbytes accelerate trl
Step 2: Prepare Your Dataset
Create a JSONL file where each line contains an instruction-response pair:
{"instruction": "Summarize this code review", "input": "The PR adds a new caching layer...", "output": "This PR implements Redis caching..."}
{"instruction": "Write a commit message", "input": "Changed auth timeout from 30s to 60s", "output": "fix: increase auth timeout to 60s to reduce session drops"}
Save as training_data.jsonl. Quality matters more than quantity — 500 high-quality examples beat 10,000 noisy ones.
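Malformed lines or missing keys will surface as confusing errors mid-training, so it's worth validating the file up front. A minimal stdlib-only checker (the required keys mirror the format above; `validate_jsonl` is my own helper name):

```python
import json

REQUIRED_KEYS = {"instruction", "output"}  # "input" is optional

def validate_jsonl(path):
    """Return a list of (line_number, error) problems; empty means the file is clean."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc}"))
                continue
            missing = REQUIRED_KEYS - record.keys()
            if missing:
                problems.append((lineno, f"missing keys: {sorted(missing)}"))
    return problems
```

Run `validate_jsonl("training_data.jsonl")` and fix anything it reports before starting a training run.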
Step 3: Write the Training Script
Create train.py:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig
import torch
# Configuration
MODEL_NAME = "meta-llama/Llama-3.2-3B-Instruct" # or any HF model
OUTPUT_DIR = "./fine-tuned-model"
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
quantization_config=bnb_config,
device_map="auto",
)
model = prepare_model_for_kbit_training(model)
# LoRA config
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Load dataset
dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
def format_example(example):
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Input:\n{example.get('input', '')}\n\n"
        f"### Response:\n{example['output']}"
    )
# Training config
training_config = SFTConfig(
output_dir=OUTPUT_DIR,
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
logging_steps=10,
save_steps=100,
max_seq_length=2048,
)
# Train
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=format_example,  # called once per example; returns a string
    args=training_config,
)
trainer.train()
trainer.save_model(OUTPUT_DIR)
print(f"Model saved to {OUTPUT_DIR}")
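The LoRA config above trains only a small fraction of the model. As a back-of-the-envelope check, each adapted projection W of shape (d_out, d_in) gains two low-rank factors, A (r, d_in) and B (d_out, r), i.e. r * (d_in + d_out) extra parameters. The layer shapes below are my reading of the Llama 3.2 3B config (hidden size 3072, grouped-query attention with 8 KV heads of dim 128, 28 layers); treat them as assumptions if you swap base models:

```python
def lora_params(r, shapes, num_layers):
    """Trainable parameters added by LoRA adapters on the given projection shapes."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers

# Assumed Llama 3.2 3B shapes (d_in, d_out) for the four target_modules:
shapes = [
    (3072, 3072),  # q_proj
    (3072, 1024),  # k_proj (8 KV heads * head dim 128)
    (3072, 1024),  # v_proj
    (3072, 3072),  # o_proj
]
print(lora_params(r=16, shapes=shapes, num_layers=28))  # → 9175040 (~9.2M)
```

Roughly 9M trainable parameters versus ~3B total, which is why the optimizer state fits comfortably alongside the 4-bit base weights.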
Step 4: Run Training
python train.py
Training time depends on dataset size and hardware. With 1000 examples on a 24GB GPU, expect 30-60 minutes for 3 epochs.
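You can also estimate how many optimizer steps a run will take from the config above: the effective batch size is per_device_train_batch_size times gradient_accumulation_steps. A quick sketch using the values from train.py:

```python
import math

def total_steps(num_examples, per_device_batch, grad_accum, epochs):
    """Optimizer steps for a full run, given the effective batch size."""
    effective_batch = per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# 1000 examples, batch 4, accumulation 4 (effective batch 16), 3 epochs:
print(total_steps(1000, per_device_batch=4, grad_accum=4, epochs=3))  # → 189
```

With logging_steps=10 you should therefore see around nineteen log lines for a run like this; far more (or fewer) usually means a dataset or config mistake.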
Step 5: Convert for Ollama
To use your fine-tuned model with Ollama, merge the LoRA weights and convert to GGUF:
# Merge LoRA weights back into the base model
python -c "
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
model = AutoPeftModelForCausalLM.from_pretrained('./fine-tuned-model', torch_dtype=torch.bfloat16)
model = model.merge_and_unload()
model.save_pretrained('./merged-model')
AutoTokenizer.from_pretrained('./fine-tuned-model').save_pretrained('./merged-model')
"
# Convert to GGUF (convert_hf_to_gguf.py cannot emit q4_k_m directly)
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile model-f16.gguf --outtype f16
# Quantize to 4-bit with llama.cpp's quantize tool
llama.cpp/llama-quantize model-f16.gguf model.gguf Q4_K_M
Testing It
Create a Modelfile for Ollama:
echo 'FROM ./model.gguf' > Modelfile
ollama create my-model -f Modelfile
ollama run my-model "Your test prompt here"
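Because the model was trained on the `### Instruction:` prompt format, results improve noticeably if the Modelfile's template matches it. A sketch of a fuller Modelfile (the TEMPLATE mirrors format_example from train.py; adjust it if you changed the prompt format):

```
FROM ./model.gguf
TEMPLATE """### Instruction:
{{ .Prompt }}

### Response:
"""
PARAMETER stop "### Instruction:"
```

Without a matching template, Ollama wraps your prompt in the base model's chat format, and the fine-tuned behavior may not show up at all.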
What's Next
Experiment with different LoRA ranks (r=8 to r=64), try different base models, and build evaluation scripts to measure your fine-tuned model against the base model on your specific tasks.
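For the evaluation scripts, even a simple string-overlap metric is enough to compare base and fine-tuned outputs on held-out examples. A minimal token-level F1 scorer, stdlib only (`token_f1` is my own helper name, not part of any library):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1 between a model output and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)  # per-token overlap counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Score every held-out example with both models and compare the means; if the fine-tuned model doesn't beat the base on your own data, you need more or better examples, not more epochs.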
Frequently Asked Questions
How much data do I need?
Minimum viable: 100 high-quality examples for noticeable style adaptation. For reliable task performance: 500-1000 examples. For production: 2000+ examples with evaluation sets.
Can I fine-tune without a GPU?
QLoRA relies on bitsandbytes, which requires a CUDA GPU. For CPU-only fine-tuning, look into smaller models (under 3B parameters) with standard LoRA, though training will be significantly slower.