Introduction
Fine-tuning allows you to customize a pre-trained LLM for your specific use case, improving performance on domain-specific tasks without the cost of training from scratch. This guide covers the latest best practices for fine-tuning in 2026, including when to fine-tune, which techniques to use, and common pitfalls to avoid.
Prerequisites
- Python 3.10+
- GPU with 24GB+ VRAM (for LoRA) or 80GB+ (for full fine-tuning)
- Familiarity with PyTorch and Hugging Face Transformers
- A curated dataset of at least 1,000 examples
When to Fine-Tune (and When Not To)
Fine-tune when:
- You need consistent output formatting
- Domain-specific terminology or knowledge is critical
- RAG alone does not provide sufficient accuracy
- You need to reduce latency (smaller fine-tuned model vs. large general model)
Do NOT fine-tune when:
- Your task can be solved with good prompting
- RAG provides adequate results
- You have fewer than 500 quality training examples
- You need the model to generalize broadly
Step 1: Data Preparation
This is the most critical step: quality data trumps quantity every time.
```python
import json

# Format: list of conversation examples
training_data = [
    {
        "messages": [
            {"role": "system", "content": "You are a medical coding assistant."},
            {"role": "user", "content": "What ICD-10 code for acute bronchitis?"},
            {"role": "assistant", "content": "The ICD-10 code for acute bronchitis is J20.9 (Acute bronchitis, unspecified)."}
        ]
    },
    # ... more examples
]

# Save as JSONL
with open("training_data.jsonl", "w") as f:
    for example in training_data:
        f.write(json.dumps(example) + "\n")
```
Data Quality Checklist
- Each example demonstrates the exact behavior you want
- Responses are accurate and well-formatted
- Diverse inputs covering edge cases
- No contradictory examples
- Consistent formatting across all examples
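Parts of this checklist can be automated before training. A minimal sketch (the function name and the specific checks are my own, not from any library) that flags malformed JSONL examples:

```python
import json

VALID_ROLES = ("system", "user", "assistant")

def validate_example(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    problems = []
    try:
        example = json.loads(line)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    for msg in messages:
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"unknown role: {msg.get('role')!r}")
        if not str(msg.get("content", "")).strip():
            problems.append("empty content")
    if messages[-1].get("role") != "assistant":
        problems.append("last message is not from the assistant")
    return problems

good = '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}'
bad = '{"messages": [{"role": "user", "content": ""}]}'
print(validate_example(good))  # []
print(validate_example(bad))   # ['empty content', 'last message is not from the assistant']
```

Run a check like this over every line of the file; a single malformed or truncated example can silently degrade training.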
Step 2: Choose Your Fine-Tuning Method
LoRA (Low-Rank Adaptation)
Best for most use cases. Trains only a small number of adapter parameters.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,            # Rank of the update matrices
    lora_alpha=32,   # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 8M / total: 7B = 0.11%
```
QLoRA (Quantized LoRA)
For when you have limited VRAM. Loads the base model in 4-bit quantization.
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
```
Step 3: Training Configuration
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    bf16=True,
)
```
Step 4: Evaluation
Always evaluate on a held-out test set:
```python
from datasets import load_dataset

# Split data: 90% train, 10% eval
dataset = load_dataset("json", data_files="training_data.jsonl")
split = dataset["train"].train_test_split(test_size=0.1)

# Run evaluation after training
results = trainer.evaluate(eval_dataset=split["test"])
print(f"Eval Loss: {results['eval_loss']:.4f}")
```
Common Pitfalls
- Overfitting: Too many epochs on small datasets. Use early stopping.
- Catastrophic forgetting: Model loses general capabilities. Keep learning rate low.
- Data contamination: Test data leaking into training. Always use proper splits.
- Wrong format: Inconsistent chat template formatting causes poor results.
- Too little data: Under 500 examples usually produces unreliable results.
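On the overfitting point: transformers ships an EarlyStoppingCallback for this, but the underlying patience logic is simple enough to sketch in plain Python. The class below is illustrative only, not a library API:

```python
class EarlyStopper:
    """Stop training when eval loss has not improved for `patience` evaluations.

    Illustrative sketch of the patience logic; in practice, prefer
    transformers.EarlyStoppingCallback with load_best_model_at_end=True.
    """

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0

    def should_stop(self, eval_loss: float) -> bool:
        if eval_loss < self.best_loss - self.min_delta:
            self.best_loss = eval_loss  # improvement: reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1         # no improvement this evaluation
        return self.bad_evals >= self.patience

# Simulated per-epoch eval losses: the model improves, then plateaus
losses = [1.20, 0.95, 0.96, 0.97, 0.98]
stopper = EarlyStopper(patience=2)
for epoch, loss in enumerate(losses, start=1):
    if stopper.should_stop(loss):
        print(f"Stopping after epoch {epoch}")  # Stopping after epoch 4
        break
```

With small datasets, an eval-loss curve that flattens or rises after one or two epochs is the norm, which is why patience values of 1-3 are common.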
Troubleshooting
- Loss not decreasing: Learning rate too low, or data format issues
- Gibberish output: Learning rate too high, or training too long
- Model ignores fine-tuning: Adapter weights not loaded correctly
- OOM errors: Reduce batch size, enable gradient checkpointing, or use QLoRA
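For OOM errors specifically, the usual first move is to shrink per_device_train_batch_size and raise gradient_accumulation_steps so the effective batch size stays constant: effective batch = per-device batch × accumulation steps × number of GPUs. A small helper (my own, not a library function) makes the trade-off explicit:

```python
def accumulation_steps_for(target_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Gradient accumulation steps needed to reach a target effective batch size."""
    per_step = per_device_batch * num_gpus
    if target_batch % per_step != 0:
        raise ValueError("target batch must be divisible by per-device batch * GPUs")
    return target_batch // per_step

# The Step 3 config reaches 4 * 4 = 16 examples per optimizer step on one GPU.
# If a per-device batch of 4 OOMs, drop to 1 and accumulate 16 steps instead:
print(accumulation_steps_for(16, per_device_batch=1))               # 16
print(accumulation_steps_for(16, per_device_batch=2, num_gpus=2))   # 4
```

Because gradients are averaged over the same number of examples either way, this changes memory use and throughput but not the optimization itself, so the learning rate can stay unchanged.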
Conclusion
Fine-tuning is a powerful tool when applied correctly. Start with LoRA, use high-quality data, and always evaluate thoroughly. The combination of RAG for knowledge and fine-tuning for behavior produces the best results in production applications.
Key Takeaways
- Data quality matters more than quantity
- LoRA is sufficient for most fine-tuning needs
- Always evaluate on held-out data
- Consider RAG as an alternative before fine-tuning