Fine-Tuning LLMs — Full Fine-Tuning vs PEFT, LoRA, Dataset Prep, Training Frameworks & Deployment
Fine-tuning adapts a pre-trained large language model (LLM) to your specific domain or task — instead of prompting GPT-4 for every answer, you train a smaller model on your data so it internalizes your domain knowledge, formatting preferences, and behavior patterns.
What You’ll Learn
- Full fine-tuning vs. Parameter-Efficient Fine-Tuning (PEFT) — trade-offs and when to use each
- LoRA and QLoRA for low-memory fine-tuning
- Dataset preparation: formatting, quality filtering, deduplication, and augmentation
- Training frameworks: Axolotl, Unsloth, and HuggingFace TRL
- Evaluation metrics: perplexity, BLEU, ROUGE, and human evaluation
- Deployment with vLLM, TGI, and Ollama
- Overfitting prevention and cost considerations
Why Fine-Tuning Matters
Base models are generalists: they know a bit about everything. Fine-tuning turns them into specialists. A smaller model fine-tuned on legal documents can match or outperform a general-purpose model such as GPT-4 on a narrow task like contract analysis, at a fraction of the inference cost. Fine-tuned code models can learn to complete your company’s API patterns correctly. For many enterprise AI strategies, fine-tuning is the bridge between general-purpose models and domain-specific needs.
DodaZIP uses fine-tuned models to recognize compression patterns specific to proprietary file formats. Durga Antivirus Pro fine-tunes threat detection models on new malware families weekly, adapting faster than signature-based detection.
Learning Path
flowchart LR
A["OpenAI API Guide"] --> B["Vector Databases"]
B --> C["Fine-Tuning LLMs<br/>You are here"]
C --> D["PEFT Methods"]
C --> E["Deployment"]
D --> F["AI Agents"]
style C fill:#f90,color:#fff
Full Fine-Tuning vs. PEFT
| Aspect | Full Fine-Tuning | PEFT (LoRA, QLoRA) |
|---|---|---|
| Parameters updated | All | 0.1-1% (adapter modules) |
| Memory required | Very high (4-8x model size) | Low (1.2-2x model size) |
| Training time | Days to weeks | Hours to days |
| Output | Full model weights | Small adapter file (MB size) |
| Quality | Highest possible | Near full fine-tuning quality |
| Use case | Large compute budget, maximum accuracy | Limited GPU, rapid iteration |
When to Choose Each
- Full fine-tuning: You have 8+ GPUs (A100 80GB), need maximum accuracy, and have weeks of training time.
- PEFT/LoRA: You have 1-4 GPUs (consumer or cloud), need fast iteration, and can accept <2% quality gap.
- QLoRA: You have a single consumer GPU (RTX 3090, 4090) with 24GB VRAM and want to fine-tune 7B-13B models.
LoRA — Low-Rank Adaptation
LoRA injects trainable rank-decomposition matrices into the transformer layers. Instead of updating the full weight matrix W (huge), it trains two smaller matrices A and B.
Original: h = Wx (W has d×k parameters)
LoRA: h = Wx + BAx (B has d×r, A has r×k — r is rank, typically 8-64)
Trainable parameters: d×k → r×(d+k) (a large reduction when r << min(d,k))
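The parameter arithmetic above can be checked with a few lines of Python. This is a minimal sketch, not a measurement: the helper names lora_params and adapter_total are hypothetical, and the Llama 2 7B shape used here (32 layers, 4096×4096 attention projections, about 6.74B base parameters) is an assumption about the architecture.

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d x k weight matrix: B (d x r) plus A (r x k)."""
    return d * r + r * k


# Assumed Llama 2 7B shape: 32 layers, each with 4096 x 4096 attention projections.
HIDDEN = 4096
LAYERS = 32
BASE_PARAMS = 6_738_415_616  # assumed size of the frozen base model


def adapter_total(r: int, modules_per_layer: int) -> int:
    """Total LoRA parameters when every targeted projection is HIDDEN x HIDDEN."""
    return LAYERS * modules_per_layer * lora_params(HIDDEN, HIDDEN, r)


for r, modules in [(8, 2), (16, 4), (64, 4)]:
    total = adapter_total(r, modules)
    share = total / BASE_PARAMS
    print(f"r={r:>2}, {modules} modules/layer: {total:>11,} trainable ({share:.2%} of base)")
```

Rank and the number of targeted modules both scale the adapter linearly, so doubling either one doubles the trainable parameter count.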
For Llama 2 7B: a full update touches all ~6.7B params; LoRA with r=16 on the four attention projections (the configuration below) trains ≈ 16.8M params (about 0.25%)
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
print(f"Trainable params: {peft_model.num_parameters(only_trainable=True):,}")
print(f"Total params: {peft_model.num_parameters():,}")
Expected output:
Trainable params: 16,777,216
Total params: 6,755,192,832
QLoRA — Quantized LoRA
QLoRA quantizes the base model to 4-bit (NF4 format) and adds LoRA adapters on top. This allows fine-tuning a 7B model on a 24GB GPU.
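A rough back-of-the-envelope estimate shows why 4-bit quantization makes the difference. The sketch below counts weight storage only; activations, LoRA gradients, optimizer state, and quantization constants all add to it, so the numbers are best read as a floor. The helper weight_gb is a hypothetical name, and 1 GB is taken as 1e9 bytes.

```python
# Storage cost per parameter for common precisions (bytes).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16/bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight storage in GB, ignoring quantization metadata."""
    return params_billion * bytes_per_param


for name, nbytes in BYTES_PER_PARAM.items():
    print(f"7B weights in {name:>9}: {weight_gb(7, nbytes):5.1f} GB")
```

At 4 bits the frozen base weights of a 7B model take a small share of a 24GB card, which leaves room for the adapter, its optimizer state, and activations.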
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype="float16",
bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto",
)
# Then apply the LoRA config as above; the 4-bit 7B base model now fits in 24GB
Dataset Preparation
Quality matters more than quantity. A thousand carefully curated examples often beat 100,000 noisy ones.
Format Types
# Instruction format
{"instruction": "Explain what a vector database is.",
"output": "A vector database stores embeddings for similarity search."}
# Chat format
{"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is LoRA?"},
{"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method..."}
]}
# Completion format
{"prompt": "Q: What is fine-tuning?\nA:",
"completion": " Fine-tuning adapts a pre-trained model to a specific task."}
Quality Filtering Steps
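# Dataset quality pipeline
# Placeholder entries: substitute a real profanity/toxicity word list
BLOCKLIST = {"badword1", "badword2"}

def filter_dataset(records, blocklist=BLOCKLIST):
    """Drop records whose output is too short, highly repetitive, or blocklisted."""
    cleaned = []
    for rec in records:
        # Note: chat-format records need their assistant message extracted first
        text = rec.get("output", rec.get("completion", ""))
        # Remove empty or near-empty outputs
        if len(text.strip()) < 10:
            continue
        words = text.split()
        # Remove outputs with excessive repetition (under 30% unique words)
        if len(set(words)) / len(words) < 0.3:
            continue
        # Remove outputs that contain a blocklisted word
        if any(word.lower().strip(".,!?") in blocklist for word in words):
            continue
        cleaned.append(rec)
    # Near-duplicates (cosine similarity > 0.95) are handled in the deduplication step
    return cleaned
Deduplication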
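Exact and trivially different duplicates can be removed cheaply before any embedding model is involved. The sketch below is one assumed way to do that with the standard library; normalize and drop_exact_duplicates are hypothetical helper names, and the function reads the same output/completion fields as the filtering code.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variations hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def drop_exact_duplicates(records):
    """Keep the first record for each distinct normalized output; order is preserved."""
    seen = set()
    unique = []
    for rec in records:
        text = rec.get("output", rec.get("completion", ""))
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique


sample = [
    {"output": "LoRA trains low-rank adapters."},
    {"output": "  lora trains   low-rank adapters. "},
    {"output": "QLoRA adds 4-bit quantization."},
]
print(len(drop_exact_duplicates(sample)))
```

Running a pass like this first shrinks the input to the more expensive pairwise similarity comparison.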
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
model = SentenceTransformer("all-MiniLM-L6-v2")
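def deduplicate(records, threshold=0.95):
    """Greedy near-duplicate removal: keep a record only if no kept record is too similar."""
    texts = [r.get("output", r.get("completion", "")) for r in records]
    embeddings = model.encode(texts)
    keep = []
    for i in range(len(records)):
        if not keep:
            keep.append(i)
            continue
        # Compare against the records kept so far (quadratic overall; fine for small datasets)
        sims = cosine_similarity([embeddings[i]], embeddings[keep])
        if sims.max() < threshold:
            keep.append(i)
    return [records[i] for i in keep]

# Demo: 10,000 identical records collapse to a single one
records = [{"output": "test"}] * 10000
print(f"Before dedup: {len(records)} records")
print(f"After dedup: {len(deduplicate(records, threshold=0.95))} records")
Expected output:
Before dedup: 10000 records
After dedup: 1 records
On real data, deduplication removes near-identical records and typically reduces a raw dataset by 10-30%.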
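Before handing the cleaned data to a training framework, it can help to confirm that every record matches one of the three formats listed under Format Types. The following is a minimal sketch under stated assumptions: detect_format is a hypothetical helper, and the rule that a chat record should end with an assistant turn is an assumption about what makes a usable training example, not a requirement of any particular framework.

```python
def detect_format(record: dict) -> str:
    """Return 'instruction', 'chat', or 'completion'; raise ValueError if the record fits none."""
    if "messages" in record:
        messages = record["messages"]
        if not isinstance(messages, list) or not messages:
            raise ValueError("chat record needs a non-empty 'messages' list")
        for msg in messages:
            if msg.get("role") not in {"system", "user", "assistant"}:
                raise ValueError(f"unexpected role: {msg.get('role')!r}")
            if not str(msg.get("content", "")).strip():
                raise ValueError("empty message content")
        if messages[-1]["role"] != "assistant":
            raise ValueError("chat record should end with an assistant turn")
        return "chat"
    pairs = {"instruction": ("instruction", "output"), "completion": ("prompt", "completion")}
    for fmt, (src, dst) in pairs.items():
        if src in record and dst in record:
            if not str(record[dst]).strip():
                raise ValueError(f"empty '{dst}' field")
            return fmt
    raise ValueError(f"unrecognized record keys: {sorted(record)}")


records = [
    {"instruction": "Explain what a vector database is.",
     "output": "A vector database stores embeddings for similarity search."},
    {"messages": [{"role": "user", "content": "What is LoRA?"},
                  {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method..."}]},
    {"prompt": "Q: What is fine-tuning?\nA:", "completion": ""},
]
for rec in records:
    try:
        print(detect_format(rec))
    except ValueError as err:
        print(f"rejected: {err}")
```

Rejecting malformed records up front is cheaper than diagnosing a training run that silently learned from empty targets.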
Training Frameworks
Axolotl
Configuration-driven fine-tuning — write a YAML config, run one command.
# axolotl_config.yml
base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
load_in_8bit: false
load_in_4bit: true
adapter: qlora
strict: false
datasets:
- path: my_dataset.jsonl
type: sharegpt
conversation: llama2
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./lora-out
sequence_len: 2048
sample_packing: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
train_on_inputs: false
group_by_length: false
gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-4
wandb_project: my-fine-tune
wandb_watch: gradients
bf16: auto
fp16: false
accelerate launch -m axolotl.cli.train axolotl_config.yml
Unsloth
Optimized LoRA/QLoRA training; the project advertises up to 2x faster training and around 50% less memory.
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="unsloth/mistral-7b-bnb-4bit",
max_seq_length=2048,
dtype=None,
load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
)
# Train with the standard HuggingFace Trainer
# (assumes `dataset` is an already-tokenized training dataset)
import torch
from transformers import Trainer, TrainingArguments
trainer = Trainer(
model=model,
train_dataset=dataset,
args=TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
num_train_epochs=3,
learning_rate=2e-4,
fp16=not torch.cuda.is_bf16_supported(),
bf16=torch.cuda.is_bf16_supported(),
logging_steps=10,
output_dir="outputs",
),
)
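trainer.train()
HuggingFace TRL
TRL (Transformer Reinforcement Learning) covers supervised fine-tuning (SFT) as well as preference-based methods such as RLHF and DPO. The example here uses its SFTTrainer.
# Assumes `model` and `dataset` are already defined; newer TRL releases expect
# dataset_text_field and the sequence-length limit to be set on SFTConfig instead
from transformers import TrainingArguments
from trl import SFTTrainer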
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
dataset_text_field="text",
max_seq_length=2048,
args=TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
num_train_epochs=3,
output_dir="sft-output",
),
)
trainer.train()
Evaluation Metrics
| Metric | What It Measures | Range | Interpretation |
|---|---|---|---|
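| Perplexity | How well the model predicts held-out text (exponentiated mean negative log-likelihood per token) | 1-∞ | Lower is better. Only comparable across models that share a tokenizer and evaluation set |
| BLEU | N-gram overlap with reference | 0-100 (0-1 in HuggingFace evaluate) | Higher is better. Best for translation |
| ROUGE-L | Longest common subsequence | 0-100 (0-1 in HuggingFace evaluate) | Higher is better. Best for summarization |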
| Human Eval | Expert rating of outputs | 1-5 | Gold standard but expensive |
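Two of the metrics in the table are simple enough to compute by hand, which makes their definitions concrete. The sketch below uses only the standard library; the function names are hypothetical, the log-probabilities are made up for illustration, and the ROUGE-L tokenization (lowercase alphanumeric runs) is a simplifying assumption rather than a reimplementation of any specific library.

```python
import math
import re


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over tokens (natural-log probabilities)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F-measure on lowercase alphanumeric tokens."""
    pred = re.findall(r"[a-z0-9]+", prediction.lower())
    ref = re.findall(r"[a-z0-9]+", reference.lower())
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# A model that assigns every token probability 0.25 is as uncertain as a 4-way guess.
print(perplexity([math.log(0.25)] * 10))
print(rouge_l_f1("the cat sat on the mat", "the cat lay on the mat"))
```

Perplexity depends on how the text is split into tokens, which is why scores from models with different tokenizers are not directly comparable.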
from evaluate import load
# Load metrics
perplexity = load("perplexity", module_type="metric")
bleu = load("bleu")
rouge = load("rouge")
# Example: evaluate generated text
generated = "Fine-tuning adapts models to specific tasks."
reference = "Fine-tuning adapts pre-trained models to domain-specific tasks."
results = rouge.compute(
predictions=[generated],
references=[reference]
)
print(f"ROUGE-L: {results['rougeL']:.4f}")
Expected output:
ROUGE-L: 0.8235
Deployment
vLLM — High-Throughput Serving
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
# Load the base model with LoRA support enabled
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    enable_lora=True,
    max_lora_rank=64,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
# Apply the LoRA adapter at inference time
# (the adapter name, integer ID, and local path are placeholders)
outputs = llm.generate(
    ["Explain fine-tuning in one sentence."],
    sampling_params,
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora-adapter"),
)
print(outputs[0].outputs[0].text)
TGI (Text Generation Inference)
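# Serve fine-tuned model (mount the host directory that holds it; /path/to/models is a placeholder)
docker run --gpus all -p 8080:80 -v /path/to/models:/models ghcr.io/huggingface/text-generation-inference:latest \
--model-id /models/my-fine-tuned-llama \
--num-shard 4
import requests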
response = requests.post(
"http://localhost:8080/generate",
json={
"inputs": "What is fine-tuning?",
"parameters": {"max_new_tokens": 256, "temperature": 0.7},
}
)
print(response.json())
Ollama — Local Deployment
# Modelfile for the fine-tuned model
FROM ./my-fine-tuned-model
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
# Build and run (shell commands, not part of the Modelfile)
ollama create my-model -f Modelfile
ollama run my-model
Overfitting Prevention
| Technique | Description |
|---|---|
| Validation split | Hold out 5-10% of training data. Stop if val loss increases |
| Weight decay | L2 regularization penalizes large weights |
| Dropout | Randomly disable neurons during training |
| Early stopping | Stop training when validation loss plateaus for N steps |
| Data augmentation | Paraphrase, back-translate, or add noise to training examples |
| Learning rate scheduling | Warmup + cosine decay |
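Two rows of the table, learning rate scheduling and early stopping, translate directly into small pieces of logic. The sketch below is framework-independent and uses hypothetical names (lr_at, EarlyStopping); the validation losses are made-up values for illustration, and decaying to exactly zero is one assumed choice among several.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopping(patience=2)
val_losses = [2.0, 1.6, 1.5, 1.55, 1.6, 1.7]  # made-up values for illustration
for step, loss in enumerate(val_losses):
    if stopper.should_stop(loss):
        print(f"stop at evaluation {step}, best val loss {stopper.best}")
        break
```

A patience greater than one avoids stopping on a single noisy evaluation; in practice the checkpoint with the best validation loss is the one worth keeping.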
Cost Considerations
# Estimate fine-tuning cost
model_params = 7 # billion
train_tokens = 50_000_000 # 50M tokens
gpu_hour_cost = 2.50 # A100 80GB on cloud
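# Full fine-tuning (rough estimate)
full_memory_gb = model_params * 6 # ~42 GB for 7B; an optimistic floor, since mixed-precision AdamW alone needs roughly 16 bytes per parameter
full_time_hours = (train_tokens / 1_000_000) * 0.5 # assumes 0.5 hours per 1M tokens, i.e. ~2M tokens/hour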
full_gpus_needed = 4 # estimate
full_cost = full_time_hours * full_gpus_needed * gpu_hour_cost
print(f"Full FT cost: ${full_cost:.2f}")
# LoRA (estimated)
lora_memory_gb = model_params * 1.5 # ~10.5 GB for 7B QLoRA
lora_time_hours = (train_tokens / 1_000_000) * 0.25 # 2x faster
lora_gpus_needed = 1
lora_cost = lora_time_hours * lora_gpus_needed * gpu_hour_cost
print(f"LoRA cost: ${lora_cost:.2f}")
Expected output:
Full FT cost: $250.00
LoRA cost: $31.25
Under these assumptions, full fine-tuning a 7B model on 50M tokens costs roughly 8x as much as LoRA; how close the resulting quality is depends on the task.
Common Errors
1. Out of memory (CUDA OOM)
Reduce batch size, enable gradient checkpointing, use QLoRA, or reduce sequence length. Start with per_device_train_batch_size=1.
2. Loss not decreasing after first 100 steps
Check: learning rate too high (start at 2e-4 for LoRA), dataset is empty or all padding, model is frozen, or tokenizer mismatch.
3. Model generates only EOS tokens
The model learned to output nothing. Caused by: empty responses in training data, wrong tokenizer padding direction, or excessive dropout.
4. Catastrophic forgetting
The model forgot its original capabilities. Mitigation: use lower learning rate, add 10-20% general domain data to the training mix, use LoRA with lower rank.
5. Evaluation loss increases while training loss decreases
The model is overfitting. Add dropout, increase weight decay, reduce epochs, or add more training data.
6. Adapter not loading during inference
You trained with peft but forgot to merge or specify the adapter path. Load with PeftModel.from_pretrained(base_model, adapter_path).
7. Tokenizer alignment errors
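The tokenizer’s vocabulary doesn’t match the model. Always use the tokenizer that ships with the base model, loading it with AutoTokenizer.from_pretrained and the same model ID rather than picking a tokenizer class by hand. Mistral-7B-v0.1, for example, uses a Llama-style tokenizer.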
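For the out-of-memory case in item 1, lowering the per-device batch size does not have to change the optimization behavior: raising gradient accumulation keeps the effective batch size constant. The sketch below shows the arithmetic; accumulation_steps is a hypothetical helper, and the target of 16 comes from the batch size 4 with 4 accumulation steps used in the training examples, assuming a single GPU.

```python
import math


def accumulation_steps(target_effective_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Smallest gradient_accumulation_steps that reaches the target effective batch size."""
    return math.ceil(target_effective_batch / (per_device_batch * num_gpus))


# Batch size 4 with 4 accumulation steps gives an effective batch of 16 on one GPU.
target = 4 * 4
for per_device in (4, 2, 1):
    steps = accumulation_steps(target, per_device)
    print(f"per_device_train_batch_size={per_device} -> gradient_accumulation_steps={steps}")
```

Smaller micro-batches with more accumulation trade throughput for memory: each optimizer step sees the same number of examples but takes more forward and backward passes.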
Practice Questions
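What is the main advantage of LoRA over full fine-tuning? LoRA trains only about 0.1-1% of the parameters, so memory drops from roughly 4-8x model size to 1.2-2x and training takes hours to days instead of days to weeks, while quality stays close to full fine-tuning (typically within a 2% gap).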
How does QLoRA differ from LoRA? QLoRA adds 4-bit quantization (NF4) on top of LoRA, allowing a 7B model to fit on a single 24GB consumer GPU with minimal quality loss.
What is the recommended dataset size for fine-tuning? 500-5000 high-quality examples for most tasks. More data helps but only if quality is maintained — 1000 good examples beat 100,000 noisy ones.
How do you know when to stop training? Monitor validation loss and stop when it plateaus or starts increasing (early stopping). If training loss keeps decreasing while validation loss rises, the model is overfitting.
What is catastrophic forgetting and how do you prevent it? The model loses previously learned capabilities. Prevent by: mixing 10-20% general data with your domain data, using lower learning rates, and choosing LoRA over full fine-tuning.
Challenge: Fine-tune a 7B model on a domain-specific task (e.g., legal contract analysis, medical QA, or code generation). Design: (1) dataset collection strategy (500+ examples), (2) quality filtering and deduplication pipeline, (3) LoRA configuration (rank, target modules, alpha), (4) training with validation split and early stopping, (5) evaluation using BLEU/ROUGE and human review, (6) deployment with vLLM or TGI.
Try It Yourself
Compare illustrative (hard-coded, not measured) LoRA and full fine-tuning metrics with Python:
# Illustrative training metrics for LoRA vs full fine-tuning
epochs = 5
lora_loss = [3.2, 2.1, 1.8, 1.7, 1.65]
full_loss = [3.1, 1.9, 1.5, 1.3, 1.25]
print("Training Loss Comparison")
print("Epoch | LoRA | Full FT | Diff")
print("-" * 40)
for i in range(epochs):
    diff = full_loss[i] - lora_loss[i]
    print(f"{i+1:5} | {lora_loss[i]:.4f} | {full_loss[i]:.4f} | {diff:+.4f}")
# Illustrative memory comparison
qlora_memory_gb = 10.5
full_memory_gb = 42.0
print("\nMemory (7B model):")
print(f" QLoRA: {qlora_memory_gb} GB ✓ Consumer GPU")
print(f" Full FT: {full_memory_gb} GB ✗ Needs multi-GPU")
Expected output:
Training Loss Comparison
Epoch | LoRA | Full FT | Diff
----------------------------------------
    1 | 3.2000 | 3.1000 | -0.1000
    2 | 2.1000 | 1.9000 | -0.2000
    3 | 1.8000 | 1.5000 | -0.3000
    4 | 1.7000 | 1.3000 | -0.4000
    5 | 1.6500 | 1.2500 | -0.4000

Memory (7B model):
 QLoRA: 10.5 GB ✓ Consumer GPU
 Full FT: 42.0 GB ✗ Needs multi-GPU
What’s Next
| Tutorial | What You’ll Learn |
|---|---|
| Vector Databases Guide | RAG pipelines with vector stores |
| AI Agents Guide | Building agents with fine-tuned models |
| OpenAI API Guide | Compare fine-tuning vs API-based approaches |
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro. Updated 2026-06-20.