Learn Fine-Tuning LLMs — Full Fine-Tuning vs PEFT, LoRA, Dataset Prep, Training Frameworks & Deployment

DodaTech Updated Jun 20, 2026 11 min read

Fine-tuning adapts a pre-trained large language model (LLM) to your specific domain or task — instead of prompting GPT-4 for every answer, you train a smaller model on your data so it internalizes your domain knowledge, formatting preferences, and behavior patterns.

What You’ll Learn

Full fine-tuning vs. Parameter-Efficient Fine-Tuning (PEFT) — trade-offs and when to use each
LoRA and QLoRA for low-memory fine-tuning
Dataset preparation: formatting, quality filtering, deduplication, and augmentation
Training frameworks: Axolotl, Unsloth, and HuggingFace TRL
Evaluation metrics: perplexity, BLEU, ROUGE, and human evaluation
Deployment with vLLM, TGI, and Ollama
Overfitting prevention and cost considerations

Why Fine-Tuning Matters

Base models are generalists: they know a bit about everything. Fine-tuning turns them into specialists. A model fine-tuned on legal documents can outperform GPT-4 on contract analysis at a fraction of the cost. Fine-tuned code models can learn to complete your company’s API patterns correctly. Many enterprise AI strategies use fine-tuning as the bridge between general-purpose models and domain-specific needs.

DodaZIP uses fine-tuned models to recognize compression patterns specific to proprietary file formats. Durga Antivirus Pro fine-tunes threat detection models on new malware families weekly, adapting faster than signature-based detection.

Learning Path

flowchart LR
  A["OpenAI API Guide"] --> B["Vector Databases"]
  B --> C["Fine-Tuning LLMs<br/>You are here"]
  C --> D["PEFT Methods"]
  C --> E["Deployment"]
  D --> F["AI Agents"]
  style C fill:#f90,color:#fff

Full Fine-Tuning vs. PEFT

Aspect	Full Fine-Tuning	PEFT (LoRA, QLoRA)
Parameters updated	All	0.1-1% (adapter modules)
Memory required	Very high (4-8x model size)	Low (1.2-2x model size)
Training time	Days to weeks	Hours to days
Output	Full model weights	Small adapter file (MB size)
Quality	Highest possible	Near full fine-tuning quality
Use case	Large compute budget, maximum accuracy	Limited GPU, rapid iteration

When to Choose Each

Full fine-tuning: You have 8+ GPUs (A100 80GB), need maximum accuracy, and have weeks of training time.
PEFT/LoRA: You have 1-4 GPUs (consumer or cloud), need fast iteration, and can accept <2% quality gap.
QLoRA: You have a single consumer GPU (RTX 3090, 4090) with 24GB VRAM and want to fine-tune 7B-13B models.

LoRA — Low-Rank Adaptation

LoRA injects trainable rank-decomposition matrices into the transformer layers. Instead of updating the full weight matrix W (huge), it trains two smaller matrices A and B.

Original:  h = Wx          (W has d×k parameters)
LoRA:      h = Wx + BAx    (B has d×r, A has r×k — r is rank, typically 8-64)

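Trainable parameters per adapted matrix: d×k → r×(d+k) (far fewer when r << min(d,k))
For Llama 2 7B: full update = 7B params; LoRA with r=8 on q_proj and v_proj ≈ 4.2M params (0.06%); LoRA with r=16 on all four attention projections ≈ 16.8M params (0.25%)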
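
The parameter arithmetic above can be checked with a short calculation. This is a minimal sketch: the layer count (32) and the 4096×4096 projection shape are assumptions about a Llama-2-7B-style architecture, and `lora_param_count` is a hypothetical helper, not part of any library.

```python
def lora_param_count(d: int, k: int, r: int, n_matrices: int) -> int:
    """Trainable parameters when n_matrices weights of shape d x k each get a rank-r adapter.

    Each adapter adds B (d x r) and A (r x k), i.e. r * (d + k) parameters.
    """
    return n_matrices * r * (d + k)


# Assumed Llama-2-7B-style shapes: 32 layers, 4096 x 4096 attention projections.
LAYERS, HIDDEN = 32, 4096

# r=8 on q_proj and v_proj only (2 matrices per layer)
qv_r8 = lora_param_count(HIDDEN, HIDDEN, r=8, n_matrices=LAYERS * 2)

# r=16 on q_proj, k_proj, v_proj, o_proj (4 matrices per layer)
qkvo_r16 = lora_param_count(HIDDEN, HIDDEN, r=16, n_matrices=LAYERS * 4)

print(f"r=8,  q/v:     {qv_r8:,}")     # 4,194,304
print(f"r=16, q/k/v/o: {qkvo_r16:,}")  # 16,777,216
```

Doubling the rank and doubling the number of targeted matrices each double the adapter size, which is why the two configurations differ by a factor of four.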

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                 # Rank
    lora_alpha=32,        # Scaling factor
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)
print(f"Trainable params: {peft_model.num_parameters(only_trainable=True):,}")
print(f"Total params: {peft_model.num_parameters():,}")

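Expected output (r=16 on four 4096×4096 projections in each of 32 layers, so 128 adapters of 2×4096×16 parameters):

Trainable params: 16,777,216
Total params: 6,755,192,832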

QLoRA — Quantized LoRA

QLoRA quantizes the base model to 4-bit (NF4 format) and adds LoRA adapters on top. This allows fine-tuning a 7B model on a 24GB GPU.

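from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                  # Store base weights in 4-bit
    bnb_4bit_quant_type="nf4",          # NormalFloat4 quantization
    bnb_4bit_compute_dtype="float16",   # Dtype used for matmuls during training
    bnb_4bit_use_double_quant=True,     # Also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training before adding adapters
model = prepare_model_for_kbit_training(model)
# Then apply the LoRA config as above; the 7B model now fits in 24GB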
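
To see why 4-bit quantization matters, it helps to compare the memory taken by the weights alone at different precisions. The sketch below is a back-of-the-envelope estimate, not a measurement: `weight_memory_gb` is a hypothetical helper, and it ignores quantization constants, activations, adapter gradients, and optimizer state.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (billions of params x bytes per param)."""
    return params_billion * bits_per_param / 8


fp16_gb = weight_memory_gb(7, 16)  # 16-bit base weights
nf4_gb = weight_memory_gb(7, 4)    # 4-bit (NF4) base weights

print(f"7B weights in fp16: {fp16_gb:.1f} GB")  # 14.0 GB
print(f"7B weights in NF4:  {nf4_gb:.1f} GB")   # 3.5 GB
```

The memory freed by storing the base weights in 4 bits is what leaves room on a 24GB card for activations and the adapter's optimizer state.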

Dataset Preparation

Quality usually matters more than quantity: about 1,000 carefully curated examples can beat 100,000 noisy ones.

Format Types

# Instruction format
{"instruction": "Explain what a vector database is.",
 "output": "A vector database stores embeddings for similarity search."}

# Chat format
{"messages": [
  {"role": "system", "content": "You are a helpful assistant."},
  {"role": "user", "content": "What is LoRA?"},
  {"role": "assistant", "content": "LoRA is a parameter-efficient fine-tuning method..."}
]}

# Completion format
{"prompt": "Q: What is fine-tuning?\nA:",
 "completion": " Fine-tuning adapts a pre-trained model to a specific task."}

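Training tools usually expect one format for the whole dataset, so mixed sources often need converting. The following is a minimal sketch of converting the instruction and completion formats above into the chat format and writing JSONL; the function names and the default system prompt are illustrative assumptions.

```python
import json

DEFAULT_SYSTEM = "You are a helpful assistant."  # assumed default, change as needed


def instruction_to_chat(record, system_prompt=DEFAULT_SYSTEM):
    """Convert an instruction-format record into chat format."""
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": record["instruction"]},
        {"role": "assistant", "content": record["output"]},
    ]}


def completion_to_chat(record, system_prompt=DEFAULT_SYSTEM):
    """Convert a completion-format record into chat format."""
    return {"messages": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": record["prompt"]},
        {"role": "assistant", "content": record["completion"].strip()},
    ]}


instruction_record = {
    "instruction": "Explain what a vector database is.",
    "output": "A vector database stores embeddings for similarity search.",
}
completion_record = {
    "prompt": "Q: What is fine-tuning?\nA:",
    "completion": " Fine-tuning adapts a pre-trained model to a specific task.",
}

chat_records = [
    instruction_to_chat(instruction_record),
    completion_to_chat(completion_record),
]

# JSONL: one JSON object per line (json.dumps escapes embedded newlines)
jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in chat_records)
print(jsonl)
```
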
Quality Filtering Steps

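# Dataset quality pipeline (near-duplicates are handled in the Deduplication step)
def filter_dataset(records, blocklist=()):
    """Drop records whose output is empty, highly repetitive, or blocklisted."""
    cleaned = []
    for rec in records:
        # Instruction records use "output", completion records use "completion".
        # Chat-format records need their assistant message extracted first.
        text = rec.get("output", rec.get("completion", ""))

        # Remove empty or very short outputs
        if len(text.strip()) < 10:
            continue

        # Remove outputs with excessive repetition (under 30% unique words)
        words = text.split()
        if len(set(words)) / len(words) < 0.3:
            continue

        # Remove outputs containing any blocklisted term (supply your own list)
        lowered = text.lower()
        if any(term in lowered for term in blocklist):
            continue

        cleaned.append(rec)

    return cleaned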

Deduplication

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def deduplicate(records, threshold=0.95):
    """Greedy near-duplicate removal: keep a record only if it is not too similar to a kept one."""
    texts = [r.get("output", r.get("completion", "")) for r in records]
    embeddings = embedder.encode(texts)

    keep = []
    for i in range(len(records)):
        if keep:
            # Compare against every record kept so far (O(n^2) in the worst case)
            sims = cosine_similarity([embeddings[i]], embeddings[keep])
            if sims.max() >= threshold:
                continue
        keep.append(i)

    return [records[i] for i in keep]

records = [{"output": "test"}] * 10000
print(f"Before dedup: {len(records)} records")
print(f"After dedup: {len(deduplicate(records, threshold=0.95))} records")

Expected output:

Before dedup: 10000 records
After dedup: 1 records

All 10,000 demo records are identical, so only one survives. On real data, deduplication removes near-identical records and typically reduces a raw dataset by 10-30%.
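
Embedding-based comparison is expensive on large datasets, so a cheap exact-match pass beforehand can shrink the input. The sketch below is an assumption layered on the pipeline above rather than part of it: it hashes a whitespace- and case-normalized copy of each output and keeps the first record per hash.

```python
import hashlib


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def drop_exact_duplicates(records, key="output"):
    """Keep the first record for each distinct normalized text."""
    seen = set()
    unique = []
    for rec in records:
        digest = hashlib.sha256(normalize(rec.get(key, "")).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        unique.append(rec)
    return unique


sample = [
    {"output": "Hello   World"},
    {"output": "hello world"},       # same text after normalization
    {"output": "Something else"},
]
deduped = drop_exact_duplicates(sample)
print(len(deduped))  # 2
```

This pass runs in linear time, leaving only the genuinely distinct texts for the slower similarity search.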

Training Frameworks

Axolotl

Configuration-driven fine-tuning — write a YAML config, run one command.

# axolotl_config.yml
base_model: mistralai/Mistral-7B-v0.1
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer

load_in_8bit: false
load_in_4bit: true
adapter: qlora
strict: false

datasets:
  - path: my_dataset.jsonl
    type: sharegpt
    conversation: llama2

dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./lora-out

sequence_len: 2048
sample_packing: true

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj

train_on_inputs: false
group_by_length: false
gradient_accumulation_steps: 4
micro_batch_size: 4
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-4

wandb_project: my-fine-tune
wandb_watch: gradients

bf16: auto
fp16: false

accelerate launch -m axolotl.cli.train axolotl_config.yml
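
The batch settings in the config interact, so it is worth working out the effective batch size and the number of optimizer steps before launching. This is a rough sketch: the GPU count and dataset size are assumptions for illustration, and `sample_packing: true` packs several examples into one sequence, so the real step count will usually be lower.

```python
import math

# Values taken from the config above
micro_batch_size = 4
gradient_accumulation_steps = 4
num_epochs = 3
val_set_size = 0.05

# Assumptions for illustration only
num_gpus = 1
num_examples = 1000

# Examples consumed per optimizer step
effective_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus

# Examples left for training after the validation split
train_examples = num_examples - round(num_examples * val_set_size)

steps_per_epoch = math.ceil(train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_epochs

print(f"Effective batch size: {effective_batch_size}")  # 16
print(f"Training examples:    {train_examples}")        # 950
print(f"Steps per epoch:      {steps_per_epoch}")       # 60
print(f"Total steps:          {total_steps}")           # 180
```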

Unsloth

Optimized LoRA/QLoRA training. The project advertises roughly 2x faster training and about 50% less memory use.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/mistral-7b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

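# Train with the standard HuggingFace Trainer
# (`dataset` is assumed to be a tokenized training dataset prepared beforehand)
import torch
from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it, otherwise fall back to fp16
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()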

HuggingFace TRL

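TRL (Transformer Reinforcement Learning) covers supervised fine-tuning with SFTTrainer as well as preference methods such as RLHF and DPO.

from transformers import TrainingArguments
from trl import SFTTrainer

# `model` and `dataset` are assumed to be loaded already;
# `dataset` needs a "text" column holding the formatted examples.
# Recent TRL releases expect dataset_text_field and the sequence-length
# option on SFTConfig instead of SFTTrainer; check the version you have installed.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=3,
        output_dir="sft-output",
    ),
)
trainer.train()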

Evaluation Metrics

Metric	What It Measures	Range	Interpretation
Perplexity	How well the model predicts held-out data	1-∞	Lower is better. Only comparable between models that share a tokenizer and evaluation set
BLEU	N-gram overlap with reference	0-1 (often reported as 0-100)	Higher is better. Best for translation
ROUGE-L	Longest common subsequence	0-1 (often reported as 0-100)	Higher is better. Best for summarization
Human Eval	Expert rating of outputs	1-5	Gold standard but expensive
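
Perplexity in the table is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming you already have per-token natural-log probabilities from the model (the input values here are made up for illustration):

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the evaluated tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# A model that assigns probability 0.25 to every correct token behaves like
# a uniform guess among 4 options, so its perplexity is 4.
uniform_guess = [math.log(0.25)] * 10
print(f"{perplexity(uniform_guess):.2f}")  # 4.00

# A more confident model (probability 0.8 per token) scores lower.
confident = [math.log(0.8)] * 10
print(f"{perplexity(confident):.2f}")  # 1.25
```

Reading perplexity as "the number of equally likely choices the model is hesitating between" makes it clear why 1 is the floor.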

from evaluate import load

# Load metrics
perplexity = load("perplexity", module_type="metric")
bleu = load("bleu")
rouge = load("rouge")

# Example: evaluate generated text
generated = "Fine-tuning adapts models to specific tasks."
reference = "Fine-tuning adapts pre-trained models to domain-specific tasks."

results = rouge.compute(
    predictions=[generated],
    references=[reference]
)
print(f"ROUGE-L: {results['rougeL']:.4f}")

Expected output (7 of 7 predicted tokens match in order, against 10 reference tokens, so F1 = 2×1.0×0.7 / 1.7):

ROUGE-L: 0.8235
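
ROUGE-L can also be computed by hand, which makes the score easier to interpret. The sketch below assumes lowercase alphanumeric tokenization with no stemming; library implementations may tokenize differently, so treat it as an illustration of the longest-common-subsequence idea rather than a drop-in replacement.

```python
import re


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(prediction, reference):
    """F1 of LCS-based precision and recall."""
    pred, ref = tokenize(prediction), tokenize(reference)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


generated = "Fine-tuning adapts models to specific tasks."
reference = "Fine-tuning adapts pre-trained models to domain-specific tasks."

score = rouge_l_f1(generated, reference)
print(f"ROUGE-L F1: {score:.4f}")  # 0.8235
```

Here every generated token appears in order in the reference (precision 1.0), but the reference has three extra tokens (recall 0.7).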

Deployment

vLLM — High-Throughput Serving

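from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# Load the base model with LoRA support enabled
llm = LLM(
    model="mistralai/Mistral-7B-v0.1",
    enable_lora=True,
    max_lora_rank=64,
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# Apply the LoRA adapter at inference time.
# "my-adapter" and "/path/to/lora-adapter" are placeholders for your own
# adapter name and directory; the integer is a unique ID for the adapter.
outputs = llm.generate(
    ["Explain fine-tuning in one sentence."],
    sampling_params,
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora-adapter"),
)
print(outputs[0].outputs[0].text)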

TGI (Text Generation Inference)

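# Serve the fine-tuned model stored in ./models/my-fine-tuned-llama on the host.
# The -v flag mounts that directory into the container at /models.
# --num-shard 4 splits the model across 4 GPUs; set it to your GPU count.
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/models:/models \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id /models/my-fine-tuned-llama \
  --num-shard 4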

import requests

response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is fine-tuning?",
        "parameters": {"max_new_tokens": 256, "temperature": 0.7},
    }
)
print(response.json())

Ollama — Local Deployment

# Create a Modelfile for fine-tuned model
FROM ./my-fine-tuned-model
TEMPLATE "[INST] {{ .Prompt }} [/INST]"
PARAMETER temperature 0.7
PARAMETER num_ctx 4096

# Build and run
ollama create my-model -f Modelfile
ollama run my-model

Overfitting Prevention

Technique	Description
Validation split	Hold out 5-10% of training data. Stop if val loss increases
Weight decay	L2 regularization penalizes large weights
Dropout	Randomly disable neurons during training
Early stopping	Stop training when validation loss plateaus for N steps
Data augmentation	Paraphrase, back-translate, or add noise to training examples
Learning rate scheduling	Warmup + cosine decay
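
Early stopping from the table can be expressed as a small patience counter. This is a minimal sketch with made-up validation losses; `early_stopping_index` is a hypothetical helper, and training frameworks ship their own callbacks for this.

```python
def early_stopping_index(val_losses, patience=3, min_delta=0.0):
    """Return the evaluation index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far for `patience` consecutive evaluations.
    """
    best = float("inf")
    evals_without_improvement = 0
    for i, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss
            evals_without_improvement = 0
        else:
            evals_without_improvement += 1
            if evals_without_improvement >= patience:
                return i
    return None


# Validation loss improves for three evaluations, then drifts upward
val_losses = [2.10, 1.80, 1.65, 1.66, 1.70, 1.75, 1.90]
stop_at = early_stopping_index(val_losses, patience=3)
best_at = val_losses.index(min(val_losses))

print(f"Stop at evaluation {stop_at}, restore checkpoint from evaluation {best_at}")
# Stop at evaluation 5, restore checkpoint from evaluation 2
```

Stopping alone is not enough: the checkpoint worth keeping is the one from the best evaluation, not the last one.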

Cost Considerations

# Estimate fine-tuning cost (throughput and price figures are rough assumptions)
model_params = 7  # billion
train_tokens = 50_000_000  # 50M tokens
gpu_hour_cost = 2.50  # USD per hour, A100 80GB on cloud

# Full fine-tuning (estimated)
# ~42 GB assumes 16-bit weights and gradients plus an 8-bit optimizer (~6 bytes/param);
# standard mixed-precision AdamW needs ~16 bytes/param (~112 GB) before activations
full_memory_gb = model_params * 6
full_time_hours = (train_tokens / 1_000_000) * 0.5  # 0.5 hours per 1M tokens (~2M tokens/hour)
full_gpus_needed = 4  # estimate
full_cost = full_time_hours * full_gpus_needed * gpu_hour_cost
print(f"Full FT cost: ${full_cost:.2f}")

# LoRA (estimated)
lora_memory_gb = model_params * 1.5  # ~10.5 GB for 7B QLoRA
lora_time_hours = (train_tokens / 1_000_000) * 0.25  # assumed 2x faster: 0.25 hours per 1M tokens
lora_gpus_needed = 1
lora_cost = lora_time_hours * lora_gpus_needed * gpu_hour_cost
print(f"LoRA cost: ${lora_cost:.2f}")

Expected output:

Full FT cost: $250.00
LoRA cost: $31.25

Under these assumptions, full fine-tuning a 7B model on 50M tokens costs about 8x more than LoRA for comparable quality.

Common Errors

1. Out of memory (CUDA OOM)

Reduce batch size, enable gradient checkpointing, use QLoRA, or reduce sequence length. Start with per_device_train_batch_size=1.

2. Loss not decreasing after first 100 steps

Check: learning rate too high (start at 2e-4 for LoRA), dataset is empty or all padding, model is frozen, or tokenizer mismatch.

3. Model generates only EOS tokens

The model learned to output nothing. Caused by: empty responses in training data, wrong tokenizer padding direction, or excessive dropout.

4. Catastrophic forgetting

The model forgot its original capabilities. Mitigation: use lower learning rate, add 10-20% general domain data to the training mix, use LoRA with lower rank.

5. Evaluation loss increases while training loss decreases

The model is overfitting. Add dropout, increase weight decay, reduce epochs, or add more training data.

6. Adapter not loading during inference

You trained with peft but forgot to merge or specify the adapter path. Load with PeftModel.from_pretrained(base_model, adapter_path).

7. Tokenizer alignment errors

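The tokenizer’s vocabulary doesn’t match the model. Always use the tokenizer that shipped with the base model, and load it with AutoTokenizer.from_pretrained(base_model_id) rather than hard-coding a class. In transformers, Mistral 7B checkpoints use the Llama tokenizer classes, which is why the Axolotl config above sets tokenizer_type: LlamaTokenizer for a Mistral model.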
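
Error 6 above mentions merging the adapter. Merging folds the low-rank update into the base weight, W' = W + (lora_alpha / r)·BA, so the merged model gives the same outputs without a separate adapter file. The toy example below checks that identity with plain Python lists; the matrices are made up, and in practice PEFT's `merge_and_unload()` performs the merge on real models.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def add_scaled(X, Y, scale):
    """Element-wise X + scale * Y."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


# Toy shapes: W is d x k with d=2, k=3; adapter rank r=1
W = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]        # frozen base weight
B = [[0.5], [-1.0]]          # d x r
A = [[1.0, 0.0, 2.0]]        # r x k
lora_alpha, r = 2, 1
scale = lora_alpha / r

x = [[1.0], [2.0], [3.0]]    # input as a k x 1 column vector

# Adapter kept separate: h = Wx + scale * B(Ax)
h_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged: W' = W + scale * BA, then h = W'x
W_merged = add_scaled(W, matmul(B, A), scale)
h_merged = matmul(W_merged, x)

print(h_adapter)  # [[21.0], [18.0]]
print(h_merged)   # [[21.0], [18.0]]
```

Keeping the adapter separate allows swapping adapters on one base model; merging simplifies deployment to servers that expect a single set of weights.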

Practice Questions

What is the main advantage of LoRA over full fine-tuning? LoRA trains only 0.1-1% of parameters, using 4-8x less memory and running 2-3x faster while retaining 98%+ of full fine-tuning quality.
How does QLoRA differ from LoRA? QLoRA adds 4-bit quantization (NF4) on top of LoRA, allowing a 7B model to fit on a single 24GB consumer GPU with minimal quality loss.
What is the recommended dataset size for fine-tuning? 500-5000 high-quality examples for most tasks. More data helps but only if quality is maintained — 1000 good examples beat 100,000 noisy ones.
How do you know when to stop training? Monitor validation loss. Stop when it plateaus or starts increasing (early stopping). Training loss continues to decrease but validation loss diverging = overfitting.
What is catastrophic forgetting and how do you prevent it? The model loses previously learned capabilities. Prevent by: mixing 10-20% general data with your domain data, using lower learning rates, and choosing LoRA over full fine-tuning.
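
The advice to mix 10-20% general data into the training set can be sketched as a small sampling helper. The function name, the 15% default, and the placeholder records are assumptions for illustration.

```python
import random


def mix_general_data(domain, general, general_fraction=0.15, seed=0):
    """Return a shuffled dataset in which about general_fraction of items are general-domain."""
    if not 0 <= general_fraction < 1:
        raise ValueError("general_fraction must be in [0, 1)")
    # Solve n_general / (len(domain) + n_general) = general_fraction
    n_general = round(len(domain) * general_fraction / (1 - general_fraction))
    n_general = min(n_general, len(general))  # cannot sample more than we have
    rng = random.Random(seed)
    mixed = list(domain) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder records standing in for real examples
domain_data = [{"source": "domain", "id": i} for i in range(850)]
general_data = [{"source": "general", "id": i} for i in range(500)]

mixed = mix_general_data(domain_data, general_data, general_fraction=0.15)
n_general = sum(1 for rec in mixed if rec["source"] == "general")

print(len(mixed), n_general)  # 1000 150
```

Fixing the seed keeps the mix reproducible between runs, which makes it easier to attribute a change in results to the training setup rather than the sample.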

Challenge: Fine-tune a 7B model on a domain-specific task (e.g., legal contract analysis, medical QA, or code generation). Design: (1) dataset collection strategy (500+ examples), (2) quality filtering and deduplication pipeline, (3) LoRA configuration (rank, target modules, alpha), (4) training with validation split and early stopping, (5) evaluation using BLEU/ROUGE and human review, (6) deployment with vLLM or TGI.

FAQ

How much does it cost to fine-tune an LLM?

LoRA fine-tuning a 7B model on 50M tokens costs roughly $25-35 on a single A100 at about $2.50 per GPU-hour. Full fine-tuning the same model costs ~$200-500. Llama 3 70B full fine-tuning can exceed $10,000.

Can I fine-tune GPT-4?

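OpenAI offers hosted fine-tuning for GPT-4o and GPT-3.5 Turbo rather than the original GPT-4. You upload training data as JSONL and OpenAI trains the model on its infrastructure. GPT-4o training was priced at about $25 per 1M tokens at launch; inference on the fine-tuned model is billed separately per input and output token, so check the current pricing page before budgeting.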

What is the difference between fine-tuning and RAG?

Fine-tuning changes the model’s weights — the knowledge becomes part of the model. RAG retrieves external documents at inference time — the model’s weights don’t change. RAG is cheaper and easier to update. Fine-tuning provides lower latency and works offline.

Do I need a GPU to fine-tune?

Yes, training an open-weight model yourself requires a GPU. LoRA/QLoRA makes it feasible on consumer GPUs (RTX 3090/4090 with 24GB). Cloud options: Colab Pro ($10/month), RunPod ($0.34/hr), Lambda Labs, AWS.

How long does fine-tuning take?

LoRA on a 7B model with 1000 examples at 2048 sequence length: 1-3 hours on 1x A100. Full fine-tuning: 8-24 hours on 4x A100. QLoRA on RTX 4090: 3-6 hours.

Can I fine-tune a model for a non-English language?

Yes — fine-tuning is excellent for adapting models to new languages. You need 1000+ high-quality examples in the target language. LoRA works well for this, with typical quality gains of 20-40% on language-specific tasks.

Try It Yourself

Compare illustrative LoRA and full fine-tuning numbers with Python (the values are hard-coded examples, not measurements):

# Illustrative training losses for LoRA vs full fine-tuning
epochs = 5
lora_loss = [3.2, 2.1, 1.8, 1.7, 1.65]
full_loss = [3.1, 1.9, 1.5, 1.3, 1.25]

print("Training Loss Comparison")
print("Epoch | LoRA     | Full FT  | Diff")
print("-" * 40)
for i in range(epochs):
    # Negative diff means full fine-tuning reached a lower loss
    diff = full_loss[i] - lora_loss[i]
    print(f"{i+1:5} | {lora_loss[i]:.4f} | {full_loss[i]:.4f} | {diff:+.4f}")

# Illustrative memory comparison (same estimates as in Cost Considerations)
qlora_memory_gb = 10.5
full_memory_gb = 42.0
print("\nMemory (7B model):")
print(f"  QLoRA:       {qlora_memory_gb} GB  ✓ Consumer GPU")
print(f"  Full FT:     {full_memory_gb} GB  ✗ Needs multi-GPU")

Expected output:

Training Loss Comparison
Epoch | LoRA     | Full FT  | Diff
----------------------------------------
    1 | 3.2000 | 3.1000 | -0.1000
    2 | 2.1000 | 1.9000 | -0.2000
    3 | 1.8000 | 1.5000 | -0.3000
    4 | 1.7000 | 1.3000 | -0.4000
    5 | 1.6500 | 1.2500 | -0.4000

Memory (7B model):
  QLoRA:       10.5 GB  ✓ Consumer GPU
  Full FT:     42.0 GB  ✗ Needs multi-GPU

What’s Next

Tutorial	What You’ll Learn
Vector Databases Guide	RAG pipelines with vector stores
AI Agents Guide	Building agents with fine-tuned models
OpenAI API Guide	Compare fine-tuning vs API-based approaches

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro. Updated 2026-06-20.
