Transfer Learning with Pre-trained Models — Complete Guide
In this tutorial, you'll learn about transfer learning with pre-trained models. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.
Transfer learning is a machine learning technique where a model trained on one task is reused as the starting point for a different but related task, which can greatly reduce the amount of data and compute needed.
What You'll Learn
How to take pre-trained models like ResNet for images or BERT for text and fine-tune them on your own dataset with just a few lines of Python.
Why It Matters
Training a deep neural network from scratch typically requires massive datasets and days of GPU time. With transfer learning, you can often reach competitive results on a related task with far less: in favorable cases, as few as 100 labeled examples and minutes of training.
Real-World Use
Medical imaging startups use transfer learning to adapt models trained on ImageNet (1.4M generic images) to detect specific diseases from X-rays with only a few hundred patient scans.
How Transfer Learning Works
flowchart LR
A[Large Dataset<br/>ImageNet / Wikipedia] --> B[Pre-trained Model<br/>ResNet / BERT]
B --> C[Feature Extractor]
C --> D[New Dataset<br/>Your Domain]
D --> E[Fine-tuned Model]
E --> F[Your Task]
B --> G[Freeze Early Layers]
G --> H[Replace Classifier Head]
H --> D
Two Transfer Learning Strategies
Feature Extraction
Freeze the pre-trained model's convolutional base and use it as a fixed feature extractor. Only the new classifier head is trained on your data.
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model
# Load ResNet50 with ImageNet weights, without its original classifier
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
# Freeze every layer of the base so only the new head is trained
base_model.trainable = False
# New classifier head: pooling, one hidden layer, 10-class softmax output
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)
# categorical_crossentropy expects one-hot encoded labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
Expected output (last lines):
Total params: 24,114,826
Trainable params: 527,114
Non-trainable params: 23,587,712
Only about 527K of roughly 24.1M parameters are trainable. The rest stay frozen with ImageNet weights.
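The trainable count reported by model.summary() can be checked by hand, because only the two new Dense layers are trainable. The sketch below assumes that ResNet50 without its top produces 2048 feature channels, so the pooling layer outputs a 2048-dimensional vector. The helper name dense_params is made up for illustration.

```python
def dense_params(inputs: int, units: int) -> int:
    """Parameters in a Dense layer: one weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


# Assumption: the frozen ResNet50 base emits 2048 channels, and
# GlobalAveragePooling2D reduces them to a 2048-dimensional vector
# without adding any parameters of its own.
pooled_features = 2048

hidden = dense_params(pooled_features, 256)  # Dense(256, activation='relu')
output = dense_params(256, 10)               # Dense(10, activation='softmax')
head_total = hidden + output

print(hidden, output, head_total)  # 524544 2570 527114
```

Everything outside the head belongs to the frozen base, which is why the trainable share of the model is so small.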
Fine-Tuning
Unfreeze some of the top layers of the pre-trained model and train jointly with the new classifier. This gives better results when your dataset is reasonably large.
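# Unfreeze the base, then re-freeze every layer below index fine_tune_at
base_model.trainable = True
fine_tune_at = 100
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False
# Recompile so the new trainable settings take effect, with a low learning rate
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
# train_generator and val_generator are assumed to be defined elsewhere
# and to yield batches of images with one-hot labels
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=10,
    validation_data=val_generator
)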
Expected output (simplified):
Epoch 1/10 loss: 0.4231 accuracy: 0.8452 val_loss: 0.3124 val_accuracy: 0.8912
Epoch 5/10 loss: 0.1823 accuracy: 0.9387 val_loss: 0.2015 val_accuracy: 0.9210
Epoch 10/10 loss: 0.0876 accuracy: 0.9721 val_loss: 0.1542 val_accuracy: 0.9423
Accuracy improves steadily as the model adapts pre-trained features to your new task.
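The freezing loop used for fine-tuning can be wrapped in a small helper that also reports how many layers end up trainable. The sketch below is a hypothetical helper, not part of Keras. It relies only on each layer having a settable trainable attribute, and it adds one common refinement as an assumption: BatchNormalization layers stay frozen even above the cut-off. The two classes are stand-ins, so the example runs without TensorFlow.

```python
def freeze_below(layers, fine_tune_at, keep_frozen=("BatchNormalization",)):
    """Freeze layers[:fine_tune_at] and unfreeze the rest.

    Layers whose class name is listed in keep_frozen stay frozen everywhere.
    Returns the number of layers left trainable.
    """
    trainable_count = 0
    for index, layer in enumerate(layers):
        stays_frozen = type(layer).__name__ in keep_frozen
        layer.trainable = index >= fine_tune_at and not stays_frozen
        trainable_count += int(layer.trainable)
    return trainable_count


# Stand-in classes for illustration only; these are not the real Keras layers.
class Conv2D:
    trainable = True


class BatchNormalization:
    trainable = True


layers = [Conv2D(), BatchNormalization(), Conv2D(), BatchNormalization(), Conv2D()]
print(freeze_below(layers, fine_tune_at=2))   # 2
print([layer.trainable for layer in layers])  # [False, False, True, False, True]
```

With a real model, the same function would presumably be called on base_model.layers before recompiling. That usage is not exercised here.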
Transfer Learning for NLP with BERT
from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf
# Load the tokenizer and a BERT model with a new 2-class classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
texts = ["I loved this movie", "Terrible product, would not recommend"]
# Tokenize both texts into padded TensorFlow tensors
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')
outputs = model(encoded)
# Convert the raw logits into class probabilities
predictions = tf.nn.softmax(outputs.logits, axis=-1)
print(predictions.numpy())
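Example output after fine-tuning (illustrative):
[[0.023 0.977]
[0.964 0.036]]
Here the first text is scored 97.7% positive and the second 96.4% negative, taking index 1 as the positive class. Output like this requires a model whose classification head has been fine-tuned on sentiment data such as IMDb reviews. The base checkpoint loaded above has a randomly initialized head, so its predictions will be close to chance until you fine-tune it.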
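The softmax step in the BERT example turns each row of logits into probabilities that sum to 1, and the largest probability picks the predicted label. Here is a minimal sketch in plain Python. The logit values are invented, not produced by a real model, and the label order (index 1 = positive) is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


LABELS = ["negative", "positive"]  # assumed label order

# Hypothetical logits: a confident row, and a row with equal logits
# such as an untrained classification head might produce.
rows = [[-1.9, 1.9], [0.0, 0.0]]

for row in rows:
    probs = softmax(row)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(LABELS[best], [round(p, 3) for p in probs])
```

Equal logits give exactly [0.5, 0.5], so the predicted label for that row is an arbitrary tie-break rather than a meaningful prediction.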
Common Mistakes
- Not freezing the base first: if you train the whole model from the start, large gradients from the randomly initialized head can overwrite the pre-trained weights before they can help
- Learning rate too high: when fine-tuning, use a learning rate about 10x lower (1e-4 to 1e-5) than you would when training from scratch
- Unfreezing too many layers: with a small dataset, unfreezing aggressively leads to overfitting
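One way to notice the overfitting mentioned in the last item is to track the gap between training and validation accuracy from epoch to epoch. The sketch below assumes a dictionary shaped like Keras's history.history, with the keys accuracy and val_accuracy. The sample values are epochs 1, 5, and 10 from the simplified training log earlier in this guide. The function name accuracy_gap is made up for illustration.

```python
def accuracy_gap(history):
    """Per-epoch training accuracy minus validation accuracy."""
    return [
        train - val
        for train, val in zip(history["accuracy"], history["val_accuracy"])
    ]


# Epochs 1, 5 and 10 from the simplified training log in this guide.
sample = {
    "accuracy": [0.8452, 0.9387, 0.9721],
    "val_accuracy": [0.8912, 0.9210, 0.9423],
}

gaps = accuracy_gap(sample)
widening = all(later > earlier for earlier, later in zip(gaps, gaps[1:]))

print([round(gap, 4) for gap in gaps])  # [-0.046, 0.0177, 0.0298]
print(widening)                         # True
```

A gap that keeps widening while validation accuracy stalls is a hint to freeze more layers or stop earlier. No fixed threshold is implied here.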
Practice Questions
- What is the key difference between feature extraction and fine-tuning in transfer learning?
- Why should you freeze the pre-trained base before training the classifier head?
- When would fine-tuning outperform feature extraction?
Related Topics
- Python — essential for running the code
- PyTorch Beginners Guide — alternative to TensorFlow
- Neural Networks from Scratch — understand what you're fine-tuning
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.