Transfer Learning with Pre-trained Models — Complete Guide

DodaTech

In this tutorial, you'll learn about transfer learning with pre-trained models. We cover key concepts, practical examples, and best practices so you can apply the technique to your own data.

Transfer learning is a machine learning technique where a model trained on one task is reused as the starting point for a different but related task, which can sharply reduce the amount of data and compute needed.

What You'll Learn

You'll learn how to take pre-trained models such as ResNet (for images) or BERT (for text) and fine-tune them on your own dataset with a few lines of Python.

Why It Matters

Training a deep neural network from scratch typically requires a very large dataset and days of GPU time. With transfer learning you can often reach competitive results with on the order of 100 labeled examples per class and minutes of training.

Real-World Use

Medical imaging startups use transfer learning to adapt models trained on ImageNet (1.4M generic images) to detect specific diseases from X-rays with only a few hundred patient scans.

How Transfer Learning Works

flowchart LR
    A[Large Dataset<br/>ImageNet / Wikipedia] --> B[Pre-trained Model<br/>ResNet / BERT]
    B --> C[Feature Extractor]
    C --> D[New Dataset<br/>Your Domain]
    D --> E[Fine-tuned Model]
    E --> F[Your Task]
    B --> G[Freeze Early Layers]
    G --> H[Replace Classifier Head]
    H --> D

Two Transfer Learning Strategies

Feature Extraction

Freeze the pre-trained model's convolutional base and use it as a fixed feature extractor. Only the new classifier head is trained on your data.

import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.models import Model

# Load ResNet50 with ImageNet weights, without its original 1000-class classifier
base_model = ResNet50(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
base_model.trainable = False  # freeze every layer in the base

# New classifier head: pool the feature maps, then add two dense layers
x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(256, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)  # 10 target classes

model = Model(inputs=base_model.input, outputs=predictions)
# categorical_crossentropy expects one-hot encoded labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Expected output (last lines):

Total params: 24,114,826
Trainable params: 527,114
Non-trainable params: 23,587,712

Only about 527K of 24.1M parameters are trainable: the two Dense layers of the new head. The rest stay frozen with ImageNet weights.
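The trainable-parameter count of a new head like this one can be checked by hand. The following is a minimal sketch, assuming ResNet50's pooled output has 2,048 features (the channel depth of its final convolutional block); `dense_params` is a helper name introduced here for illustration only.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Parameters in a dense layer: one weight per input-unit pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

# Assumption: GlobalAveragePooling2D on ResNet50 yields a 2048-dimensional vector.
POOLED_FEATURES = 2048

hidden = dense_params(POOLED_FEATURES, 256)  # Dense(256, activation='relu')
output = dense_params(256, 10)               # Dense(10, activation='softmax')
head_params = hidden + output

print(hidden, output, head_params)  # 524544 2570 527114
```

Because the base is frozen, these head parameters are the only ones the optimizer updates during feature extraction.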

Fine-Tuning

Unfreeze some of the top layers of the pre-trained model and train jointly with the new classifier. This gives better results when your dataset is reasonably large.

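# Continues from the feature-extraction snippet (reuses tf, base_model, and model)
base_model.trainable = True

# Keep the first 100 layers frozen; only the layers after them are fine-tuned
fine_tune_at = 100
for layer in base_model.layers[:fine_tune_at]:
    layer.trainable = False

# Recompile so the trainable changes take effect, using a low learning rate
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# train_generator and val_generator are assumed to be defined elsewhere
# and to yield batches of (images, one-hot labels)
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=10,
    validation_data=val_generator
)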

Expected output (simplified):

Epoch 1/10  loss: 0.4231  accuracy: 0.8452  val_loss: 0.3124  val_accuracy: 0.8912
Epoch 5/10  loss: 0.1823  accuracy: 0.9387  val_loss: 0.2015  val_accuracy: 0.9210
Epoch 10/10 loss: 0.0876  accuracy: 0.9721  val_loss: 0.1542  val_accuracy: 0.9423

Accuracy improves steadily as the model adapts pre-trained features to your new task.
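One way to read a training log like this is to track the gap between training and validation accuracy, since a gap that keeps widening suggests overfitting. The following is a minimal sketch, assuming a dict shaped like Keras's `history.history` and filled with the three epochs shown in the simplified output; `accuracy_gaps` and `best_epoch` are hypothetical helper names.

```python
# The three logged epochs from the simplified output (not a full 10-epoch history).
epochs = [1, 5, 10]
history = {
    "accuracy": [0.8452, 0.9387, 0.9721],
    "val_accuracy": [0.8912, 0.9210, 0.9423],
    "val_loss": [0.3124, 0.2015, 0.1542],
}

def accuracy_gaps(hist):
    """Training accuracy minus validation accuracy for each logged epoch."""
    return [round(tr - va, 4) for tr, va in zip(hist["accuracy"], hist["val_accuracy"])]

def best_epoch(hist, epoch_numbers):
    """Epoch number with the lowest validation loss."""
    i = min(range(len(hist["val_loss"])), key=lambda k: hist["val_loss"][k])
    return epoch_numbers[i]

gaps = accuracy_gaps(history)
print(gaps)                         # [-0.046, 0.0177, 0.0298]
print(best_epoch(history, epochs))  # 10
```

In this log validation loss is still falling at the last epoch, so the growing gap of about three points is worth watching rather than a reason to stop.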

Transfer Learning for NLP with BERT

from transformers import BertTokenizer, TFBertForSequenceClassification
import tensorflow as tf

# Pre-trained BERT encoder plus a new 2-class classification head
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenize a small batch, padding to the longest text
texts = ["I loved this movie", "Terrible product, would not recommend"]
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')

# Convert the raw logits to per-class probabilities
outputs = model(encoded)
predictions = tf.nn.softmax(outputs.logits, axis=-1)
print(predictions.numpy())

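Illustrative output, once the model has been fine-tuned on sentiment data such as IMDb reviews:

[[0.023 0.977]
 [0.964 0.036]]

With index 1 as the positive class, the first text is scored 97.7% positive and the second 96.4% negative. Note that the checkpoint loaded in the snippet provides a pre-trained encoder but a newly initialized classification head, so before fine-tuning the probabilities carry no meaning and will not match these values.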
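The `tf.nn.softmax` call in the snippet turns the two raw logits for each text into probabilities that sum to 1. Here is a plain-Python sketch of that step; the logit values are made up for illustration and are not actual BERT outputs.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                            # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one text: [negative, positive]
probs = softmax([-1.9, 1.85])
print(probs)

# Equal logits give a 50/50 split, which is roughly what an untrained head tends to produce.
print(softmax([0.0, 0.0]))  # [0.5, 0.5]
```

The larger logit always receives the larger probability, so taking the argmax of the logits and of the probabilities selects the same class.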

Common Mistakes

Not freezing the base first: if you train the whole model from the start, large gradients from the randomly initialized head can wreck the pre-trained weights before they can help
Learning rate too high: when fine-tuning, use a learning rate 10 to 100 times lower (1e-4 to 1e-5) than the 1e-3 you might use when training from scratch
Unfreezing too many layers: unfreezing too aggressively with a small dataset leads to overfitting
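The mistakes above mostly come down to which layers are trainable at each stage and what learning rate goes with them. The sketch below is framework-free: `Layer` is a hypothetical stand-in for a Keras layer, and the layer names, learning rates, and `unfreeze_top` helper are illustrative assumptions rather than part of any library.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    """Hypothetical stand-in for a framework layer with a trainable flag."""
    name: str
    trainable: bool = True

def unfreeze_top(layers, n_top):
    """Freeze every layer except the last n_top; return the names left trainable."""
    cutoff = len(layers) - n_top
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return [layer.name for layer in layers if layer.trainable]

base = [Layer(f"block{i}") for i in range(1, 6)]

# Phase 1: base fully frozen, only the new head trains at a normal learning rate.
phase1_trainable = unfreeze_top(base, 0)
phase1_lr = 1e-3

# Phase 2: unfreeze a few top layers and drop the learning rate by 10x.
phase2_trainable = unfreeze_top(base, 2)
phase2_lr = phase1_lr / 10

print(phase1_trainable)  # []
print(phase2_trainable)  # ['block4', 'block5']
```

With a small dataset, keeping `n_top` low limits how many pre-trained weights can drift, which is the overfitting concern in the last item of the list.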

Practice Questions

What is the key difference between feature extraction and fine-tuning in transfer learning?
Why should you freeze the pre-trained base before training the classifier head?
When would fine-tuning outperform feature extraction?

Frequently Asked Questions

Do I need a GPU for transfer learning?

Feature extraction works fine on CPU for most pre-trained models since only the classifier head is trained. Fine-tuning benefits from a GPU but is still feasible on CPU for small datasets and models like MobileNet.

Can I use transfer learning with tabular data?

Transfer learning is most effective for images and text where large pre-trained models exist. For tabular data, consider using pre-trained embeddings or multi-task learning instead.

What is the minimum dataset size for transfer learning?

As a rough rule of thumb, feature extraction can give reasonable results with as few as 50-100 examples per class, while fine-tuning typically needs 500-1000 examples per class to avoid overfitting.
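These ranges can be turned into a simple starting heuristic. The following is a sketch only: the function name `suggest_strategy` and the single cutoff of 500 examples per class are assumptions drawn from the rough figures in this answer, not a rule from any library.

```python
def suggest_strategy(examples_per_class: int, fine_tune_min: int = 500) -> str:
    """Pick a starting strategy from the number of labeled examples per class.

    fine_tune_min is an assumed cutoff taken from the low end of the
    500-1000 range quoted for fine-tuning.
    """
    if examples_per_class < 0:
        raise ValueError("examples_per_class must be non-negative")
    if examples_per_class >= fine_tune_min:
        return "fine-tuning"
    return "feature extraction"

for n in (80, 499, 500, 2000):
    print(n, suggest_strategy(n))
```

Treat the result as a first experiment to run, and let validation metrics decide whether to unfreeze more layers.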