Cohort Analysis for User Retention -- Complete Measurement & Optimization Guide

DodaTech Updated 2026-06-23 7 min read

In this tutorial, you'll learn about Cohort Analysis for User Retention. We cover key concepts, practical examples, and best practices to help you understand and apply this topic effectively.

Cohort analysis groups users by a shared characteristic or time period and tracks their behavior over time, revealing retention patterns that aggregate metrics like total active users completely hide.

What You'll Learn

In this tutorial, you will learn how to build weekly and monthly cohort retention queries in SQL, create retention heatmaps in Python, distinguish between time-based and behavior-based cohorts, calculate churn rates with proper denominators, and use cohort insights to drive product improvements that increase long-term engagement.

Why It Matters

Aggregate metrics like "50,000 monthly active users" hide critical retention problems. A product can gain 10,000 new users while losing 9,800 existing ones each month. Cohort analysis reveals this hidden churn by tracking specific groups over their lifetime. Companies that actively use cohort analysis improve D30 retention by 15-25% by identifying exactly when and why users stop engaging.

Real-World Use

DodaZIP's product team ran a weekly cohort analysis comparing users who completed onboarding in Q1 2026 versus Q4 2025. The 2026 cohort showed 27% lower 14-day retention. Investigation revealed that a redesigned onboarding step added an unnecessary file selection dialog that caused confusion. Rolling back that step recovered retention to previous levels within 10 days, directly saving an estimated 3,200 users from churning monthly.

Cohort Analysis Flow

flowchart LR
    A[User Acquisition] --> B[Assign to Time Cohort]
    B --> C[Track Activity Periods]
    C --> D[Period 0: Signup Week]
    C --> E[Period 1: Week 1]
    C --> F[Period 2: Week 2]
    C --> G[Period N: Week N]
    D --> H[Build Retention Matrix]
    E --> H
    F --> H
    G --> H
    H --> I[Identify Churn Patterns]
    H --> J[Segment by Behavior]
    H --> K[Alert on Declines]

SQL Weekly Cohort Retention

Build a weekly cohort retention query using date-truncated timestamps:

WITH user_first_week AS (
  SELECT
    user_id,
    DATE_TRUNC('week', MIN(event_timestamp)) AS cohort_week
  FROM analytics_events
  WHERE event_name = 'signup_complete'
  GROUP BY user_id
),

user_weekly_activity AS (
  SELECT
    ufw.user_id,
    ufw.cohort_week,
    DATE_TRUNC('week', ae.event_timestamp) AS activity_week,
    DATE_PART('week', ae.event_timestamp) - DATE_PART('week', ufw.cohort_week)
      + (DATE_PART('year', ae.event_timestamp) - DATE_PART('year', ufw.cohort_week)) * 52
      AS period_number
  FROM analytics_events ae
  JOIN user_first_week ufw ON ae.user_id = ufw.user_id
  WHERE ae.event_name IN ('session_start', 'core_action', 'value_event')
  GROUP BY ufw.user_id, ufw.cohort_week, activity_week, period_number
),

cohort_size AS (
  SELECT
    cohort_week,
    COUNT(DISTINCT user_id) AS total_users
  FROM user_first_week
  GROUP BY cohort_week
)

SELECT
  uwa.cohort_week::DATE AS cohort_week,
  cs.total_users,
  uwa.period_number,
  COUNT(DISTINCT uwa.user_id) AS active_users,
  ROUND(COUNT(DISTINCT uwa.user_id) * 100.0 / cs.total_users, 2) AS retention_pct,
  ROUND(100.0 - (COUNT(DISTINCT uwa.user_id) * 100.0 / cs.total_users), 2) AS churn_pct
FROM user_weekly_activity uwa
JOIN cohort_size cs ON uwa.cohort_week = cs.cohort_week
GROUP BY uwa.cohort_week, cs.total_users, uwa.period_number
ORDER BY uwa.cohort_week, uwa.period_number;

Expected output: A table with each cohort week showing total users, active users per subsequent week, retention percentage, and churn percentage. Period 0 shows 100% retention. A healthy SaaS product maintains 40-60% week-1 retention and 20-30% week-4 retention. Values below these benchmarks indicate churn problems requiring product intervention.

Python Retention Heatmap

Generate a retention heatmap visualization:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from operator import attrgetter

def compute_retention_matrix(events_df, user_col="user_id",
                             date_col="event_date", cohort_col="signup_date"):
    df = events_df.copy()
    df["cohort"] = pd.to_datetime(df[cohort_col]).dt.to_period("M")
    df["activity"] = pd.to_datetime(df[date_col]).dt.to_period("M")
    df["period"] = (df["activity"] - df["cohort"]).apply(attrgetter("n"))

    cohort_user_counts = df.groupby("cohort")[user_col].nunique()

    retention_pivot = df.pivot_table(
        index="cohort",
        columns="period",
        values=user_col,
        aggfunc="nunique",
    )

    retention_matrix = retention_pivot.divide(cohort_user_counts, axis=0) * 100

    fig, ax = plt.subplots(figsize=(14, 8))
    sns.heatmap(
        retention_matrix,
        annot=True,
        fmt=".1f",
        cmap="YlOrRd",
        cbar_kws={"label": "Retention (%)"},
        linewidths=0.5,
        linecolor="#f1f5f9",
        ax=ax,
    )
    ax.set_title("Weekly Cohort Retention Heatmap", fontsize=14, fontweight="bold")
    ax.set_xlabel("Periods Since Signup (Weeks)", fontsize=11)
    ax.set_ylabel("Cohort Week", fontsize=11)
    plt.tight_layout()
    return retention_matrix, fig

np.random.seed(42)
cohorts = pd.date_range("2026-01-01", periods=16, freq="W")
data_rows = []
for i, cohort_date in enumerate(cohorts):
    users = 500 + np.random.randint(-50, 50)
    for period in range(12):
        retention = max(0, 100 * (0.7 ** period) + np.random.normal(0, 3))
        data_rows.append({
            "signup_date": cohort_date,
            "event_date": cohort_date + pd.DateOffset(weeks=period),
            "user_id": f"user_{i}_{period}",
        })

df = pd.DataFrame(data_rows)
matrix, heatmap = compute_retention_matrix(df)
print(matrix.round(1).head(8))

Expected output: A heatmap showing 16 weekly cohorts across 12 periods. Each cell shows the retention percentage. The top row (most recent cohort) should show higher retention if product improvements are working. A diagonal pattern of declining retention from left to right is normal; a vertical decline (newer cohorts performing worse) signals a serious product regression.

Behavior-Based vs Time-Based Cohorts

Compare retention between behavioral and time-based cohort definitions:

import pandas as pd
from datetime import timedelta

def behavior_cohort_analysis(users_df, actions_df, behavior_event="first_purchase"):
    first_behavior = (
        actions_df[actions_df["event_name"] == behavior_event]
        .groupby("user_id")["event_timestamp"]
        .min()
        .reset_index()
        .rename(columns={"event_timestamp": "behavior_date"})
    )

    behavior_users = users_df.Merge(first_behavior, on="user_id", how="inner")

    behavior_users["cohort_month"] = behavior_users["behavior_date"].dt.to_period("M")
    behavior_users["lifetime_days"] = (
        behavior_users["last_active_date"] - behavior_users["behavior_date"]
    ).dt.days

    stats = behavior_users.groupby("cohort_month").agg(
        total_users=("user_id", "count"),
        avg_lifetime_days=("lifetime_days", "mean"),
        retained_7d=("lifetime_days", lambda x: (x >= 7).sum()),
        retained_30d=("lifetime_days", lambda x: (x >= 30).sum()),
        retained_90d=("lifetime_days", lambda x: (x >= 90).sum()),
        retention_7d_pct=("lifetime_days", lambda x: round((x >= 7).sum() / len(x) * 100, 1)),
        retention_30d_pct=("lifetime_days", lambda x: round((x >= 30).sum() / len(x) * 100, 1)),
        retention_90d_pct=("lifetime_days", lambda x: round((x >= 90).sum() / len(x) * 100, 1)),
    ).reset_index()

    return stats

comparison = behavior_cohort_analysis(
    pd.read_csv("users.csv"),
    pd.read_csv("user_events.csv"),
    behavior_event="first_purchase",
)
print(comparison[["cohort_month", "total_users", "retention_7d_pct", "retention_30d_pct", "retention_90d_pct"]].head())

Expected output: Behavior-based cohorts show significantly higher retention at all intervals compared to time-based cohorts because coohorting by action selects users who have already demonstrated product interest. D7 retention of 60-80% for behavioral cohorts versus 20-40% for time-based is typical.

Tool Comparison

Feature	Mixpanel	Amplitude	PostHog	Custom SQL
Retention heatmap	Built-in	Built-in	Built-in	Python seaborn
Behavioral cohorts	Yes	Yes	Yes	SQL-defined
Predictive cohorts	Yes	Yes	No	ML model
Automated alerts	Yes	Yes	Limited	Cron + script
Data sources	SDK events	SDK events	SDK events	Any database
Cost per month	$25+ (growth)	$50+ (growth)	Free (self-host)	Infrastructure only

Common Errors

1. Using Calendar Periods Instead of User Age

Grouping users by calendar month (e.g., "January signups") and tracking retention by calendar month creates misleading comparisons when January has 31 days, February has 28, and different cohorts have different week structures. Use user age (periods since cohort assignment) as the X-axis instead.

2. Ignoring Cohort Size for Statistical Validity

A cohort with 8 users showing 100% retention is meaningless. Exclude cohorts with fewer than 100 users from your analysis, or apply Wilson score confidence intervals to retention percentages to indicate uncertainty.

3. Changing Denominator Mid-Analysis

Using the original cohort size as the denominator for period 0 but switching to active users for period 3 inflates retention percentages. Always use the original cohort size as the denominator for all periods.

4. Mixing Cohort Period Definitions

Comparing weekly cohort retention to monthly cohort retention creates artifacts from differing period lengths. Standardize on weekly, biweekly, or monthly cohorts across all analyses in your report.

5. Survivorship Bias in Long-Term Analyses

Users who churned early stop generating events, so they disappear from active-user counts in later periods. If you switch to "active users in this period" as the denominator, you hide the true churn rate. Always use the original cohort size as the denominator.

Practice Questions

1. What is a cohort in analytics? A cohort is a group of users who share a common characteristic or experience within a defined time period, such as "users who signed up in the week of June 1" or "users who made their first purchase in January."

2. Why use user age (period number) instead of calendar dates? User age normalizes retention tracking so users from different calendar periods are compared at the same point in their lifecycle. This eliminates calendar effects like month length variations, holidays, and seasonality.

3. What does a retention heatmap reveal that a line chart cannot? A retention heatmap displays the full retention matrix with multiple cohorts on the Y-axis and periods on the X-axis. This reveals whether newer cohorts retain better or worse than older cohorts. A line chart typically shows only one cohort at a time.

4. How does behavior-based cohorting differ from time-based cohorting? Behavior-based cohorts group users who performed a specific action (like first purchase or feature activation) rather than users who simply signed up in a time period. Behavioral cohorts show 20-40% higher retention because the cohorting action filters for engaged users.

5. Challenge: Export 6 months of user activity data from your product. Build both time-based (weekly signup) and behavior-based (first feature activation) retention matrices in Python. Compare the retention curves and identify which user behaviors most strongly predict 90-day retention. Propose three onboarding changes that drive those behaviors earlier.

Mini Project

Build a cohort analysis pipeline that ingests user event data from PostgreSQL, computes weekly retention matrices for the last 12 months with three cohort types (signup-week, first-purchase, feature-activated), and generates a dashboard with a retention heatmap and line chart overlay. Add automated alerts that trigger when any cohort's week-4 retention drops below 15% or declines more than 5 percentage points compared to the previous cohort. Export the retention matrix as a CSV with confidence intervals for team sharing. Write a monthly retention report summarizing trends, anomalies, and recommended product actions.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

← Previous Funnel & Conversion Analysis -- Complete Drop-Off Optimization Guide Next → A/B Testing Statistical Analysis -- Complete Guide to Experiment Design & Evaluation

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Analytics