Google Cloud Run Guide — Container-Based Serverless Deployment

DodaTech Updated Jun 15, 2026 8 min read

Google Cloud Run is a managed compute platform that runs stateless containers on a fully serverless infrastructure, automatically scaling from zero based on HTTP request traffic.

What You’ll Learn

By the end of this tutorial, you’ll understand Cloud Run’s auto-scaling behavior, concurrency settings, request timeout configuration, IAM for access control, and how to containerize and deploy a web application step by step.

Why Cloud Run Matters

Cloud Run combines the portability of containers with the simplicity of serverless. You get the flexibility of any runtime, any language, any library — packaged in a Docker container — without managing servers. It’s ideal for APIs, web apps, and event-driven microservices. DodaTech uses Cloud Run for Doda Browser API services that handle variable traffic patterns without paying for idle capacity.

Google Cloud Run Learning Path


flowchart LR
  A[Cloud Basics] --> B[GCP]
  B --> C[Cloud Run]
  C --> D{You Are Here}
  D --> E[Containerization]
  D --> F[Auto-scaling]
  D --> G[IAM]
  E --> H[Dockerfile]
  E --> I[Deploy]
  F --> J[Concurrency]
  F --> K[Timeout]

Prerequisites: Docker basics and Python or Go fundamentals, plus a working understanding of GCP services and serverless concepts.

What Is Cloud Run?

Think of Cloud Run like a food truck that parks outside your office only when people are hungry. When no one orders, the truck drives away (scale to zero). When 100 people line up, suddenly 10 trucks appear (scale up). You don’t pay for parked trucks — only for food served.

Each truck (container instance) can serve multiple customers simultaneously (concurrency). If an order takes too long, it is cancelled (timeout). You control the menu (container image), and every truck uses the same recipe (immutable image).

Containerizing a Web App for Cloud Run

# Dockerfile
# Multi-stage build for a Python web app on Cloud Run
FROM python:3.12-slim AS builder

WORKDIR /app
COPY requirements.txt .
RUN pip install --user -r requirements.txt

FROM python:3.12-slim
WORKDIR /app

# Copy only the installed packages from builder
COPY --from=builder /root/.local /root/.local
COPY app.py .

ENV PATH=/root/.local/bin:$PATH

# Cloud Run provides the PORT environment variable; bind gunicorn to it
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 app:app

# app.py
# Simple web app for Cloud Run
import os
from datetime import datetime
from flask import Flask, request, jsonify

app = Flask(__name__)

# Cloud Run injects the port via environment variable
PORT = int(os.environ.get('PORT', 8080))

@app.route('/')
def home():
    return jsonify({
        "service": "dodatech-cloud-run-demo",
        "version": "1.0.0",
        "status": "running",
        "timestamp": datetime.now().isoformat(),
    })

@app.route('/api/process', methods=['POST'])
def process():
    """Example API endpoint."""
    # silent=True returns None for a missing or malformed JSON body
    # instead of raising, so the 400 branch below handles both cases
    data = request.get_json(silent=True)
    if not data:
        return jsonify({"error": "No data provided"}), 400

    result = {
        "received": data,
        "processed": True,
        "processed_at": datetime.now().isoformat(),
        # K_SERVICE and K_REVISION are set by Cloud Run at runtime
        "service": os.environ.get('K_SERVICE', 'unknown'),
        "revision": os.environ.get('K_REVISION', 'unknown'),
    }
    return jsonify(result)

@app.route('/health')
def health():
    return jsonify({"status": "healthy"}), 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=PORT, debug=False)

# requirements.txt
flask>=3.0
gunicorn>=22.0

# cloud_run_simulator.py
# Simulate Cloud Run auto-scaling and concurrency.
# Simplified model on a simulated clock (milliseconds); not the real autoscaler.

class CloudRunSimulator:
    def __init__(self, max_concurrent=80):
        self.max_concurrent = max_concurrent
        self.instances = []

    def handle_request(self, request_id, arrival_ms, duration_ms):
        """Route one request to an instance with a free slot, or start a new one."""
        # Release the slots of requests that finished before this arrival
        for inst in self.instances:
            inst["in_flight"] = [end for end in inst["in_flight"] if end > arrival_ms]

        # Find an available instance or create one
        instance = next(
            (inst for inst in self.instances
             if len(inst["in_flight"]) < self.max_concurrent),
            None,
        )
        # Only the request that starts a new instance pays the cold start
        cold_start = instance is None
        if cold_start:
            instance = {"id": len(self.instances) + 1, "in_flight": []}
            self.instances.append(instance)

        instance["in_flight"].append(arrival_ms + duration_ms)
        return {"request_id": request_id, "instance": instance["id"], "cold_start": cold_start}

sim = CloudRunSimulator(max_concurrent=10)

print("=== Cloud Run Auto-Scaling Simulation ===\n")

# Simulate a traffic spike: one request every 10 ms, each taking 250 ms
all_results = []
for i in range(30):
    result = sim.handle_request(f"req-{i:03d}", arrival_ms=i * 10, duration_ms=250)
    all_results.append(result)
    label = "[COLD]" if result["cold_start"] else "[WARM]"
    print(f"  {label} {result['request_id']} → Instance {result['instance']}")

cold_starts = sum(1 for r in all_results if r["cold_start"])
instances_used = len(set(r["instance"] for r in all_results))
print(f"\nSummary: {len(all_results)} requests, {cold_starts} cold starts, {instances_used} instances")
print(f"Peak instances: {instances_used} (auto-scaled from demand)")

Expected output:

=== Cloud Run Auto-Scaling Simulation ===

  [COLD] req-000 → Instance 1
  [WARM] req-001 → Instance 1
  ...
  [COLD] req-010 → Instance 2
  ...
  [COLD] req-020 → Instance 3
  ...
  [WARM] req-025 → Instance 1
  ...

Summary: 30 requests, 3 cold starts, 3 instances
Peak instances: 3 (auto-scaled from demand)

Key Cloud Run Features

Concurrency

Cloud Run can route multiple requests to the same container instance simultaneously, up to a configurable limit (default 80, maximum 1000).

# concurrency_demo.py
# Demonstrate Cloud Run concurrency
import time
import threading

class ContainerInstance:
    def __init__(self, max_concurrency=80):
        self.max_concurrency = max_concurrency
        self.active_requests = 0
        self._lock = threading.Lock()  # protects the shared counter

    def can_accept(self):
        with self._lock:
            return self.active_requests < self.max_concurrency

    def handle(self, request_id):
        with self._lock:
            self.active_requests += 1
            active = self.active_requests
        print(f"  [{request_id}] Active requests on instance: {active}")
        time.sleep(0.05)
        with self._lock:
            self.active_requests -= 1
        return f"{request_id} done"

instance = ContainerInstance(max_concurrency=5)
threads = []

print("=== Concurrency Test (5 concurrent requests, 5 capacity) ===\n")
for i in range(5):
    # Pass the request ID via args so each thread gets its own value;
    # a lambda would read the loop variable i only when the thread runs
    t = threading.Thread(target=instance.handle, args=(f"req-{i:03d}",))
    threads.append(t)
    t.start()

for t in threads:
    t.join()

print("\n✓ All requests completed concurrently on one instance")

Auto-scaling

Cloud Run scales primarily on request concurrency: as running instances approach their concurrency limit, new instances are started. The autoscaler also considers the CPU utilization of running instances.
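
The concurrency-driven part of this behavior can be sketched as a small calculation. The function below is a simplified model, not Cloud Run's actual autoscaler; the name desired_instances and its parameters are hypothetical and only mirror the --concurrency, --min-instances, and --max-instances settings used in this guide.

```python
import math

def desired_instances(in_flight, concurrency=80, min_instances=0, max_instances=100):
    """Estimate how many instances are needed for the current in-flight requests.

    Simplified model: every instance is filled to its concurrency limit
    before another one is started. Hypothetical helper, not a Cloud Run API.
    """
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    needed = math.ceil(in_flight / concurrency)
    # Clamp the estimate to the configured floor and ceiling
    return max(min_instances, min(needed, max_instances))

for load in (0, 80, 81, 400, 20000):
    print(f"{load:>5} in-flight requests -> {desired_instances(load)} instance(s)")
```

With the defaults, zero load maps to zero instances (scale to zero), and very large loads are capped at the assumed max_instances value.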

Request Timeout

The default request timeout is 300 seconds (5 minutes). It can be set to any value from 1 to 3600 seconds (60 minutes); a request that runs longer than the configured timeout is ended with a 504 error.
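
One way to cope with the timeout inside a handler is to track a deadline and stop before it is reached. The sketch below assumes the application is told its configured timeout (here passed in as timeout_seconds); the Deadline class, the safety margin, and process_items are hypothetical names for illustration, not part of any Cloud Run API.

```python
import time

class Deadline:
    """Track the time left before an assumed request timeout expires."""

    def __init__(self, timeout_seconds=300, safety_margin=5.0, clock=time.monotonic):
        if not 1 <= timeout_seconds <= 3600:
            raise ValueError("timeout must be between 1 and 3600 seconds")
        # Never let the margin consume more than half of the time budget
        margin = min(safety_margin, timeout_seconds / 2)
        self._clock = clock
        self._expires_at = clock() + timeout_seconds - margin

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def expired(self):
        return self.remaining() == 0.0

def process_items(items, deadline):
    """Process items until done or the deadline is reached; return (done, pending)."""
    done = []
    for index, item in enumerate(items):
        if deadline.expired():
            return done, list(items[index:])
        done.append(item.upper())  # placeholder for real work
    return done, []

done, pending = process_items(["a", "b", "c"], Deadline(timeout_seconds=300))
print(done, pending)
```

Returning the unprocessed items lets the caller retry them in a follow-up request instead of losing work when the request is cut off.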

IAM and Authentication

# Grant a service account invoker access
gcloud run services add-iam-policy-binding dodatech-api \
  --member="serviceAccount:api-consumer@project.iam.gserviceaccount.com" \
  --role="roles/run.invoker"

# Allow unauthenticated access (for public APIs)
gcloud run services add-iam-policy-binding dodatech-api \
  --member="allUsers" \
  --role="roles/run.invoker"
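
When a service is not public, a caller holding roles/run.invoker has to send an identity token with each request. The sketch below uses only the standard library and shows one way a workload running on Google Cloud might do that: it requests a token from the metadata server and attaches it as a Bearer header. The helper names and the example URL are hypothetical, and call_service only works from inside a Google Cloud runtime.

```python
import urllib.parse
import urllib.request

METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity"
)

def build_token_request(audience):
    """Build the metadata-server request for an ID token scoped to `audience`."""
    query = urllib.parse.urlencode({"audience": audience})
    return urllib.request.Request(
        f"{METADATA_IDENTITY_URL}?{query}",
        headers={"Metadata-Flavor": "Google"},
    )

def build_authorized_request(service_url, id_token):
    """Build a request to the service that carries the ID token."""
    return urllib.request.Request(
        service_url,
        headers={"Authorization": f"Bearer {id_token}"},
    )

def call_service(service_url):
    """Fetch a token, then call the service (requires a Google Cloud runtime)."""
    with urllib.request.urlopen(build_token_request(service_url), timeout=5) as resp:
        id_token = resp.read().decode("utf-8").strip()
    request = build_authorized_request(service_url, id_token)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return resp.read()

# Hypothetical service URL and placeholder token, shown without any network call
example = build_authorized_request("https://dodatech-api.example.com/health", "TOKEN")
print(example.get_header("Authorization"))
```

Outside Google Cloud the token would have to come from elsewhere, for example from the gcloud CLI or a client library.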

Deploying to Cloud Run

# Build the image, then deploy it
gcloud builds submit --tag gcr.io/PROJECT/dodatech-api
gcloud run deploy dodatech-api \
  --image gcr.io/PROJECT/dodatech-api \
  --platform managed \
  --region us-central1 \
  --allow-unauthenticated \
  --concurrency=80 \
  --timeout=300 \
  --memory=512Mi \
  --cpu=1

# Deploy a new revision with different settings
gcloud run deploy dodatech-api \
  --image gcr.io/PROJECT/dodatech-api \
  --min-instances=2 \
  --max-instances=10 \
  --concurrency=50

# View logs
gcloud logging read "resource.type=cloud_run_revision AND resource.labels.service_name=dodatech-api" --limit=10
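
If deployments are scripted, the same command can be assembled from a settings dictionary so that each environment differs only in its values. This is a minimal sketch: build_deploy_command is a hypothetical helper, and it assumes every keyword maps directly to a gcloud run deploy flag of the same name, as the flags in this guide do.

```python
import shlex

def build_deploy_command(service, image, region, **settings):
    """Build the argv list for `gcloud run deploy` from keyword settings.

    Keyword names become flags by replacing underscores with hyphens,
    for example min_instances=2 becomes --min-instances=2.
    """
    command = ["gcloud", "run", "deploy", service, "--image", image, "--region", region]
    for name, value in sorted(settings.items()):
        command.append(f"--{name.replace('_', '-')}={value}")
    return command

command = build_deploy_command(
    "dodatech-api",
    "gcr.io/PROJECT/dodatech-api",
    "us-central1",
    concurrency=50,
    min_instances=2,
    max_instances=10,
)
# Print the command for review instead of running it
print(shlex.join(command))
```

Passing the resulting list to subprocess.run(command, check=True) would execute it, which avoids building a shell string by hand.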

Common Cloud Run Mistakes

1. Not Setting Min Instances for Production

The default minimum is 0 instances (scale to zero). For production APIs that need consistently low latency, set --min-instances=1 or higher so that requests arriving after an idle period do not wait for a cold start.

2. Over-Provisioning CPU

By default, Cloud Run allocates and bills CPU only while an instance is processing requests. If your app is I/O-bound, a smaller CPU allocation can reduce cost without hurting throughput.

3. Writing to Local Filesystem

Cloud Run instances are ephemeral and filesystem writes are lost when the instance shuts down. Use Cloud Storage or Cloud SQL for persistence.

4. Not Configuring Startup CPU

Slow startup initialization lengthens cold starts, so the first request routed to a new instance waits longer and may time out. Enable startup CPU boost with --cpu-boost, which temporarily gives the instance extra CPU while it starts.

5. Ignoring Concurrency Settings

Setting concurrency too high (e.g., 250) for a CPU-intensive app causes performance degradation. Benchmark your app to find the optimal concurrency for your workload.
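
The Dockerfile earlier in this guide starts Gunicorn with 1 worker and 8 threads, so one instance executes at most 8 requests at a time and anything beyond that waits inside the instance. The sketch below estimates that wait under a deliberately simple assumption (every request needs one thread for a fixed service time); worst_case_wait_ms is a hypothetical helper and the 50 ms service time is an assumed figure.

```python
import math

def worst_case_wait_ms(concurrency, worker_slots, service_time_ms):
    """Worst-case in-instance queueing delay when `concurrency` requests arrive at once.

    Assumes each request occupies one worker slot for a fixed service time.
    """
    if worker_slots < 1:
        raise ValueError("worker_slots must be at least 1")
    # Requests run in batches of worker_slots; the last batch waits
    # for every earlier batch to finish
    batches = math.ceil(concurrency / worker_slots)
    return max(0, batches - 1) * service_time_ms

for concurrency in (8, 40, 80, 250):
    wait = worst_case_wait_ms(concurrency, worker_slots=8, service_time_ms=50)
    print(f"concurrency={concurrency:>3}: last request waits up to {wait} ms")
```

Under this model, a concurrency setting far above the number of worker threads adds queueing delay rather than throughput, which is why the setting is worth benchmarking against the real workload.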

Practice Questions

1. How does Cloud Run auto-scale?

Cloud Run scales primarily on request concurrency. Each instance handles up to its configured concurrency limit (default 80). As running instances approach that limit, new instances are created, up to --max-instances. Idle instances are removed, down to zero unless --min-instances is set.

2. What is the difference between Cloud Run and Cloud Functions?

Cloud Run runs arbitrary containers (any runtime, any library) with HTTP requests. Cloud Functions runs specific function code with a wider range of event triggers (Cloud Storage, Pub/Sub, Firestore). Cloud Run is for containerized apps; Functions is for single-purpose functions.

3. How does Cloud Run handle concurrency?

Multiple requests can be routed to the same container instance simultaneously. The instance processes them in parallel (within the limits of your runtime — Gunicorn with multiple workers/threads).

4. What are Cloud Run revisions?

Each deployment creates a new revision. Revisions are immutable and versioned. You can split traffic between revisions (e.g., 90% stable, 10% canary), roll back, and pin specific revisions.
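
The effect of a 90/10 split can be illustrated with a small deterministic model. The revision names and the route_requests helper are hypothetical, and the bucketing scheme is only an illustration; Cloud Run's own routing is not guaranteed to follow this exact pattern.

```python
def route_requests(total_requests, split):
    """Assign requests to revisions in proportion to integer percentages.

    `split` maps revision name -> percent and must sum to 100.
    Request n goes to the revision whose cumulative percentage band
    contains n % 100.
    """
    if sum(split.values()) != 100:
        raise ValueError("traffic percentages must sum to 100")
    counts = {revision: 0 for revision in split}
    for n in range(total_requests):
        bucket = n % 100
        upper = 0
        for revision, percent in split.items():
            upper += percent
            if bucket < upper:
                counts[revision] += 1
                break
    return counts

print(route_requests(1000, {"stable": 90, "canary": 10}))
```

Rolling back corresponds to setting the split to 100 for the earlier revision, which the same function models with {"stable": 100, "canary": 0}.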

5. Challenge: Design a Cloud Run deployment strategy for a web app that receives 100 requests/second with occasional spikes to 1000, needs sub-second p50 latency, and costs must be minimized.

Set --min-instances=5 for baseline capacity, --max-instances=50 for peak, --concurrency=40 (balanced for typical web app). Use --cpu-boost for faster cold starts. Monitor with Cloud Monitoring and adjust min instances based on baseline traffic.
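
A rough capacity estimate helps sanity-check those numbers. The sketch below applies Little's law (requests in flight = arrival rate × average latency) and divides by the per-instance concurrency; the 0.5 second average latency is an assumption chosen for illustration, and instances_for_load is a hypothetical helper.

```python
import math

def instances_for_load(requests_per_second, avg_latency_seconds, concurrency):
    """Estimate the instances needed to hold the requests in flight."""
    in_flight = requests_per_second * avg_latency_seconds  # Little's law
    return math.ceil(in_flight / concurrency)

ASSUMED_LATENCY_SECONDS = 0.5  # assumption, not a measured value

baseline = instances_for_load(100, ASSUMED_LATENCY_SECONDS, concurrency=40)
spike = instances_for_load(1000, ASSUMED_LATENCY_SECONDS, concurrency=40)
print(f"baseline: {baseline} instances, spike: {spike} instances")
```

Under these assumptions the estimate is 2 instances at baseline and 13 during a spike, so --min-instances=5 and --max-instances=50 leave headroom; measured latency from Cloud Monitoring should replace the assumed value before tuning further.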

Mini Project: Cloud Run Deployment Checker

# cloud_run_check.py
# Check Cloud Run deployment configuration best practices

class CloudRunConfig:
    def __init__(self, min_instances=0, max_instances=100, concurrency=80,
                 timeout=300, memory="512Mi", cpu=1):
        self.min_instances = min_instances
        self.max_instances = max_instances
        self.concurrency = concurrency
        self.timeout = timeout
        self.memory = memory
        self.cpu = cpu

    def check(self):
        issues = []
        warnings = []

        if self.min_instances == 0:
            warnings.append("min_instances=0: Requests will experience cold start delays")

        if self.concurrency > 100:
            warnings.append(f"concurrency={self.concurrency}: High concurrency may degrade performance for CPU-bound apps")

        if self.timeout < 10:
            issues.append(f"timeout={self.timeout}s: Very short timeout; requests may be cut off")

        # Convert the memory string to MiB (1Gi = 1024Mi)
        if self.memory.endswith('Gi'):
            memory_mb = int(float(self.memory[:-2]) * 1024)
        else:
            memory_mb = int(self.memory.replace('Mi', ''))
        if memory_mb < 256:
            warnings.append(f"memory={self.memory}: Low memory may cause OOM errors for Python/Node apps")

        return {"issues": issues, "warnings": warnings, "pass": len(issues) == 0}

configs = [
    CloudRunConfig(min_instances=0, max_instances=100, concurrency=80, timeout=60, memory="128Mi"),
    CloudRunConfig(min_instances=2, max_instances=50, concurrency=40, timeout=300, memory="1Gi"),
    CloudRunConfig(min_instances=0, max_instances=10, concurrency=200, timeout=5, memory="256Mi"),
]

print("=== Cloud Run Config Checker ===\n")
for i, cfg in enumerate(configs):
    result = cfg.check()
    print(f"Config {i+1}: min={cfg.min_instances}, max={cfg.max_instances}, "
          f"conc={cfg.concurrency}, timeout={cfg.timeout}s, mem={cfg.memory}")
    for w in result["warnings"]:
        print(f"  ⚠ {w}")
    for e in result["issues"]:
        print(f"  ✗ {e}")
    if result["pass"] and not result["warnings"]:
        print("  ✓ Perfect configuration")
    print()

Expected output:

=== Cloud Run Config Checker ===

Config 1: min=0, max=100, conc=80, timeout=60s, mem=128Mi
  ⚠ min_instances=0: Requests will experience cold start delays
  ⚠ memory=128Mi: Low memory may cause OOM errors for Python/Node apps

Config 2: min=2, max=50, conc=40, timeout=300s, mem=1Gi
  ✓ Perfect configuration

Config 3: min=0, max=10, conc=200, timeout=5s, mem=256Mi
  ⚠ min_instances=0: Requests will experience cold start delays
  ⚠ concurrency=200: High concurrency may degrade performance for CPU-bound apps
  ✗ timeout=5s: Very short timeout; requests may be cut off

What’s Next

You now understand Google Cloud Run! Next, explore Docker and Kubernetes for deeper container orchestration, and learn cloud monitoring for observability.

  • Practice daily — Containerize a simple Flask app with Docker
  • Build a project — Deploy a containerized API on Cloud Run
  • Explore related topics — Check out Cloud Run jobs for batch workloads

Remember: every expert was once a beginner. Keep coding!
