Computer Architecture Explained — CPU Pipeline, Cache & RISC vs CISC

DodaTech Updated Jun 15, 2026 6 min read

Computer architecture is the design of computer systems — how the CPU, memory, and I/O components are organized and connected to execute programs.

What You’ll Learn

In this tutorial, you’ll learn how the CPU pipeline works (fetch-decode-execute), the memory hierarchy from L1 cache to main memory, the RISC vs CISC design philosophies, and how modern CPUs achieve instruction-level parallelism.

Why It Matters

Understanding computer architecture helps you write code that runs faster by leveraging cache locality, avoiding pipeline stalls, and choosing the right data structures. It’s essential for systems programming, game development, and embedded systems.

Real-World Use

When your browser renders a page, the CPU is fetching instructions from L1 cache (1 ns), decoding them, executing ALU operations, and writing results — all in a pipeline. Durga Antivirus Pro optimizes its scanning engine to minimize cache misses, achieving real-time file scanning without noticeable slowdown.


graph LR
  subgraph "CPU Pipeline"
    A[Fetch] --> B[Decode]
    B --> C[Execute]
    C --> D[Memory Access]
    D --> E[Write Back]
  end
  F[L1 Cache ~1ns] --> A
  G[L2 Cache ~5ns] --> F
  H[L3 Cache ~15ns] --> G
  I[Main RAM ~80ns] --> H
  J[Disk ~10ms] -.-> I
  style A fill:#4f46e5,color:#fff
  style B fill:#4f46e5,color:#fff
  style C fill:#4f46e5,color:#fff
  style D fill:#4f46e5,color:#fff
  style E fill:#4f46e5,color:#fff

The CPU Pipeline

A modern CPU doesn’t execute one instruction at a time — it overlaps execution using a pipeline.

Stage	What Happens	Hardware
Fetch	Read instruction from memory (L1 cache)	Program counter, instruction cache
Decode	Interpret opcode, identify operands	Decoder logic
Execute	Perform ALU operation or address calculation	ALU, FPU, address generator
Memory	Read/write data from/to cache or RAM	Load/store unit, data cache
Write Back	Store result in register file	Register file write port
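The benefit of overlapping stages can be put in numbers. The sketch below assumes an ideal pipeline (one cycle per stage, no stalls); the function names are illustrative and not from any library. Without overlap, n instructions on k stages take n * k cycles; with overlap, the first instruction takes k cycles and each later one finishes one cycle after its predecessor.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Each instruction runs all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int = 5) -> int:
    """Ideal pipeline: fill it once, then retire one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


print(unpipelined_cycles(7))  # 35
print(pipelined_cycles(7))    # 11

# For long instruction streams the speedup approaches the number of stages.
speedup = unpipelined_cycles(1000) / pipelined_cycles(1000)
print(f"{speedup:.2f}")       # 4.98
```

Real pipelines fall short of this bound because of the hazards covered in the next section.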

Pipeline Hazards

Pipelines face three types of hazards:

1. Structural hazards — Two instructions need the same hardware unit. Modern CPUs duplicate resources (multiple ALUs, multiple cache ports).

2. Data hazards — An instruction depends on a previous instruction’s result. Solved by forwarding (bypassing) or inserting stalls (bubbles).

3. Control hazards — Branches make the next instruction unknown. Solved by branch prediction and speculative execution.

class SimplePipeline:
    def __init__(self):
        self.stages = ["Fetch", "Decode", "Execute", "Memory", "WriteBack"]
        self.pipeline = ["—"] * 5
        self.cycle = 0

    def advance(self, instruction):
        self.cycle += 1
        # Shift pipeline stages
        for i in range(4, 0, -1):
            self.pipeline[i] = self.pipeline[i-1]
        self.pipeline[0] = instruction
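        # The instruction now in WriteBack retires this cycle. It stays visible
        # in the printout and is overwritten by the next shift.
        result = self.pipeline[4]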
        print(f"Cycle {self.cycle}: {' | '.join(self.pipeline)}")
        return result

pipe = SimplePipeline()
print("Cycle: Fetch | Decode | Execute | Memory | WriteBack")
for i in range(1, 8):
    pipe.advance(f"Inst{i}")

Expected output:

Cycle: Fetch | Decode | Execute | Memory | WriteBack
Cycle 1: Inst1 | — | — | — | —
Cycle 2: Inst2 | Inst1 | — | — | —
Cycle 3: Inst3 | Inst2 | Inst1 | — | —
Cycle 4: Inst4 | Inst3 | Inst2 | Inst1 | —
Cycle 5: Inst5 | Inst4 | Inst3 | Inst2 | Inst1
Cycle 6: Inst6 | Inst5 | Inst4 | Inst3 | Inst2
Cycle 7: Inst7 | Inst6 | Inst5 | Inst4 | Inst3
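The data hazards described earlier can be found mechanically. The following is a minimal sketch that lists read-after-write dependencies between nearby instructions. The `Instr` type, the register names, and the `window` of two instructions are assumptions for illustration: in a classic five-stage design without forwarding, a consumer one or two slots behind its producer is the case that needs stalls or forwarding.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str                 # assembly text, for display only
    dest: Optional[str]       # register written, or None
    srcs: Tuple[str, ...]     # registers read


def find_raw_hazards(program: List[Instr], window: int = 2):
    """Return (producer_index, consumer_index, register) triples.

    For each source register of each instruction, look back at most
    `window` instructions for the nearest earlier writer of that register.
    """
    hazards = []
    for j, consumer in enumerate(program):
        for reg in dict.fromkeys(consumer.srcs):  # de-duplicate, keep order
            for i in range(j - 1, max(-1, j - window - 1), -1):
                if program[i].dest == reg:
                    hazards.append((i, j, reg))
                    break  # the nearest writer is the real producer
    return hazards


program = [
    Instr("lw  r1, 0(r2)",  "r1", ("r2",)),
    Instr("add r3, r1, r4", "r3", ("r1", "r4")),
    Instr("sub r5, r3, r1", "r5", ("r3", "r1")),
    Instr("or  r6, r7, r8", "r6", ("r7", "r8")),
]

for i, j, reg in find_raw_hazards(program):
    print(f"{program[j].text!r} needs {reg} from {program[i].text!r}")
```

Here the `add` depends on the load one slot earlier, and the `sub` depends on both the `add` and the load. The `or` is independent and could proceed without waiting.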

Cache Hierarchy

The speed gap between the CPU and main memory is large: a main-memory access costs roughly 200 cycles, against 3-4 cycles for an L1 hit. Caches bridge this gap.

Level	Size	Latency	Bandwidth
L1 (per core)	32-64 KB	~1 ns (3-4 cycles)	1 TB/s+
L2 (per core)	256-512 KB	~5 ns (10-12 cycles)	500 GB/s
L3 (shared)	8-32 MB	~15 ns (30-40 cycles)	200 GB/s
Main RAM	8-64 GB	~80 ns (200+ cycles)	50 GB/s
SSD	256 GB-2 TB	~10,000 ns	3-5 GB/s
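To see why high hit rates matter, the latencies in the table can be combined into an expected access time. This is a rough sketch: the latencies come from the table, but the hit rates (95%, 80%, 50%) are made-up values for illustration, and each latency is treated as the total cost of an access served by that level.

```python
def average_access_time_ns(levels, memory_latency_ns):
    """levels: list of (latency_ns, hit_rate), ordered fastest to slowest.

    An access is served by the first level that hits; if every cache
    level misses, it is served by main memory.
    """
    expected = 0.0
    p_reach = 1.0  # probability that all faster levels missed
    for latency_ns, hit_rate in levels:
        expected += p_reach * hit_rate * latency_ns
        p_reach *= 1.0 - hit_rate
    return expected + p_reach * memory_latency_ns


# (latency from the table, ASSUMED hit rate)
caches = [(1.0, 0.95), (5.0, 0.80), (15.0, 0.50)]
print(f"{average_access_time_ns(caches, 80.0):.3f} ns")  # 1.625 ns

# With no caches at all, every access pays the full RAM latency.
print(f"{average_access_time_ns([], 80.0):.3f} ns")      # 80.000 ns
```

Under these assumed hit rates, only 0.5% of accesses reach main memory, yet they contribute about a quarter of the average.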

Cache Locality

Temporal locality: if you access an address, you’ll likely access it again soon, so it pays to keep it in cache. Spatial locality: if you access an address, you’ll likely access nearby addresses, which is why a cache miss loads a whole cache line (typically 64 bytes) rather than a single word.

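import time

def sum_array(arr):
    """Sequential access - good spatial locality"""
    total = 0
    for i in range(len(arr)):
        total += arr[i]
    return total

def sum_strided(arr, stride=64):
    """Strided access - poor spatial locality.

    Visits every element exactly once, but jumps `stride` elements
    between consecutive accesses.
    """
    total = 0
    for offset in range(stride):
        for i in range(offset, len(arr), stride):
            total += arr[i]
    return total

arr = list(range(10_000_000))

# perf_counter is a monotonic, high-resolution clock meant for timing code
start = time.perf_counter()
sum_array(arr)
print(f"Sequential: {time.perf_counter() - start:.3f}s - GOOD locality")

start = time.perf_counter()
sum_strided(arr, 64)
print(f"Strided: {time.perf_counter() - start:.3f}s - POOR locality")

Expected output (timings vary by machine, so none are shown):

Sequential: ...s - GOOD locality
Strided: ...s - POOR locality

Both functions add every element exactly once, so any gap between the two timings comes from access order rather than from doing less work. The strided pass is typically the slower one. In CPython, interpreter overhead hides part of the cache effect; the difference is clearer with a packed array or in a compiled language.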
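The cost of a strided pattern can also be counted directly, without timing anything. The sketch below counts how many distinct cache lines a sequence of accesses touches. It assumes 64-byte lines, 8-byte elements packed contiguously, and an array that starts on a line boundary; a Python list of int objects is not laid out this way, so treat this as a model of a packed array.

```python
LINE_BYTES = 64  # cache line size
ELEM_BYTES = 8   # ASSUMED element size, e.g. a 64-bit integer


def lines_touched(indices):
    """Count distinct cache lines hit by accesses to these element indices."""
    return len({(i * ELEM_BYTES) // LINE_BYTES for i in indices})


n_accesses = 1024
sequential = range(n_accesses)            # elements 0, 1, 2, ...
strided = range(0, n_accesses * 64, 64)   # elements 0, 64, 128, ...

print(lines_touched(sequential))  # 128
print(lines_touched(strided))     # 1024
```

For the same number of accesses, the sequential pattern gets eight elements out of every line it loads, while the strided pattern loads a new line on every access.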

RISC vs CISC

Feature	RISC (ARM, RISC-V)	CISC (x86)
Instruction size	Fixed (typically 4 bytes)	Variable (1-15 bytes)
Operations	Register-to-register	Register and memory
Addressing modes	Few	Many
Pipelining	Simple, efficient	Complex, more hazards
Code density	Lower	Higher
Power efficiency	Excellent	Moderate

Modern x86 CPUs are actually CISC externally, RISC internally — they decode complex instructions into micro-ops (µops) that execute on a RISC-like backend.
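As a toy illustration of that decoding step, the sketch below splits a two-operand instruction with a memory operand into load, ALU, and store micro-ops. Everything here is hypothetical: the tuple format, the temporary register `t0`, and the splitting rule are invented for the example, and real micro-op sequences depend on the specific microarchitecture.

```python
def is_memory(operand: str) -> bool:
    """Treat operands written as [reg] as memory references."""
    return operand.startswith("[") and operand.endswith("]")


def decode_to_uops(op: str, dst: str, src: str):
    """Split `op dst, src` (meaning dst = dst op src) into register-only uops."""
    if is_memory(dst) and is_memory(src):
        raise ValueError("at most one memory operand is supported")
    uops = []
    if is_memory(src):
        # Load the memory source into a temporary, then operate on registers.
        uops.append(("load", "t0", src))
        uops.append((op, dst, dst, "t0"))
    elif is_memory(dst):
        # Read-modify-write: load, operate, store back.
        uops.append(("load", "t0", dst))
        uops.append((op, "t0", "t0", src))
        uops.append(("store", dst, "t0"))
    else:
        uops.append((op, dst, dst, src))
    return uops


for instr in [("add", "rax", "rbx"), ("add", "rax", "[rbx]"), ("add", "[rbx]", "rax")]:
    print(instr, "->", decode_to_uops(*instr))
```

The register-only form maps to a single micro-op, while the read-modify-write form becomes three, each of which looks like a RISC load/store or ALU instruction.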

Instruction-Level Parallelism

Modern CPUs execute multiple instructions per cycle through:

Superscalar execution — multiple execution units (2-8 ALUs, FPUs) run in parallel
Out-of-order execution — instructions execute when operands are ready, not in program order
Simultaneous multithreading (SMT) — one core runs two threads, sharing execution units
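How much these techniques help depends on the dependencies in the code. The sketch below is a simplified model, not a description of any real scheduler: every instruction takes one cycle, up to `width` ready instructions start per cycle (oldest first), and an instruction is ready once all of its producers finished in an earlier cycle. The function name and the dependency lists are made up for the example.

```python
def cycles_needed(deps, width):
    """deps[i] is the set of earlier instruction indices that i depends on."""
    if width < 1:
        raise ValueError("width must be at least 1")
    done_cycle = {}  # instruction index -> cycle in which it executed
    cycle = 0
    while len(done_cycle) < len(deps):
        cycle += 1
        ready = [
            i for i in range(len(deps))
            if i not in done_cycle
            and all(done_cycle.get(d, cycle) < cycle for d in deps[i])
        ]
        for i in ready[:width]:
            done_cycle[i] = cycle
    return cycle


# 0: a = x + y    1: b = a * 2    2: c = z + w
# 3: d = c * 3    4: e = b + d    5: f = u + v
deps = [set(), {0}, set(), {2}, {1, 3}, set()]

print(cycles_needed(deps, width=1))  # 6
print(cycles_needed(deps, width=2))  # 3
print(cycles_needed(deps, width=4))  # 3
```

Going from one to two execution slots halves the cycle count, but a wider machine gains nothing more here: the chain 0 -> 1 -> 4 is three instructions long and sets the floor.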

Common Mistakes

Ignoring cache effects: Random access to large data structures can be 100x slower than sequential access. Profile before optimizing.
Ignoring cache coherency between cores: When several cores write to the same cache line, the line bounces between their caches, so shared writable data is expensive. Keep frequently written data per-core when possible.
Writing branch-heavy code on pipelined CPUs: Each branch misprediction costs 10-20 cycles. Prefer branchless patterns when possible.
Confusing latency with bandwidth: SSD latency is 10,000 ns but bandwidth is 5 GB/s. Small random reads are latency-bound, large sequential reads are bandwidth-bound.
Not using SIMD: Most CPUs support SIMD (AVX, NEON) for parallel data processing. Compilers auto-vectorize simple loops.
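The latency versus bandwidth point from the list above can be checked with a simple model: read time is roughly latency plus size divided by bandwidth. The sketch uses the SSD figures quoted in that list (10,000 ns, 5 GB/s); the model ignores queuing and parallel requests, and the function names are illustrative.

```python
SSD_LATENCY_S = 10_000e-9   # 10,000 ns
SSD_BANDWIDTH_BPS = 5e9     # 5 GB/s


def read_time_s(size_bytes, latency_s=SSD_LATENCY_S, bandwidth_bps=SSD_BANDWIDTH_BPS):
    """Approximate time for one read: fixed latency plus transfer time."""
    return latency_s + size_bytes / bandwidth_bps


def latency_share(size_bytes):
    """Fraction of the total read time spent waiting on latency."""
    return SSD_LATENCY_S / read_time_s(size_bytes)


for label, size in [("4 KiB", 4 * 1024), ("1 MiB", 1024**2), ("1 GiB", 1024**3)]:
    total_us = read_time_s(size) * 1e6
    print(f"{label}: {total_us:,.1f} us total, {latency_share(size):.1%} latency")
```

In this model a 4 KiB read spends over 90% of its time on latency, while for a 1 GiB read the latency is negligible and bandwidth sets the time.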

Practice Questions

What are the 5 stages of a classic RISC pipeline? Fetch, Decode, Execute, Memory Access, Write Back.
Why is L1 cache so much faster than main memory? L1 sits on the CPU die, uses SRAM (6 transistors/bit), and runs at CPU clock speed. Main RAM uses DRAM (1 transistor plus 1 capacitor per bit), sits off-chip, and is reached through the memory controller and bus.
What is a branch misprediction penalty? When the CPU guesses wrong about a branch direction, it must flush the pipeline, discarding the speculatively fetched instructions, and restart from the correct address. This typically costs around 10-20 cycles.
How does RISC differ from CISC? RISC uses fixed-size instructions, register-to-register operations, and simple addressing. CISC uses variable-size instructions and complex operations that can access memory directly.
What is Amdahl’s Law? The speedup from parallelization is limited by the serial portion of the program. If 10% must be serial, max speedup is 10x regardless of cores.
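The Amdahl’s Law answer above can be verified numerically. With serial fraction s and n cores, the speedup is 1 / (s + (1 - s) / n). A short sketch (the function name is illustrative):

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup over one core when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


for cores in (2, 8, 64, 1024):
    print(f"{cores:>4} cores: {amdahl_speedup(0.10, cores):.2f}x")
#    2 cores: 1.82x
#    8 cores: 4.71x
#   64 cores: 8.77x
# 1024 cores: 9.91x
```

The speedup approaches 1 / 0.10 = 10x as the core count grows, but never reaches it.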

Challenge

Write a benchmark that compares sequential vs random memory access on an array of 100 million integers. Calculate the performance ratio. Then explain the result using cache hierarchy concepts.

Real-World Task

Use lscpu (Linux) or System Information (Windows) to find your CPU’s cache sizes, number of cores, and supported SIMD extensions.

Mini Project: Cache Simulator

Build a Python simulator that models L1, L2, and L3 caches with LRU replacement. Given a memory access trace (generated from a program), report hit rates for each cache level.
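One possible starting point for the project is a single cache level with LRU replacement. The sketch below is deliberately minimal and makes simplifying assumptions: the cache is fully associative (real caches are usually set-associative), it tracks only which lines are present, and the class and method names are illustrative. Chaining three levels and feeding a real trace is left to the project.

```python
from collections import OrderedDict


class LRUCache:
    """Single-level, fully associative cache with LRU replacement."""

    def __init__(self, n_lines: int, line_bytes: int = 64):
        self.n_lines = n_lines
        self.line_bytes = line_bytes
        self.lines = OrderedDict()  # line number -> None, oldest first
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Record one access; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[line] = None
        if len(self.lines) > self.n_lines:
            self.lines.popitem(last=False)  # evict the least recently used
        return False


cache = LRUCache(n_lines=2)
trace = [0, 8, 64, 0, 128, 64]  # byte addresses, a made-up trace
results = [cache.access(addr) for addr in trace]
print(results)                   # [False, True, False, True, False, False]
print(cache.hits, cache.misses)  # 2 4
```

In this trace the final access to address 64 misses because its line was evicted when address 128 was loaded.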

Security angle: Side-channel attacks like Spectre and Meltdown exploit CPU pipeline behavior — speculative execution can leak sensitive data across security boundaries. Understanding the pipeline helps you defend against these attacks.

What’s Next

Network Protocols — Next Lesson

Review: Compiler Design

Cryptography Basics

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.