Computer Architecture Explained — CPU Pipeline, Cache & RISC vs CISC
Computer architecture is the design of computer systems — how the CPU, memory, and I/O components are organized and connected to execute programs.
What You’ll Learn
In this tutorial, you’ll learn how the CPU pipeline works (fetch-decode-execute), the memory hierarchy from L1 cache to main memory, the RISC vs CISC design philosophies, and how modern CPUs achieve instruction-level parallelism.
Why It Matters
Understanding computer architecture helps you write code that runs faster by leveraging cache locality, avoiding pipeline stalls, and choosing the right data structures. It’s essential for systems programming, game development, and embedded systems.
Real-World Use
When your browser renders a page, the CPU is fetching instructions from L1 cache (1 ns), decoding them, executing ALU operations, and writing results — all in a pipeline. Durga Antivirus Pro optimizes its scanning engine to minimize cache misses, achieving real-time file scanning without noticeable slowdown.
```mermaid
graph LR
    subgraph "CPU Pipeline"
        A[Fetch] --> B[Decode]
        B --> C[Execute]
        C --> D[Memory Access]
        D --> E[Write Back]
    end
    F[L1 Cache ~1ns] --> A
    G[L2 Cache ~5ns] --> F
    H[L3 Cache ~15ns] --> G
    I[Main RAM ~80ns] --> H
    J[Disk ~10ms] -.-> I
    style A fill:#4f46e5,color:#fff
    style B fill:#4f46e5,color:#fff
    style C fill:#4f46e5,color:#fff
    style D fill:#4f46e5,color:#fff
    style E fill:#4f46e5,color:#fff
```
The CPU Pipeline
A modern CPU doesn’t execute one instruction at a time — it overlaps execution using a pipeline.
| Stage | What Happens | Hardware |
|---|---|---|
| Fetch | Read instruction from memory (L1 cache) | Program counter, instruction cache |
| Decode | Interpret opcode, identify operands | Decoder logic |
| Execute | Perform ALU operation or address calculation | ALU, FPU, address generator |
| Memory | Read/write data from/to cache or RAM | Load/store unit, data cache |
| Write Back | Store result in register file | Register file write port |
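The benefit of overlapping the five stages above can be estimated with a little arithmetic. The sketch below assumes an ideal pipeline: every stage takes exactly one cycle and nothing ever stalls. The function names are illustrative, not taken from any library.

```python
def pipelined_cycles(n_instructions, n_stages=5):
    """Cycles for an ideal pipeline: fill it once, then retire one instruction per cycle."""
    return n_stages + n_instructions - 1


def unpipelined_cycles(n_instructions, n_stages=5):
    """Cycles when each instruction must finish all stages before the next one starts."""
    return n_stages * n_instructions


n = 7
fast = pipelined_cycles(n)
slow = unpipelined_cycles(n)
print(f"{n} instructions: {fast} cycles pipelined, {slow} cycles unpipelined")
print(f"Ideal speedup: {slow / fast:.2f}x")
```

As the instruction count grows, the ideal speedup approaches the number of stages. Real pipelines fall short of this because of stalls.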
Pipeline Hazards
Pipelines face three types of hazards:
1. Structural hazards — Two instructions need the same hardware unit. Modern CPUs duplicate resources (multiple ALUs, multiple cache ports).
2. Data hazards — An instruction depends on a previous instruction’s result. Solved by forwarding (bypassing) or inserting stalls (bubbles).
3. Control hazards — Branches make the next instruction unknown. Solved by branch prediction and speculative execution.
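The data hazard case from the list above can be sketched by counting stall cycles with and without forwarding. This is a simplified textbook model, not a description of any specific CPU. It assumes a 5-stage pipeline in which a register written in Write Back can be read in Decode during the same cycle, an ALU result can be forwarded after Execute, and a load result can be forwarded after Memory. `Instr`, `count_stalls`, and the `"alu"`/`"load"` categories are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str       # "alu" or "load" (hypothetical categories)
    dest: str     # destination register
    srcs: tuple   # source registers


def count_stalls(program, forwarding):
    """Count bubbles caused by read-after-write hazards in a 5-stage pipeline."""
    ready = {}          # register -> earliest cycle a consumer may sit in Decode
    decode_cycle = 0    # Decode cycle of the previous instruction
    stalls = 0
    for instr in program:
        earliest = decode_cycle + 1
        needed = max([ready.get(r, 0) for r in instr.srcs], default=0)
        this_cycle = max(earliest, needed)
        stalls += this_cycle - earliest
        if forwarding:
            # ALU results bypass after Execute, load results after Memory
            latency = 2 if instr.op == "load" else 1
        else:
            # Consumer must wait for the producer's Write Back
            latency = 3
        ready[instr.dest] = this_cycle + latency
        decode_cycle = this_cycle
    return stalls


program = [
    Instr("load", "r1", ("r0",)),
    Instr("alu", "r2", ("r1", "r3")),   # needs r1 from the load
    Instr("alu", "r4", ("r2", "r1")),   # needs r2 from the previous add
]
print("Stalls without forwarding:", count_stalls(program, forwarding=False))
print("Stalls with forwarding:   ", count_stalls(program, forwarding=True))
```

Under these assumptions the only stall that forwarding cannot remove is the load followed immediately by a use of its result.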
```python
class SimplePipeline:
    """Toy model of a 5-stage pipeline: one instruction enters per cycle."""

    def __init__(self):
        self.stages = ["Fetch", "Decode", "Execute", "Memory", "WriteBack"]
        self.pipeline = ["—"] * 5
        self.cycle = 0

    def advance(self, instruction):
        self.cycle += 1
        # Shift every instruction one stage to the right, last stage first
        for i in range(4, 0, -1):
            self.pipeline[i] = self.pipeline[i - 1]
        self.pipeline[0] = instruction
        # Print the full pipeline, including the instruction now in WriteBack
        print(f"Cycle {self.cycle}: {' | '.join(self.pipeline)}")
        # The instruction in the last stage retires this cycle
        return self.pipeline[4]


pipe = SimplePipeline()
print("Cycle: Fetch | Decode | Execute | Memory | WriteBack")
for i in range(1, 8):
    pipe.advance(f"Inst{i}")
```

Expected output:

```
Cycle: Fetch | Decode | Execute | Memory | WriteBack
Cycle 1: Inst1 | — | — | — | —
Cycle 2: Inst2 | Inst1 | — | — | —
Cycle 3: Inst3 | Inst2 | Inst1 | — | —
Cycle 4: Inst4 | Inst3 | Inst2 | Inst1 | —
Cycle 5: Inst5 | Inst4 | Inst3 | Inst2 | Inst1
Cycle 6: Inst6 | Inst5 | Inst4 | Inst3 | Inst2
Cycle 7: Inst7 | Inst6 | Inst5 | Inst4 | Inst3
```

Cache Hierarchy
The speed gap between CPU and main memory is enormous. Caches bridge this gap.
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 (per core) | 32-64 KB | ~1 ns (3-4 cycles) | 1 TB/s+ |
| L2 (per core) | 256-512 KB | ~5 ns (10-12 cycles) | 500 GB/s |
| L3 (shared) | 8-32 MB | ~15 ns (30-40 cycles) | 200 GB/s |
| Main RAM | 8-64 GB | ~80 ns (200+ cycles) | 50 GB/s |
| SSD | 256 GB-2 TB | ~10,000 ns | 3-5 GB/s |
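The latencies in the table can be combined into an average memory access time (AMAT). The sketch below uses the table's latencies, but the miss rates are assumptions picked purely for illustration; real miss rates depend heavily on the workload. The function name `amat` is hypothetical.

```python
def amat(levels, memory_latency_ns):
    """Average memory access time in nanoseconds.

    levels: list of (hit_latency_ns, local_miss_rate), ordered from L1 outward.
    """
    total = memory_latency_ns
    # Work inward from main memory: each level adds its own hit latency
    # plus its miss rate times the cost of going one level further out.
    for hit_latency, miss_rate in reversed(levels):
        total = hit_latency + miss_rate * total
    return total


# Latencies from the table; miss rates are assumed, not measured.
levels = [
    (1, 0.05),   # L1: ~1 ns, assumed 5% miss rate
    (5, 0.20),   # L2: ~5 ns, assumed 20% of L1 misses also miss here
    (15, 0.50),  # L3: ~15 ns, assumed 50% of L2 misses also miss here
]
print(f"AMAT: {amat(levels, memory_latency_ns=80):.2f} ns")
```

With these assumed rates the average stays close to the L1 latency, which is the point of the hierarchy: most accesses never reach main memory.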
Cache Locality
Temporal locality: if you access an address, you’ll likely access it again soon (keep it in cache). Spatial locality: if you access an address, you’ll likely access nearby addresses, which is why caches load memory in whole cache lines (typically 64 bytes) rather than single bytes.
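Spatial locality can be made concrete by counting how many distinct cache lines an access pattern touches. The sketch below assumes 64-byte lines and 8-byte elements stored contiguously from address 0; `lines_touched` is a hypothetical helper and does not measure a real cache.

```python
LINE_SIZE = 64  # bytes per cache line (typical value)
ELEM_SIZE = 8   # assumed element size, e.g. a 64-bit integer


def lines_touched(indices, base_address=0):
    """Number of distinct cache lines covered by the given element indices."""
    return len({(base_address + i * ELEM_SIZE) // LINE_SIZE for i in indices})


count = 1024
sequential = range(count)             # elements 0, 1, 2, ...
strided = range(0, count * 8, 8)      # every 8th element, same number of accesses

print("Sequential:", lines_touched(sequential), "lines for", count, "accesses")
print("Strided:   ", lines_touched(strided), "lines for", count, "accesses")
```

Both patterns perform the same number of accesses, but the strided one pulls in a new cache line every time, so each line fetched serves only one access instead of eight.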
```python
import time


def sum_array(arr):
    """Sequential access: good spatial locality."""
    total = 0
    for x in arr:
        total += x
    return total


def sum_strided(arr, stride=64):
    """Strided access: visits every element, but jumps `stride` slots each step."""
    total = 0
    for offset in range(stride):
        for i in range(offset, len(arr), stride):
            total += arr[i]
    return total


arr = list(range(10_000_000))

start = time.perf_counter()
sum_array(arr)
print(f"Sequential: {time.perf_counter() - start:.3f}s - GOOD locality")

start = time.perf_counter()
sum_strided(arr, 64)
print(f"Strided: {time.perf_counter() - start:.3f}s - POOR locality")
```

Exact timings depend on the machine and Python version. Both functions visit all 10 million elements, so the strided run should be the slower of the two. Part of that gap comes from Python’s indexing overhead rather than cache misses, so treat the ratio as a rough indication only.

RISC vs CISC
| Feature | RISC (ARM, RISC-V) | CISC (x86) |
|---|---|---|
| Instruction size | Fixed (4 bytes) | Variable (1-15 bytes) |
| Operations | Register-to-register | Register and memory |
| Addressing modes | Few | Many |
| Pipelining | Simple, efficient | Complex, more hazards |
| Code density | Lower | Higher |
| Power efficiency | Excellent | Moderate |
Modern x86 CPUs are actually CISC externally, RISC internally — they decode complex instructions into micro-ops (µops) that execute on a RISC-like backend.
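The idea of splitting a complex instruction into simpler micro-ops can be illustrated with a toy decoder. Everything here is hypothetical: the instruction syntax, the `tmp0`/`tmp1` temporaries, and the `decode_to_uops` function are invented for illustration, and real x86 decoders produce different and more varied micro-ops.

```python
def is_memory(operand):
    """Treat bracketed operands such as [rbx] as memory references."""
    return operand.startswith("[") and operand.endswith("]")


def decode_to_uops(instruction):
    """Split a toy 'op dest, src' instruction into load/compute/store micro-ops."""
    op, operands = instruction.split(maxsplit=1)
    dest, src = [part.strip() for part in operands.split(",")]
    uops = []
    left, right = dest, src
    if is_memory(src):
        uops.append(f"load tmp1, {src}")    # bring the source into a temporary
        right = "tmp1"
    if is_memory(dest):
        uops.append(f"load tmp0, {dest}")   # read the old destination value
        left = "tmp0"
    uops.append(f"{op} {left}, {right}")    # register-to-register compute
    if is_memory(dest):
        uops.append(f"store {dest}, tmp0")  # write the result back to memory
    return uops


for text in ("add rax, rbx", "add rax, [rbx]", "add [rbx], rax"):
    print(f"{text:16} -> {decode_to_uops(text)}")
```

The register-only form maps to a single micro-op, while the read-modify-write form expands into a load, a compute, and a store, which is the kind of operation a load/store RISC design would express as separate instructions.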
Instruction-Level Parallelism
Modern CPUs execute multiple instructions per cycle through:
- Superscalar execution — multiple execution units (2-8 ALUs, FPUs) run in parallel
- Out-of-order execution — instructions execute when operands are ready, not in program order
- Simultaneous multithreading (SMT) — one core runs two threads, sharing execution units
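A rough feel for superscalar and out-of-order execution comes from scheduling a small dependency graph at different issue widths. The sketch below is an idealized model: every instruction takes one cycle, any ready instruction may issue regardless of program order, and the only limits are dependencies and issue width. The name `cycles_to_execute` and the example dependencies are hypothetical.

```python
def cycles_to_execute(deps, width):
    """Cycles needed when up to `width` ready instructions issue per cycle.

    deps[i] is the set of earlier instruction indices that instruction i needs.
    """
    done_cycle = {}                      # instruction index -> cycle it executed
    remaining = set(range(len(deps)))
    cycle = 0
    while remaining:
        cycle += 1
        # Ready means every dependency finished in an earlier cycle
        ready = [
            i for i in sorted(remaining)
            if all(d in done_cycle and done_cycle[d] < cycle for d in deps[i])
        ]
        for i in ready[:width]:          # oldest ready instructions go first
            done_cycle[i] = cycle
            remaining.discard(i)
    return cycle


# Instruction 2 needs 0 and 1; 4 needs 3; 5 needs 2 and 4.
deps = [set(), set(), {0, 1}, set(), {3}, {2, 4}]
for width in (1, 2, 4):
    print(f"Issue width {width}: {cycles_to_execute(deps, width)} cycles")
```

Widening the machine helps only until the longest dependency chain becomes the limit, which is why extra execution units give diminishing returns on code with little independent work.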
Common Mistakes
- Ignoring cache effects: Random access to large data structures can be 100x slower than sequential access. Profile before optimizing.
- Assuming sharing data between cores is free: Cache coherency overhead makes data that several cores write to expensive to keep consistent. Keep data per-core when possible.
- Writing branch-heavy code in hot loops: Each branch misprediction costs roughly 10-20 cycles. Where a branch is hard to predict, prefer branchless patterns.
- Confusing latency with bandwidth: SSD latency is 10,000 ns but bandwidth is 5 GB/s. Small random reads are latency-bound, large sequential reads are bandwidth-bound.
- Not using SIMD: Most CPUs support SIMD (AVX, NEON) for parallel data processing. Compilers auto-vectorize simple loops.
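The latency versus bandwidth point in the list above can be checked with simple arithmetic. The sketch below uses the SSD figures quoted there (10,000 ns latency, 5 GB/s bandwidth) and a simplified model in which a read costs one fixed latency plus the transfer time; `read_time_s` is a hypothetical helper, and real devices overlap many requests.

```python
SSD_LATENCY_S = 10_000e-9      # 10,000 ns, from the figures above
SSD_BANDWIDTH_BPS = 5e9        # 5 GB/s, from the figures above


def read_time_s(size_bytes, latency_s=SSD_LATENCY_S, bandwidth_bps=SSD_BANDWIDTH_BPS):
    """Simplified cost of one read: fixed latency plus size divided by bandwidth."""
    return latency_s + size_bytes / bandwidth_bps


for label, size in (("4 KB read", 4096), ("1 GB read", 1_000_000_000)):
    total = read_time_s(size)
    latency_share = SSD_LATENCY_S / total
    print(f"{label}: {total * 1e6:,.1f} us total, {latency_share:.1%} of it latency")
```

In this model the small read spends almost all of its time waiting on latency, while the large read is dominated by transfer time, which matches the latency-bound and bandwidth-bound distinction.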
Practice Questions
What are the 5 stages of a classic RISC pipeline? Fetch, Decode, Execute, Memory Access, Write Back.
Why is L1 cache so much faster than main memory? L1 sits on the CPU die, uses SRAM (6 transistors/bit), and runs at CPU clock speed. Main RAM uses DRAM (1 transistor plus 1 capacitor per bit) and connects via an off-chip memory bus.
What is a branch misprediction penalty? When the CPU guesses wrong about a branch direction, it must flush the pipeline, discarding the instructions it fetched speculatively, and restart from the correct address. This typically costs roughly 10-20 cycles.
How does RISC differ from CISC? RISC uses fixed-size instructions, register-to-register operations, and simple addressing. CISC uses variable-size instructions and complex operations that can access memory directly.
What is Amdahl’s Law? The speedup from parallelization is limited by the serial portion of the program. If 10% must be serial, max speedup is 10x regardless of cores.
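The Amdahl’s Law answer above can be verified numerically. The sketch assumes the standard form of the law, speedup = 1 / (s + (1 - s) / n), where s is the serial fraction and n the number of cores; the function name is illustrative.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's Law for a given serial fraction and core count."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


for cores in (2, 16, 1024, 1_000_000):
    print(f"{cores:>9} cores: {amdahl_speedup(0.10, cores):.2f}x")
```

With a 10% serial portion the speedup climbs toward 10x as cores are added but never reaches it.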
Challenge
Write a benchmark that compares sequential vs random memory access on an array of 100 million integers. Calculate the performance ratio. Then explain the result using cache hierarchy concepts.
Real-World Task
Use lscpu (Linux) or System Information (Windows) to find your CPU’s cache sizes, number of cores, and supported SIMD extensions.
Mini Project: Cache Simulator
Build a Python simulator that models L1, L2, and L3 caches with LRU replacement. Given a memory access trace (generated from a program), report hit rates for each cache level.
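As a possible starting point for the project, here is a minimal single-level sketch. It assumes a fully associative cache with 64-byte lines and LRU replacement; the class name, its parameters, and the tiny trace are all hypothetical, and a full solution would still need set associativity and the L2 and L3 levels.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache model with least-recently-used replacement."""

    def __init__(self, num_lines, line_size=64):
        self.num_lines = num_lines
        self.line_size = line_size
        self.lines = OrderedDict()   # line number -> None, oldest entry first
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Record one access and return True on a hit, False on a miss."""
        line = address // self.line_size
        if line in self.lines:
            self.lines.move_to_end(line)      # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        self.lines[line] = None
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)    # evict the least recently used line
        return False


cache = LRUCache(num_lines=2)
for address in (0, 8, 64, 0, 128, 64):        # made-up trace for illustration
    cache.access(address)
total = cache.hits + cache.misses
print(f"Hits: {cache.hits}, misses: {cache.misses}, hit rate: {cache.hits / total:.0%}")
```

One way to extend this to multiple levels is to forward every miss to the next, larger cache and count hits per level.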
Security angle: Side-channel attacks like Spectre and Meltdown exploit CPU pipeline behavior — speculative execution can leak sensitive data across security boundaries. Understanding the pipeline helps you defend against these attacks.
What’s Next
Congratulations on completing this Computer Architecture tutorial! Here’s where to go from here:
- Practice daily — Consistency is more important than long study sessions
- Build a project — Apply what you learned by building something real
- Explore related topics — Check out other tutorials in the same category
- Join the community — Discuss with other learners and share your progress
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro