Learn Computer: Compiler Design Explained — Lexical Analysis, Parsing & Code Generation

Compiler Design Explained — Lexical Analysis, Parsing & Code Generation

DodaTech Updated Jun 15, 2026 6 min read

A compiler translates high-level source code into low-level machine code — performing lexical analysis, parsing, semantic analysis, optimization, and code generation.

What You’ll Learn

In this tutorial, you’ll learn each phase of a compiler, the difference between top-down and bottom-up parsing, how an Abstract Syntax Tree (AST) is built, and how code optimization works.

Why It Matters

Compilers transform the code you write into something a computer can execute. Understanding compiler design helps you write more efficient code, debug obscure errors, and design domain-specific languages. Every framework, linter, and transpiler uses compiler techniques.

Real-World Use

When you write TypeScript, the TypeScript compiler (tsc) lexes, parses, type-checks, and emits JavaScript. Python’s CPython interpreter compiles source to bytecode before execution. Durga Antivirus Pro uses pattern-matching techniques similar to lexical analysis to scan files for threat signatures.


graph LR
  A[Source Code] --> B[Lexical Analysis]
  B --> C[Token Stream]
  C --> D[Syntax Analysis / Parsing]
  D --> E[Abstract Syntax Tree]
  E --> F[Semantic Analysis]
  F --> G[Intermediate Code Gen]
  G --> H[Optimization]
  H --> I[Code Generation]
  I --> J[Machine Code]
  style A fill:#4f46e5,color:#fff
  style J fill:#4f46e5,color:#fff

Phase 1: Lexical Analysis

The lexer (or tokenizer) reads the source code character by character and groups them into tokens — meaningful units like keywords, identifiers, operators, and literals.

import re

class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.tokens = []
        self.token_patterns = [
            ("NUMBER", r"\d+(\.\d*)?"),
            ("PLUS", r"\+"),
            ("MINUS", r"-"),
            ("MULTIPLY", r"\*"),
            ("DIVIDE", r"/"),
            ("LPAREN", r"\("),
            ("RPAREN", r"\)"),
            ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
            ("SKIP", r"[ \t\n]+"),
            ("ERROR", r"."),
        ]

    def tokenize(self):
        while self.pos < len(self.source):
            match = None
            for token_type, pattern in self.token_patterns:
                regex = re.compile(pattern)
                match = regex.match(self.source, self.pos)
                if match:
                    text = match.group(0)
                    if token_type != "SKIP" and token_type != "ERROR":
                        self.tokens.append((token_type, text))
                    elif token_type == "ERROR":
                        raise SyntaxError(f"Unexpected char: {text}")
                    self.pos = match.end()
                    break
        return self.tokens

lexer = Lexer("3 + 5 * (10 - 2)")
tokens = lexer.tokenize()
print(tokens)

Expected output:

[('NUMBER', '3'), ('PLUS', '+'), ('NUMBER', '5'), ('MULTIPLY', '*'), ('LPAREN', '('), ('NUMBER', '10'), ('MINUS', '-'), ('NUMBER', '2'), ('RPAREN', ')')]

Phase 2: Syntax Analysis (Parsing)

The parser takes tokens and builds an Abstract Syntax Tree (AST) — a tree representation of the program’s grammatical structure.

Top-Down Parsing (Recursive Descent)

Starts from the start symbol and applies production rules. Easy to write by hand for simple grammars.

Bottom-Up Parsing (LR)

Starts from the input tokens and reduces them to the start symbol using a stack. Handles more complex grammars. Tools like Yacc and Bison generate bottom-up parsers.

class ASTNode:
    def __init__(self, type, value=None, left=None, right=None):
        self.type = type
        self.value = value
        self.left = left
        self.right = right

    def __repr__(self):
        if self.type == "NUMBER":
            return str(self.value)
        return f"({self.left} {self.value} {self.right})"

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, expected_type=None):
        token = self.peek()
        if expected_type and token and token[0] != expected_type:
            raise SyntaxError(f"Expected {expected_type}, got {token}")
        self.pos += 1
        return token

    def parse_expression(self):
        # expr ::= term ( (+|-) term )*
        left = self.parse_term()
        while self.peek() and self.peek()[0] in ("PLUS", "MINUS"):
            op = self.consume()
            right = self.parse_term()
            left = ASTNode("BINOP", op[1], left, right)
        return left

    def parse_term(self):
        # term ::= factor ( (*|/) factor )*
        left = self.parse_factor()
        while self.peek() and self.peek()[0] in ("MULTIPLY", "DIVIDE"):
            op = self.consume()
            right = self.parse_factor()
            left = ASTNode("BINOP", op[1], left, right)
        return left

    def parse_factor(self):
        # factor ::= NUMBER | ( expr )
        token = self.peek()
        if token[0] == "NUMBER":
            self.consume("NUMBER")
            return ASTNode("NUMBER", int(token[1]))
        elif token[0] == "LPAREN":
            self.consume("LPAREN")
            expr = self.parse_expression()
            self.consume("RPAREN")
            return expr
        raise SyntaxError(f"Unexpected token: {token}")

    def parse(self):
        return self.parse_expression()

def evaluate(node):
    if node.type == "NUMBER":
        return node.value
    left = evaluate(node.left)
    right = evaluate(node.right)
    if node.value == "+": return left + right
    if node.value == "-": return left - right
    if node.value == "*": return left * right
    if node.value == "/": return left / right

tokens = [("NUMBER", "3"), ("PLUS", "+"), ("NUMBER", "5"), ("MULTIPLY", "*"),
          ("NUMBER", "2")]
parser = Parser(tokens)
ast = parser.parse()
print(f"AST: {ast}")
result = evaluate(ast)
print(f"Result: {result}")

Expected output:

AST: (3 + (5 * 2))
Result: 13

Phase 3: Semantic Analysis

The semantic analyzer checks for meaning-level errors:

Type checking (adding string + number?)
Scope resolution (is this variable defined?)
Declaration checking (function called before definition?)

Phase 4: Intermediate Code Generation and Optimization

The compiler generates an intermediate representation (IR) — a platform-independent code. Optimization passes then improve the IR:

Optimization	Description
Constant folding	`2 + 3` → `5` at compile time
Dead code elimination	Remove unreachable code
Loop unrolling	Duplicate loop body to reduce branching
Inline expansion	Replace function call with function body
Common subexpression elimination	Reuse computed values

Phase 5: Code Generation

The final phase generates target machine code (x86, ARM, WebAssembly, or JVM bytecode). This involves register allocation, instruction selection, and instruction scheduling.

Common Mistakes

Confusing lexical and syntax analysis: The lexer produces tokens; the parser builds structure. Don’t try to parse in the lexer.
Left recursion in top-down parsers: A rule like expr ::= expr + term causes infinite recursion. Use left-factoring to eliminate it.
Ignoring operator precedence: 3 + 5 * 2 should be 3 + (5 * 2), not (3 + 5) * 2. The parser must enforce precedence.
Not handling errors gracefully: A good compiler reports all errors in one pass, not stopping at the first one.
Over-optimizing: Aggressive optimizations can increase compile time and binary size without measurable runtime benefit.

Practice Questions

What does a lexer do? Converts source code characters into tokens (keywords, identifiers, operators, literals).
What’s the difference between top-down and bottom-up parsing? Top-down (recursive descent) starts from the start symbol and expands. Bottom-up (LR) starts from input and reduces to the start symbol.
What is an AST? An Abstract Syntax Tree represents the grammatical structure of code without concrete syntax details like parentheses or semicolons.
Why is intermediate code useful? It decouples the frontend (language-specific) from the backend (machine-specific), enabling support for multiple source and target languages.
What is constant folding? Evaluating constant expressions at compile time instead of runtime. x = 60 * 60 * 24 becomes x = 86400.

Challenge

Extend the expression parser to support variables and assignment. Add an identifier token type and a symbol table to store variable values.

Real-World Task

Run gcc -O2 -S program.c on a simple C program. Look at the assembly output. Can you spot any optimizations (constant folding, dead code elimination)?

Mini Project: Simple Expression Compiler

Build a four-phase compiler for arithmetic expressions: lexer → parser → code generator → output. Generate stack-based virtual machine instructions (like Java bytecode) for expressions.

Security angle: Compiler hardening techniques (stack canaries, ASLR, control-flow integrity) prevent memory corruption exploits. These are the same protections Durga Antivirus Pro checks for during vulnerability scanning.

What’s Next

Computer Architecture — Next Lesson

Review: DBMS Concepts

Network Protocols

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

What’s Next

Congratulations on completing this Compiler Design tutorial! Here’s where to go from here:

Practice daily — Consistency is more important than long study sessions
Build a project — Apply what you learned by building something real
Explore related topics — Check out other tutorials in the same category
Join the community — Discuss with other learners and share your progress

Remember: every expert was once a beginner. Keep coding!

Previous DBMS Concepts Explained — Relational Model, Normalization & ACID Next Computer Architecture Explained — CPU Pipeline, Cache & RISC vs CISC

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro

Home Browse Computer Science