Skip to content
Compiler Design Basics — Lexing, Parsing & Code Generation

Compiler Design Basics — Lexing, Parsing & Code Generation

DodaTech Updated Jun 20, 2026 9 min read

A compiler translates high-level source code into machine code through a series of well-defined phases — lexing, parsing, semantic analysis, optimisation, and code generation.

What You’ll Learn

In this tutorial, you’ll learn each phase of a compiler: how a lexer converts source to tokens, how a parser builds an AST, how semantic analysis checks types, how intermediate code is generated and optimised, and how final machine code is emitted.

Why It Matters

Every time you run TypeScript, Babel, a SQL query planner, or even a code linter, you’re using compiler techniques. Understanding how compilers work helps you write faster code, debug build errors, and design domain-specific languages.

Real-World Use

The TypeScript compiler (tsc) lexes .ts files, parses them into an AST, performs type checking, and emits JavaScript. Python’s CPython interpreter compiles source to bytecode before executing it. Durga Antivirus Pro uses pattern-matching techniques inspired by lexical analysis for threat signature scanning.

    graph LR
  A[Source Code] --> B[Lexer]
  B --> C[Token Stream]
  C --> D[Parser]
  D --> E[AST]
  E --> F[Semantic Analysis]
  F --> G[IR Generation]
  G --> H[Optimiser]
  H --> I[Code Generator]
  I --> J[Machine Code]
  style A fill:#4f46e5,color:#fff
  style J fill:#4f46e5,color:#fff
  

Phase 1: Lexical Analysis

The lexer reads source code character by character and groups them into tokens — meaningful units like keywords, identifiers, operators, and literals. Regular expressions (regex) describe token patterns, and a Deterministic Finite Automaton (DFA) matches them efficiently.

import re

class Lexer:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.tokens = []

    token_spec = [
        ('NUMBER',   r'\d+(\.\d*)?'),
        ('IDENT',    r'[A-Za-z_][A-Za-z0-9_]*'),
        ('PLUS',     r'\+'),
        ('MINUS',    r'-'),
        ('MUL',      r'\*'),
        ('DIV',      r'/'),
        ('LPAREN',   r'\('),
        ('RPAREN',   r'\)'),
        ('ASSIGN',   r'='),
        ('SEMI',     r';'),
        ('SKIP',     r'[ \t\n]+'),
        ('MISMATCH', r'.'),
    ]

    def tokenize(self):
        while self.pos < len(self.source):
            match = None
            for tok_type, pattern in self.token_spec:
                regex = re.compile(pattern)
                match = regex.match(self.source, self.pos)
                if match:
                    text = match.group(0)
                    if tok_type == 'MISMATCH':
                        raise SyntaxError(f'Unexpected: {text}')
                    if tok_type != 'SKIP':
                        self.tokens.append((tok_type, text))
                    self.pos = match.end()
                    break
        return self.tokens

code = 'x = 5 + 3;'
lexer = Lexer(code)
tokens = lexer.tokenize()
print(tokens)

Expected output:

[('IDENT', 'x'), ('ASSIGN', '='), ('NUMBER', '5'), ('PLUS', '+'), ('NUMBER', '3'), ('SEMI', ';')]

Phase 2: Parsing

The parser takes tokens and builds an Abstract Syntax Tree (AST). There are two main approaches:

  • Top-down (recursive descent): easy to write by hand, starts from the start symbol
  • Bottom-up (LR): handles more grammars, used by Yacc/Bison

Recursive Descent Parser

Each grammar rule becomes a function. The parser follows the grammar’s production rules recursively.

class ASTNode:
    def __init__(self, type, value=None, children=None):
        self.type = type
        self.value = value
        self.children = children or []

    def __repr__(self):
        if self.type == 'number':
            return str(self.value)
        return f'({self.value} {" ".join(repr(c) for c in self.children)})'

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def consume(self, expected=None):
        tok = self.peek()
        if expected and tok and tok[0] != expected:
            raise SyntaxError(f'Expected {expected}, got {tok[0]}')
        self.pos += 1
        return tok

    # expr ::= term (('+' | '-') term)*
    def parse_expr(self):
        left = self.parse_term()
        while self.peek() and self.peek()[0] in ('PLUS', 'MINUS'):
            op = self.consume()
            right = self.parse_term()
            left = ASTNode('binop', op[1], [left, right])
        return left

    # term ::= factor (('*' | '/') factor)*
    def parse_term(self):
        left = self.parse_factor()
        while self.peek() and self.peek()[0] in ('MUL', 'DIV'):
            op = self.consume()
            right = self.parse_factor()
            left = ASTNode('binop', op[1], [left, right])
        return left

    # factor ::= NUMBER | '(' expr ')'
    def parse_factor(self):
        tok = self.peek()
        if tok[0] == 'NUMBER':
            self.consume('NUMBER')
            return ASTNode('number', int(tok[1]))
        elif tok[0] == 'LPAREN':
            self.consume('LPAREN')
            expr = self.parse_expr()
            self.consume('RPAREN')
            return expr
        raise SyntaxError(f'Unexpected token: {tok}')

    def parse(self):
        return self.parse_expr()

tokens = [('NUMBER','3'),('PLUS','+'),('NUMBER','5'),('MUL','*'),('NUMBER','2')]
parser = Parser(tokens)
ast = parser.parse()
print(f'AST: {ast}')

Expected output:

AST: (+ 3 (* 5 2))

Operator Precedence

The parser above gives * higher precedence than + because parse_expr calls parse_term first. This correctly evaluates 3 + 5 * 2 as 3 + (5 * 2).

LL vs LR Parsing

FeatureLL (top-down)LR (bottom-up)
DirectionLeft-to-right, leftmost derivationLeft-to-right, rightmost derivation
GrammarMust be LL(k) (no left recursion)Handles left recursion
Table sizeSmallerLarger
ErrorsEarly detectionDetects later
ToolsANTLR, handwrittenYacc, Bison, Bison++

Phase 3: Semantic Analysis

The semantic analyser checks meaning beyond syntax — type checking, scope resolution, and declaration verification.

class SymbolTable:
    def __init__(self):
        self.symbols = {}
        self.parent = None

    def define(self, name, var_type):
        self.symbols[name] = var_type

    def lookup(self, name):
        if name in self.symbols:
            return self.symbols[name]
        if self.parent:
            return self.parent.lookup(name)
        return None

    def enter_scope(self):
        child = SymbolTable()
        child.parent = self
        return child

def type_check(node, symtab):
    if node.type == 'number':
        return 'int'
    elif node.type == 'ident':
        var_type = symtab.lookup(node.value)
        if not var_type:
            raise TypeError(f'Undefined variable: {node.value}')
        return var_type
    elif node.type == 'binop':
        left_type = type_check(node.children[0], symtab)
        right_type = type_check(node.children[1], symtab)
        if left_type != right_type:
            raise TypeError(
                f'Type mismatch: {left_type} vs {right_type}')
        return left_type
    elif node.type == 'assign':
        var_type = symtab.lookup(node.children[0].value)
        val_type = type_check(node.children[1], symtab)
        if var_type != val_type:
            raise TypeError(
                f'Cannot assign {val_type} to {var_type}')
        return var_type

symtab = SymbolTable()
symtab.define('x', 'int')
symtab.define('y', 'string')

ast1 = ASTNode('assign', '=', [
    ASTNode('ident', 'x'),
    ASTNode('number', 42),
])
print(type_check(ast1, symtab))  # int

try:
    ast2 = ASTNode('assign', '=', [
        ASTNode('ident', 'y'),
        ASTNode('number', 42),
    ])
    print(type_check(ast2, symtab))
except TypeError as e:
    print(f'Type error: {e}')

Expected output:

int
Type error: Cannot assign int to string

Phase 4: Intermediate Representation (IR)

The compiler generates an IR — a platform-independent representation. Common forms:

  • Three-address code: t1 = a + b; t2 = t1 * c;
  • SSA (Static Single Assignment): each variable assigned exactly once
  • Stack-based: like Java bytecode
def generate_ir(node, temp_counter=[0]):
    if node.type == 'number':
        return str(node.value)
    elif node.type == 'binop':
        left = generate_ir(node.children[0], temp_counter)
        right = generate_ir(node.children[1], temp_counter)
        temp_counter[0] += 1
        t = f't{temp_counter[0]}'
        print(f'{t} = {left} {node.value} {right}')
        return t

print('Three-address code for (3 + 5) * 2:')
ast = ASTNode('binop', '*', [
    ASTNode('binop', '+', [ASTNode('number', 3), ASTNode('number', 5)]),
    ASTNode('number', 2),
])
generate_ir(ast)

Expected output:

Three-address code for (3 + 5) * 2:
t1 = 3 + 5
t2 = t1 * 2

Phase 5: Optimisation Passes

OptimisationDescriptionExample
Constant foldingEvaluate constant expressions at compile time2 + 35
Dead code eliminationRemove unreachable codereturn 1; x = 2;return 1;
Loop unrollingDuplicate loop body to reduce branchingfor i in 0..2: f(i)f(0); f(1); f(2)
Common subexpression eliminationReuse computed valuesa = x + y; b = x + yt = x + y; a = t; b = t
Inline expansionReplace function call with bodysquare(3)3 * 3
def constant_fold(node):
    """Fold constant expressions in the AST"""
    if node.type == 'binop':
        left = constant_fold(node.children[0])
        right = constant_fold(node.children[1])
        if left.type == 'number' and right.type == 'number':
            ops = {'+': lambda a, b: a + b,
                   '-': lambda a, b: a - b,
                   '*': lambda a, b: a * b,
                   '/': lambda a, b: a // b}
            result = ops[node.value](left.value, right.value)
            return ASTNode('number', result)
        return ASTNode('binop', node.value, [left, right])
    return node

ast = ASTNode('binop', '+', [
    ASTNode('binop', '*', [ASTNode('number', 2), ASTNode('number', 3)]),
    ASTNode('number', 5),
])
print(f'Before: {ast}')
optimised = constant_fold(ast)
print(f'After:  {optimised}')

Expected output:

Before: (+ (* 2 3) 5)
After:  11

Phase 6: Code Generation

The final phase emits target machine code or bytecode. A simple stack-based VM:

def compile_to_vm(node):
    """Generate stack-based VM instructions"""
    code = []
    if node.type == 'number':
        code.append(f'PUSH {node.value}')
    elif node.type == 'binop':
        code.extend(compile_to_vm(node.children[0]))
        code.extend(compile_to_vm(node.children[1]))
        op_map = {'+': 'ADD', '-': 'SUB', '*': 'MUL', '/': 'DIV'}
        code.append(op_map[node.value])
    return code

ast = ASTNode('binop', '+', [
    ASTNode('number', 3),
    ASTNode('binop', '*', [
        ASTNode('number', 5),
        ASTNode('number', 2),
    ]),
])

instructions = compile_to_vm(ast)
print('VM Instructions:')
for i, instr in enumerate(instructions):
    print(f'  {i:3d}: {instr}')
print(f'\nResult: {3 + 5 * 2}')

Expected output:

VM Instructions:
    0: PUSH 3
    1: PUSH 5
    2: PUSH 2
    3: MUL
    4: ADD

Result: 13

Common Mistakes

1. Left recursion in recursive descent parsers

A rule like expr ::= expr + term causes infinite recursion. Use right recursion or left-factoring instead.

2. Confusing lexer and parser roles

The lexer produces tokens; the parser builds structure. Don’t try to balance parentheses or check syntax in the lexer.

3. Ignoring operator precedence

Without precedence handling, 3 + 5 * 2 evaluates as (3 + 5) * 2. The grammar must encode precedence through non-terminal hierarchy.

4. Not handling errors gracefully

A good compiler reports all errors in one pass and recovers to find more. Stopping at the first error frustrates users.

5. Over-optimising

Aggressive inlining and loop unrolling can bloat binary size and increase compile time with minimal runtime benefit.

6. Forgetting the symbol table in semantic analysis

Without tracking scopes, variable shadowing and type checking break. Each scope needs its own symbol table with parent linkage.

Practice Questions

  1. What does a lexer produce? A stream of tokens — each token is a (type, value) pair representing keywords, identifiers, operators, or literals.

  2. What’s the difference between LL and LR parsing? LL parsers produce a leftmost derivation and work top-down. LR parsers produce a rightmost derivation and work bottom-up. LR handles more grammars.

  3. What is an Abstract Syntax Tree? A tree representation of the program’s syntactic structure, omitting concrete syntax details like parentheses and semicolons.

  4. Why is intermediate code useful? It decouples the frontend (language-specific) from the backend (machine-specific), enabling support for multiple source and target languages.

  5. What is constant folding? Evaluating constant expressions at compile time. x = 60 * 60 * 24 becomes x = 86400, saving runtime computation.

Challenge

Extend the recursive descent parser to support variable declarations (let x = 5;) and print statements (print x;). Generate three-address code for both.

Real-World Task

Run gcc -O2 -S -fopt-info program.c on a small C program. Examine the optimisation dump — what passes ran? How did constant folding or dead code elimination change the output?

FAQ

What is the difference between a compiler and an interpreter?
A compiler translates the entire program to machine code before execution. An interpreter executes code line by line. Many modern languages (Java, Python) use a hybrid: compile to bytecode, then interpret or JIT-compile.
What is SSA form?
Static Single Assignment form ensures each variable is assigned exactly once. It simplifies optimisations like constant propagation and dead code elimination.
What tools are used to generate parsers?
ANTLR (LL), Yacc/Bison (LR), and PEG parsers (packrat) are popular. Many production compilers hand-write recursive descent parsers for better error messages.
How do JIT compilers work?
Just-In-Time compilers translate bytecode to native machine code at runtime. They profile hot paths and optimise them aggressively, combining the portability of an interpreter with native speed.
What is a transpiler?
A transpiler (source-to-source compiler) converts between high-level languages — TypeScript to JavaScript, Babel transforms modern JS to ES5.

Mini Project: Simple Expression Compiler

Build a compiler for arithmetic expressions:

  1. Lexer: tokenises numbers, operators, parentheses
  2. Parser: builds AST with correct precedence
  3. Code generator: emits stack-based VM instructions
  4. VM: executes the instructions and returns the result

Security angle: Compiler hardening (stack canaries, ASLR, CFI) prevents exploits. These are the same protections Durga Antivirus Pro checks during vulnerability scanning.

What’s Next

Before moving on, you should understand:

  • The phases of a compiler: lexer → parser → semantic analysis → IR → code generation
  • Recursive descent parsing and operator precedence
  • How SSA and three-address code represent programs
  • Common optimisation passes

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro