Compiler Design Basics — Lexing, Parsing & Code Generation
A compiler translates high-level source code into machine code through a series of well-defined phases — lexing, parsing, semantic analysis, optimisation, and code generation.
What You’ll Learn
In this tutorial, you’ll learn each phase of a compiler: how a lexer converts source to tokens, how a parser builds an AST, how semantic analysis checks types, how intermediate code is generated and optimised, and how final machine code is emitted.
Why It Matters
Every time you run TypeScript, Babel, a SQL query planner, or even a code linter, you’re using compiler techniques. Understanding how compilers work helps you write faster code, debug build errors, and design domain-specific languages.
Real-World Use
The TypeScript compiler (tsc) lexes .ts files, parses them into an AST, performs type checking, and emits JavaScript. Python’s CPython interpreter compiles source to bytecode before executing it. Durga Antivirus Pro uses pattern-matching techniques inspired by lexical analysis for threat signature scanning.
graph LR
A[Source Code] --> B[Lexer]
B --> C[Token Stream]
C --> D[Parser]
D --> E[AST]
E --> F[Semantic Analysis]
F --> G[IR Generation]
G --> H[Optimiser]
H --> I[Code Generator]
I --> J[Machine Code]
style A fill:#4f46e5,color:#fff
style J fill:#4f46e5,color:#fff
Phase 1: Lexical Analysis
The lexer reads source code character by character and groups them into tokens — meaningful units like keywords, identifiers, operators, and literals. Regular expressions (regex) describe token patterns, and a Deterministic Finite Automaton (DFA) matches them efficiently.
import re
class Lexer:
def __init__(self, source):
self.source = source
self.pos = 0
self.tokens = []
token_spec = [
('NUMBER', r'\d+(\.\d*)?'),
('IDENT', r'[A-Za-z_][A-Za-z0-9_]*'),
('PLUS', r'\+'),
('MINUS', r'-'),
('MUL', r'\*'),
('DIV', r'/'),
('LPAREN', r'\('),
('RPAREN', r'\)'),
('ASSIGN', r'='),
('SEMI', r';'),
('SKIP', r'[ \t\n]+'),
('MISMATCH', r'.'),
]
def tokenize(self):
while self.pos < len(self.source):
match = None
for tok_type, pattern in self.token_spec:
regex = re.compile(pattern)
match = regex.match(self.source, self.pos)
if match:
text = match.group(0)
if tok_type == 'MISMATCH':
raise SyntaxError(f'Unexpected: {text}')
if tok_type != 'SKIP':
self.tokens.append((tok_type, text))
self.pos = match.end()
break
return self.tokens
code = 'x = 5 + 3;'
lexer = Lexer(code)
tokens = lexer.tokenize()
print(tokens)Expected output:
[('IDENT', 'x'), ('ASSIGN', '='), ('NUMBER', '5'), ('PLUS', '+'), ('NUMBER', '3'), ('SEMI', ';')]Phase 2: Parsing
The parser takes tokens and builds an Abstract Syntax Tree (AST). There are two main approaches:
- Top-down (recursive descent): easy to write by hand, starts from the start symbol
- Bottom-up (LR): handles more grammars, used by Yacc/Bison
Recursive Descent Parser
Each grammar rule becomes a function. The parser follows the grammar’s production rules recursively.
class ASTNode:
def __init__(self, type, value=None, children=None):
self.type = type
self.value = value
self.children = children or []
def __repr__(self):
if self.type == 'number':
return str(self.value)
return f'({self.value} {" ".join(repr(c) for c in self.children)})'
class Parser:
def __init__(self, tokens):
self.tokens = tokens
self.pos = 0
def peek(self):
return self.tokens[self.pos] if self.pos < len(self.tokens) else None
def consume(self, expected=None):
tok = self.peek()
if expected and tok and tok[0] != expected:
raise SyntaxError(f'Expected {expected}, got {tok[0]}')
self.pos += 1
return tok
# expr ::= term (('+' | '-') term)*
def parse_expr(self):
left = self.parse_term()
while self.peek() and self.peek()[0] in ('PLUS', 'MINUS'):
op = self.consume()
right = self.parse_term()
left = ASTNode('binop', op[1], [left, right])
return left
# term ::= factor (('*' | '/') factor)*
def parse_term(self):
left = self.parse_factor()
while self.peek() and self.peek()[0] in ('MUL', 'DIV'):
op = self.consume()
right = self.parse_factor()
left = ASTNode('binop', op[1], [left, right])
return left
# factor ::= NUMBER | '(' expr ')'
def parse_factor(self):
tok = self.peek()
if tok[0] == 'NUMBER':
self.consume('NUMBER')
return ASTNode('number', int(tok[1]))
elif tok[0] == 'LPAREN':
self.consume('LPAREN')
expr = self.parse_expr()
self.consume('RPAREN')
return expr
raise SyntaxError(f'Unexpected token: {tok}')
def parse(self):
return self.parse_expr()
tokens = [('NUMBER','3'),('PLUS','+'),('NUMBER','5'),('MUL','*'),('NUMBER','2')]
parser = Parser(tokens)
ast = parser.parse()
print(f'AST: {ast}')Expected output:
AST: (+ 3 (* 5 2))Operator Precedence
The parser above gives * higher precedence than + because parse_expr calls parse_term first. This correctly evaluates 3 + 5 * 2 as 3 + (5 * 2).
LL vs LR Parsing
| Feature | LL (top-down) | LR (bottom-up) |
|---|---|---|
| Direction | Left-to-right, leftmost derivation | Left-to-right, rightmost derivation |
| Grammar | Must be LL(k) (no left recursion) | Handles left recursion |
| Table size | Smaller | Larger |
| Errors | Early detection | Detects later |
| Tools | ANTLR, handwritten | Yacc, Bison, Bison++ |
Phase 3: Semantic Analysis
The semantic analyser checks meaning beyond syntax — type checking, scope resolution, and declaration verification.
class SymbolTable:
def __init__(self):
self.symbols = {}
self.parent = None
def define(self, name, var_type):
self.symbols[name] = var_type
def lookup(self, name):
if name in self.symbols:
return self.symbols[name]
if self.parent:
return self.parent.lookup(name)
return None
def enter_scope(self):
child = SymbolTable()
child.parent = self
return child
def type_check(node, symtab):
if node.type == 'number':
return 'int'
elif node.type == 'ident':
var_type = symtab.lookup(node.value)
if not var_type:
raise TypeError(f'Undefined variable: {node.value}')
return var_type
elif node.type == 'binop':
left_type = type_check(node.children[0], symtab)
right_type = type_check(node.children[1], symtab)
if left_type != right_type:
raise TypeError(
f'Type mismatch: {left_type} vs {right_type}')
return left_type
elif node.type == 'assign':
var_type = symtab.lookup(node.children[0].value)
val_type = type_check(node.children[1], symtab)
if var_type != val_type:
raise TypeError(
f'Cannot assign {val_type} to {var_type}')
return var_type
symtab = SymbolTable()
symtab.define('x', 'int')
symtab.define('y', 'string')
ast1 = ASTNode('assign', '=', [
ASTNode('ident', 'x'),
ASTNode('number', 42),
])
print(type_check(ast1, symtab)) # int
try:
ast2 = ASTNode('assign', '=', [
ASTNode('ident', 'y'),
ASTNode('number', 42),
])
print(type_check(ast2, symtab))
except TypeError as e:
print(f'Type error: {e}')Expected output:
int
Type error: Cannot assign int to stringPhase 4: Intermediate Representation (IR)
The compiler generates an IR — a platform-independent representation. Common forms:
- Three-address code:
t1 = a + b; t2 = t1 * c; - SSA (Static Single Assignment): each variable assigned exactly once
- Stack-based: like Java bytecode
def generate_ir(node, temp_counter=[0]):
if node.type == 'number':
return str(node.value)
elif node.type == 'binop':
left = generate_ir(node.children[0], temp_counter)
right = generate_ir(node.children[1], temp_counter)
temp_counter[0] += 1
t = f't{temp_counter[0]}'
print(f'{t} = {left} {node.value} {right}')
return t
print('Three-address code for (3 + 5) * 2:')
ast = ASTNode('binop', '*', [
ASTNode('binop', '+', [ASTNode('number', 3), ASTNode('number', 5)]),
ASTNode('number', 2),
])
generate_ir(ast)Expected output:
Three-address code for (3 + 5) * 2:
t1 = 3 + 5
t2 = t1 * 2Phase 5: Optimisation Passes
| Optimisation | Description | Example |
|---|---|---|
| Constant folding | Evaluate constant expressions at compile time | 2 + 3 → 5 |
| Dead code elimination | Remove unreachable code | return 1; x = 2; → return 1; |
| Loop unrolling | Duplicate loop body to reduce branching | for i in 0..2: f(i) → f(0); f(1); f(2) |
| Common subexpression elimination | Reuse computed values | a = x + y; b = x + y → t = x + y; a = t; b = t |
| Inline expansion | Replace function call with body | square(3) → 3 * 3 |
def constant_fold(node):
"""Fold constant expressions in the AST"""
if node.type == 'binop':
left = constant_fold(node.children[0])
right = constant_fold(node.children[1])
if left.type == 'number' and right.type == 'number':
ops = {'+': lambda a, b: a + b,
'-': lambda a, b: a - b,
'*': lambda a, b: a * b,
'/': lambda a, b: a // b}
result = ops[node.value](left.value, right.value)
return ASTNode('number', result)
return ASTNode('binop', node.value, [left, right])
return node
ast = ASTNode('binop', '+', [
ASTNode('binop', '*', [ASTNode('number', 2), ASTNode('number', 3)]),
ASTNode('number', 5),
])
print(f'Before: {ast}')
optimised = constant_fold(ast)
print(f'After: {optimised}')Expected output:
Before: (+ (* 2 3) 5)
After: 11Phase 6: Code Generation
The final phase emits target machine code or bytecode. A simple stack-based VM:
def compile_to_vm(node):
"""Generate stack-based VM instructions"""
code = []
if node.type == 'number':
code.append(f'PUSH {node.value}')
elif node.type == 'binop':
code.extend(compile_to_vm(node.children[0]))
code.extend(compile_to_vm(node.children[1]))
op_map = {'+': 'ADD', '-': 'SUB', '*': 'MUL', '/': 'DIV'}
code.append(op_map[node.value])
return code
ast = ASTNode('binop', '+', [
ASTNode('number', 3),
ASTNode('binop', '*', [
ASTNode('number', 5),
ASTNode('number', 2),
]),
])
instructions = compile_to_vm(ast)
print('VM Instructions:')
for i, instr in enumerate(instructions):
print(f' {i:3d}: {instr}')
print(f'\nResult: {3 + 5 * 2}')Expected output:
VM Instructions:
0: PUSH 3
1: PUSH 5
2: PUSH 2
3: MUL
4: ADD
Result: 13Common Mistakes
1. Left recursion in recursive descent parsers
A rule like expr ::= expr + term causes infinite recursion. Use right recursion or left-factoring instead.
2. Confusing lexer and parser roles
The lexer produces tokens; the parser builds structure. Don’t try to balance parentheses or check syntax in the lexer.
3. Ignoring operator precedence
Without precedence handling, 3 + 5 * 2 evaluates as (3 + 5) * 2. The grammar must encode precedence through non-terminal hierarchy.
4. Not handling errors gracefully
A good compiler reports all errors in one pass and recovers to find more. Stopping at the first error frustrates users.
5. Over-optimising
Aggressive inlining and loop unrolling can bloat binary size and increase compile time with minimal runtime benefit.
6. Forgetting the symbol table in semantic analysis
Without tracking scopes, variable shadowing and type checking break. Each scope needs its own symbol table with parent linkage.
Practice Questions
What does a lexer produce? A stream of tokens — each token is a (type, value) pair representing keywords, identifiers, operators, or literals.
What’s the difference between LL and LR parsing? LL parsers produce a leftmost derivation and work top-down. LR parsers produce a rightmost derivation and work bottom-up. LR handles more grammars.
What is an Abstract Syntax Tree? A tree representation of the program’s syntactic structure, omitting concrete syntax details like parentheses and semicolons.
Why is intermediate code useful? It decouples the frontend (language-specific) from the backend (machine-specific), enabling support for multiple source and target languages.
What is constant folding? Evaluating constant expressions at compile time.
x = 60 * 60 * 24becomesx = 86400, saving runtime computation.
Challenge
Extend the recursive descent parser to support variable declarations (let x = 5;) and print statements (print x;). Generate three-address code for both.
Real-World Task
Run gcc -O2 -S -fopt-info program.c on a small C program. Examine the optimisation dump — what passes ran? How did constant folding or dead code elimination change the output?
FAQ
Mini Project: Simple Expression Compiler
Build a compiler for arithmetic expressions:
- Lexer: tokenises numbers, operators, parentheses
- Parser: builds AST with correct precedence
- Code generator: emits stack-based VM instructions
- VM: executes the instructions and returns the result
Security angle: Compiler hardening (stack canaries, ASLR, CFI) prevents exploits. These are the same protections Durga Antivirus Pro checks during vulnerability scanning.
What’s Next
Before moving on, you should understand:
- The phases of a compiler: lexer → parser → semantic analysis → IR → code generation
- Recursive descent parsing and operator precedence
- How SSA and three-address code represent programs
- Common optimisation passes
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro