Compiler Design Explained — Lexical Analysis, Parsing & Code Generation
A compiler translates high-level source code into low-level machine code — performing lexical analysis, parsing, semantic analysis, optimization, and code generation.
What You’ll Learn
In this tutorial, you’ll learn each phase of a compiler, the difference between top-down and bottom-up parsing, how an Abstract Syntax Tree (AST) is built, and how code optimization works.
Why It Matters
Compilers transform the code you write into something a computer can execute. Understanding compiler design helps you write more efficient code, debug obscure errors, and design domain-specific languages. Every framework, linter, and transpiler uses compiler techniques.
Real-World Use
When you write TypeScript, the TypeScript compiler (tsc) lexes, parses, type-checks, and emits JavaScript. Python’s CPython interpreter compiles source to bytecode before execution. Durga Antivirus Pro uses pattern-matching techniques similar to lexical analysis to scan files for threat signatures.
graph LR A[Source Code] --> B[Lexical Analysis] B --> C[Token Stream] C --> D[Syntax Analysis / Parsing] D --> E[Abstract Syntax Tree] E --> F[Semantic Analysis] F --> G[Intermediate Code Gen] G --> H[Optimization] H --> I[Code Generation] I --> J[Machine Code] style A fill:#4f46e5,color:#fff style J fill:#4f46e5,color:#fff
Phase 1: Lexical Analysis
The lexer (or tokenizer) reads the source code character by character and groups them into tokens — meaningful units like keywords, identifiers, operators, and literals.
import re
class Lexer:
def __init__(self, source):
self.source = source
self.pos = 0
self.tokens = []
self.token_patterns = [
("NUMBER", r"\d+(\.\d*)?"),
("PLUS", r"\+"),
("MINUS", r"-"),
("MULTIPLY", r"\*"),
("DIVIDE", r"/"),
("LPAREN", r"\("),
("RPAREN", r"\)"),
("IDENTIFIER", r"[a-zA-Z_][a-zA-Z0-9_]*"),
("SKIP", r"[ \t\n]+"),
("ERROR", r"."),
]
def tokenize(self):
while self.pos < len(self.source):
match = None
for token_type, pattern in self.token_patterns:
regex = re.compile(pattern)
match = regex.match(self.source, self.pos)
if match:
text = match.group(0)
if token_type != "SKIP" and token_type != "ERROR":
self.tokens.append((token_type, text))
elif token_type == "ERROR":
raise SyntaxError(f"Unexpected char: {text}")
self.pos = match.end()
break
return self.tokens
lexer = Lexer("3 + 5 * (10 - 2)")
tokens = lexer.tokenize()
print(tokens)Expected output:
[('NUMBER', '3'), ('PLUS', '+'), ('NUMBER', '5'), ('MULTIPLY', '*'), ('LPAREN', '('), ('NUMBER', '10'), ('MINUS', '-'), ('NUMBER', '2'), ('RPAREN', ')')]Phase 2: Syntax Analysis (Parsing)
The parser takes tokens and builds an Abstract Syntax Tree (AST) — a tree representation of the program’s grammatical structure.
Top-Down Parsing (Recursive Descent)
Starts from the start symbol and applies production rules. Easy to write by hand for simple grammars.
Bottom-Up Parsing (LR)
Starts from the input tokens and reduces them to the start symbol using a stack. Handles more complex grammars. Tools like Yacc and Bison generate bottom-up parsers.
class ASTNode:
def __init__(self, type, value=None, left=None, right=None):
self.type = type
self.value = value
self.left = left
self.right = right
def __repr__(self):
if self.type == "NUMBER":
return str(self.value)
return f"({self.left} {self.value} {self.right})"
class Parser:
def __init__(self, tokens):
self.tokens = tokens
self.pos = 0
def peek(self):
return self.tokens[self.pos] if self.pos < len(self.tokens) else None
def consume(self, expected_type=None):
token = self.peek()
if expected_type and token and token[0] != expected_type:
raise SyntaxError(f"Expected {expected_type}, got {token}")
self.pos += 1
return token
def parse_expression(self):
# expr ::= term ( (+|-) term )*
left = self.parse_term()
while self.peek() and self.peek()[0] in ("PLUS", "MINUS"):
op = self.consume()
right = self.parse_term()
left = ASTNode("BINOP", op[1], left, right)
return left
def parse_term(self):
# term ::= factor ( (*|/) factor )*
left = self.parse_factor()
while self.peek() and self.peek()[0] in ("MULTIPLY", "DIVIDE"):
op = self.consume()
right = self.parse_factor()
left = ASTNode("BINOP", op[1], left, right)
return left
def parse_factor(self):
# factor ::= NUMBER | ( expr )
token = self.peek()
if token[0] == "NUMBER":
self.consume("NUMBER")
return ASTNode("NUMBER", int(token[1]))
elif token[0] == "LPAREN":
self.consume("LPAREN")
expr = self.parse_expression()
self.consume("RPAREN")
return expr
raise SyntaxError(f"Unexpected token: {token}")
def parse(self):
return self.parse_expression()
def evaluate(node):
if node.type == "NUMBER":
return node.value
left = evaluate(node.left)
right = evaluate(node.right)
if node.value == "+": return left + right
if node.value == "-": return left - right
if node.value == "*": return left * right
if node.value == "/": return left / right
tokens = [("NUMBER", "3"), ("PLUS", "+"), ("NUMBER", "5"), ("MULTIPLY", "*"),
("NUMBER", "2")]
parser = Parser(tokens)
ast = parser.parse()
print(f"AST: {ast}")
result = evaluate(ast)
print(f"Result: {result}")Expected output:
AST: (3 + (5 * 2))
Result: 13Phase 3: Semantic Analysis
The semantic analyzer checks for meaning-level errors:
- Type checking (adding string + number?)
- Scope resolution (is this variable defined?)
- Declaration checking (function called before definition?)
Phase 4: Intermediate Code Generation and Optimization
The compiler generates an intermediate representation (IR) — a platform-independent code. Optimization passes then improve the IR:
| Optimization | Description |
|---|---|
| Constant folding | 2 + 3 → 5 at compile time |
| Dead code elimination | Remove unreachable code |
| Loop unrolling | Duplicate loop body to reduce branching |
| Inline expansion | Replace function call with function body |
| Common subexpression elimination | Reuse computed values |
Phase 5: Code Generation
The final phase generates target machine code (x86, ARM, WebAssembly, or JVM bytecode). This involves register allocation, instruction selection, and instruction scheduling.
Common Mistakes
- Confusing lexical and syntax analysis: The lexer produces tokens; the parser builds structure. Don’t try to parse in the lexer.
- Left recursion in top-down parsers: A rule like
expr ::= expr + termcauses infinite recursion. Use left-factoring to eliminate it. - Ignoring operator precedence:
3 + 5 * 2should be3 + (5 * 2), not(3 + 5) * 2. The parser must enforce precedence. - Not handling errors gracefully: A good compiler reports all errors in one pass, not stopping at the first one.
- Over-optimizing: Aggressive optimizations can increase compile time and binary size without measurable runtime benefit.
Practice Questions
What does a lexer do? Converts source code characters into tokens (keywords, identifiers, operators, literals).
What’s the difference between top-down and bottom-up parsing? Top-down (recursive descent) starts from the start symbol and expands. Bottom-up (LR) starts from input and reduces to the start symbol.
What is an AST? An Abstract Syntax Tree represents the grammatical structure of code without concrete syntax details like parentheses or semicolons.
Why is intermediate code useful? It decouples the frontend (language-specific) from the backend (machine-specific), enabling support for multiple source and target languages.
What is constant folding? Evaluating constant expressions at compile time instead of runtime.
x = 60 * 60 * 24becomesx = 86400.
Challenge
Extend the expression parser to support variables and assignment. Add an identifier token type and a symbol table to store variable values.
Real-World Task
Run gcc -O2 -S program.c on a simple C program. Look at the assembly output. Can you spot any optimizations (constant folding, dead code elimination)?
Mini Project: Simple Expression Compiler
Build a four-phase compiler for arithmetic expressions: lexer → parser → code generator → output. Generate stack-based virtual machine instructions (like Java bytecode) for expressions.
Security angle: Compiler hardening techniques (stack canaries, ASLR, control-flow integrity) prevent memory corruption exploits. These are the same protections Durga Antivirus Pro checks for during vulnerability scanning.
What’s Next
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
What’s Next
Congratulations on completing this Compiler Design tutorial! Here’s where to go from here:
- Practice daily — Consistency is more important than long study sessions
- Build a project — Apply what you learned by building something real
- Explore related topics — Check out other tutorials in the same category
- Join the community — Discuss with other learners and share your progress
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro