File Systems Deep Dive — ext4, NTFS, APFS, Btrfs, ZFS & VFS
A file system controls how data is stored, organised, and retrieved on disk. The choice of file system affects performance, reliability, data integrity, and maximum file size.
What You’ll Learn
In this tutorial, you’ll learn how modern file systems work: ext4’s inode structure and journaling, the NTFS Master File Table and B-tree indexing, APFS copy-on-write and snapshots, FAT32/exFAT for portability, Btrfs and ZFS advanced features, the Virtual File System (VFS) abstraction layer, hard links vs symbolic links, and disk partitioning.
Why It Matters
File system corruption can mean data loss. Understanding how your OS stores files helps you choose the right file system, diagnose disk problems, and recover data. When a database reports “disk full” while free space remains, the cause is often at the file system level, such as inode exhaustion or space held by snapshots.
Real-World Use
When you duplicate a file on a Mac, APFS clones it instantly without copying its data blocks. On Windows, Transactional NTFS (TxF) was designed so that a group of file operations either completes fully or rolls back. Durga Antivirus Pro monitors file system events using inotify to scan new files as they’re created.
graph TD
    subgraph "VFS (Virtual File System)"
        VFS[System Call Interface]
    end
    subgraph "File System Implementations"
        EXT4[ext4]
        NTFS[NTFS]
        APFS[APFS]
        BTRFS[Btrfs]
        ZFS[ZFS]
        FAT[FAT32/exFAT]
    end
    subgraph "Block Layer"
        BLK[Block Device Layer]
    end
    subgraph "Storage"
        SSD[SATA SSD]
        HDD[HDD]
        NVMe[NVMe]
    end
    VFS --> EXT4
    VFS --> NTFS
    VFS --> APFS
    VFS --> BTRFS
    VFS --> ZFS
    VFS --> FAT
    EXT4 --> BLK
    NTFS --> BLK
    APFS --> BLK
    BTRFS --> BLK
    ZFS --> BLK
    FAT --> BLK
    BLK --> SSD
    BLK --> HDD
    BLK --> NVMe
ext4 — Extended File System
ext4 is the default file system for most Linux distributions. It uses inodes to store file metadata and extents for block allocation.
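To make the idea of extents concrete, here is a minimal sketch. It is not ext4's on-disk format: an extent is modelled as a hypothetical (logical_start, length, physical_start) triple, and the names Extent and lookup are assumptions made for illustration. The point is that one extent describes a whole run of contiguous blocks, so a large file needs only a few entries instead of one pointer per block.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Extent:
    """A contiguous run of blocks (illustrative, not the real on-disk layout)."""
    logical_start: int   # first block number within the file
    length: int          # number of contiguous blocks
    physical_start: int  # first block number on disk


def lookup(extents: List[Extent], logical_block: int) -> Optional[int]:
    """Translate a file-relative block number to a disk block number."""
    for ext in extents:
        offset = logical_block - ext.logical_start
        if 0 <= offset < ext.length:
            return ext.physical_start + offset
    return None  # no extent covers this block


# A 300-block file described by two extents instead of 300 block pointers.
file_extents = [Extent(0, 200, 5000), Extent(200, 100, 9000)]

print(lookup(file_extents, 0))    # 5000
print(lookup(file_extents, 199))  # 5199
print(lookup(file_extents, 250))  # 9050
print(lookup(file_extents, 300))  # None
```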
Inodes
Each file has an inode containing metadata: permissions, timestamps, owner, size, and pointers to data blocks. The filename is stored separately in the directory entry.
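You can inspect real inode metadata from Python with os.stat. The sketch below creates a throwaway file in a temporary directory (the file name and size are arbitrary choices) and prints a few of the fields the inode holds. Notice that the filename is not among them.

```python
import os
import stat
import tempfile

# Create a throwaway file so the example does not depend on any fixed path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'readme.txt')
    with open(path, 'wb') as f:
        f.write(b'x' * 2048)

    info = os.stat(path)
    inode_fields = {
        'inode': info.st_ino,                 # inode number
        'mode': stat.filemode(info.st_mode),  # file type + permissions
        'links': info.st_nlink,               # number of directory entries
        'size': info.st_size,                 # size in bytes
    }
    for key, value in inode_fields.items():
        print(f'{key:6s} {value}')
```

The inode number and permission string vary by system, so no fixed output is shown here.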
import time


class Ext4Inode:
    """Simplified in-memory model of an inode (not the on-disk layout)."""

    def __init__(self, inode_num, mode, size, blocks):
        self.inode_num = inode_num
        self.mode = mode  # file type + permissions
        self.size = size
        self.blocks = blocks
        self.atime = time.time()
        self.mtime = time.time()
        self.ctime = time.time()
        self.links_count = 1

    def __repr__(self):
        return (f'Inode {self.inode_num}: '
                f'mode={oct(self.mode)} size={self.size}B '
                f'blocks={self.blocks}')


class Ext4Directory:
    def __init__(self):
        self.entries = {}  # name -> inode_num

    def create_file(self, name, inode_num, size=0):
        self.entries[name] = inode_num
        # Round the size up to whole 4096-byte blocks
        return Ext4Inode(inode_num, 0o100644, size, (size + 4095) // 4096)

    def ls(self):
        for name, inode in sorted(self.entries.items()):
            print(f'{name:20s} → inode {inode}')


# Simulate creating files
root = Ext4Directory()
inodes = {}
inodes[1] = root.create_file('readme.txt', 1, 2048)
inodes[2] = root.create_file('script.sh', 2, 512)
inodes[3] = root.create_file('data.csv', 3, 16384)
# Built directly, so it has no directory entry and does not appear in ls()
inodes[4] = Ext4Inode(4, 0o100755, 12345, 4)
root.ls()
print(f'\n{inodes[3]}')
print(f'Blocks on disk: {inodes[3].blocks}')

Expected output:
data.csv             → inode 3
readme.txt           → inode 1
script.sh            → inode 2

Inode 3: mode=0o100644 size=16384B blocks=4
Blocks on disk: 4
Journaling
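ext3 and ext4 use journaling to keep the file system consistent after a crash. Metadata changes are first written to the journal and only then applied in place. If a crash occurs, the journal is replayed on the next mount. The data= mount option selects one of three modes:
- Journal mode (data=journal): both data and metadata are journaled (slowest, safest)
- Ordered mode (data=ordered, the default): only metadata is journaled, but data blocks are written to disk before the metadata that references them is committed
- Writeback mode (data=writeback): only metadata is journaled and data may be written in any order (fastest, but files can contain stale data after a crash)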
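The log-then-replay idea behind journaling can be sketched in a few lines. This is a toy model under stated assumptions, not the format ext4's journal actually uses: the class JournaledMetadata and its record tuples are invented for illustration. A transaction counts only once its commit record is in the journal, so a crash before the commit leaves no trace after replay.

```python
class JournaledMetadata:
    """Toy write-ahead journal for metadata updates (illustrative only)."""

    def __init__(self):
        self.journal = []    # append-only log of records
        self.metadata = {}   # the "in-place" metadata, e.g. name -> size

    def log_transaction(self, txn_id, updates, commit=True):
        """Write updates to the journal; a commit record makes them durable."""
        for key, value in updates.items():
            self.journal.append(('update', txn_id, key, value))
        if commit:
            self.journal.append(('commit', txn_id))

    def replay(self):
        """On mount after a crash: apply only transactions that committed."""
        committed = {rec[1] for rec in self.journal if rec[0] == 'commit'}
        for rec in self.journal:
            if rec[0] == 'update' and rec[1] in committed:
                _, _, key, value = rec
                self.metadata[key] = value
        self.journal.clear()  # checkpoint: journal space can be reused


fs = JournaledMetadata()
fs.log_transaction(1, {'a.txt': 100})
fs.log_transaction(2, {'b.txt': 200}, commit=False)  # crash before commit
fs.replay()
print(fs.metadata)  # {'a.txt': 100}
```

Transaction 2 is discarded because its commit record never reached the journal, which is what keeps half-finished updates from corrupting the metadata.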
NTFS — New Technology File System
NTFS is the primary file system for Windows. It is organised around the Master File Table (MFT), a table of fixed-size file records; directories index their entries with B-trees.
MFT Structure
Every file and directory has one or more MFT records (typically 1 KB each). Small files fit entirely within the MFT record (resident data). Larger files keep their data outside the MFT in extents, which NTFS calls data runs (non-resident data).
class NTFSMFTEntry:
    def __init__(self, record_num, filename, is_directory=False):
        self.record_num = record_num
        self.filename = filename
        self.is_directory = is_directory
        self.attributes = {}
        self.resident_data = b''

    def add_attribute(self, attr_type, data, resident=True):
        self.attributes[attr_type] = {
            'type': attr_type,
            'resident': resident,
            'size': len(data) if isinstance(data, bytes) else data
        }
        if resident and isinstance(data, bytes):
            self.resident_data = data

    def __repr__(self):
        attrs = ', '.join(self.attributes.keys())
        return (f'MFT Entry {self.record_num}: '
                f'{self.filename} [{attrs}]')


# Simulate MFT entries
mft = {}
mft[0] = NTFSMFTEntry(0, '$MFT')
mft[0].add_attribute('STANDARD_INFORMATION', '...')
mft[0].add_attribute('FILE_NAME', '$MFT')
mft[5] = NTFSMFTEntry(5, 'document.docx')
mft[5].add_attribute('STANDARD_INFORMATION', '...')
mft[5].add_attribute('FILE_NAME', 'document.docx')
mft[5].add_attribute('DATA', 1024 * 1024, resident=False)  # 1MB
mft[6] = NTFSMFTEntry(6, 'notes.txt')
mft[6].add_attribute('STANDARD_INFORMATION', '...')
mft[6].add_attribute('FILE_NAME', 'notes.txt')
mft[6].add_attribute('DATA', b'Hello, NTFS!', resident=True)
for entry in mft.values():
    print(entry)

Expected output:
MFT Entry 0: $MFT [STANDARD_INFORMATION, FILE_NAME]
MFT Entry 5: document.docx [STANDARD_INFORMATION, FILE_NAME, DATA]
MFT Entry 6: notes.txt [STANDARD_INFORMATION, FILE_NAME, DATA]
APFS — Apple File System
APFS is the default on macOS and iOS. Key features:
- Copy-on-write (CoW): when data is modified, the new data is written to a new block. The old block is freed only after all references are removed.
- Snapshots: instant point-in-time read-only copies of the volume
- Space sharing: multiple volumes share the same free space pool
- Cloning: instant copies of files without duplicating data blocks
class APFSBlock:
    def __init__(self, block_id, data=b''):
        self.block_id = block_id
        self.data = data
        self.ref_count = 1


class APFSFile:
    def __init__(self, name, data_blocks):
        self.name = name
        self.data_blocks = data_blocks  # list of block IDs

    def modify(self, offset, new_data, block_allocator):
        """Copy-on-write: write to a new block, then drop our reference to the old one"""
        old_id = self.data_blocks[offset]  # offset is an index into data_blocks
        new_id = block_allocator.allocate(new_data)
        self.data_blocks[offset] = new_id
        block_allocator.decrement_ref(old_id)
        return old_id  # survives only while a snapshot or clone still references it


class APFSBlockAllocator:
    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def allocate(self, data):
        block = APFSBlock(self.next_id, data)
        self.blocks[self.next_id] = block
        self.next_id += 1
        return block.block_id

    def decrement_ref(self, block_id):
        self.blocks[block_id].ref_count -= 1
        if self.blocks[block_id].ref_count == 0:
            print(f' Freed block {block_id} (no more references)')


# Simulate CoW
allocator = APFSBlockAllocator()
file = APFSFile('document.txt', [
    allocator.allocate(b'Hello World!'),
    allocator.allocate(b'Second block'),
])
print(f'Before modification: blocks {file.data_blocks}')
old_block = file.modify(0, b'Modified data', allocator)
print(f'After CoW modification: blocks {file.data_blocks}')
print(f'Old block {old_block} was freed; a snapshot or clone reference would have kept it')

Expected output:
Before modification: blocks [0, 1]
 Freed block 0 (no more references)
After CoW modification: blocks [2, 1]
Old block 0 was freed; a snapshot or clone reference would have kept it
FAT32 and exFAT
FAT32 (File Allocation Table, 32-bit) is supported by almost every OS, but no single file can exceed 4 GiB minus one byte.
exFAT extends FAT for large files and flash media (it is the standard format for SD cards larger than 32 GB). It has no journaling but is widely supported, which makes it a good choice for USB drives shared across Windows, macOS, and Linux.
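The FAT32 limit follows directly from its 32-bit file size field. As a small worked check (the constant and helper names below are illustrative, not part of any library):

```python
FAT32_MAX_FILE_SIZE = 2**32 - 1  # largest value a 32-bit size field can hold


def fits_on_fat32(size_bytes: int) -> bool:
    """Return True if a single file of this size can be stored on FAT32."""
    return 0 <= size_bytes <= FAT32_MAX_FILE_SIZE


print(FAT32_MAX_FILE_SIZE)              # 4294967295
print(fits_on_fat32(4 * 1024**3 - 1))   # True
print(fits_on_fat32(4 * 1024**3))       # False
```

A file of exactly 4 GiB is one byte too large, which is why a typical DVD image or long video fails to copy to a FAT32 drive.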
Btrfs and ZFS
Both are advanced copy-on-write file systems with built-in volume management.
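| Feature | Btrfs | ZFS |
|---|---|---|
| Copy-on-write | Yes | Yes |
| Snapshots | Yes (writable by default, optionally read-only) | Yes (read-only; writable copies via clones) |
| Compression | zlib, lzo, zstd | lz4, gzip, zle, lzjb, zstd (OpenZFS 2.0+) |
| RAID | 0, 1, 10, 5, 6 (5/6 not recommended for production) | stripe, mirror, RAID-Z1, RAID-Z2, RAID-Z3 |
| Deduplication | Yes (out-of-band, via userspace tools) | Yes (inline) |
| Checksumming | CRC-32C (default) | Fletcher-4 (default), SHA-256 |
| Max volume size | 16 EiB | 2^128 bytes (theoretical pool limit) |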
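The checksumming row is what lets these file systems detect silent corruption. Here is a minimal sketch of the idea, assuming a hypothetical ChecksummedStore class: each block's checksum is stored separately from the data and verified on every read. SHA-256 is used because it is in Python's standard library; the real on-disk structures of Btrfs and ZFS are far more involved.

```python
import hashlib


class ChecksummedStore:
    """Toy block store that verifies a checksum on every read."""

    def __init__(self):
        self.blocks = {}     # block_id -> bytes
        self.checksums = {}  # block_id -> hex digest, kept apart from the data

    def write(self, block_id, data):
        self.blocks[block_id] = data
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()

    def read(self, block_id):
        data = self.blocks[block_id]
        if hashlib.sha256(data).hexdigest() != self.checksums[block_id]:
            raise IOError(f'checksum mismatch in block {block_id}')
        return data


store = ChecksummedStore()
store.write(0, b'important data')
print(store.read(0))  # b'important data'

# Simulate silent corruption (bit rot) behind the file system's back.
store.blocks[0] = b'importent data'
try:
    store.read(0)
except IOError as exc:
    print(f'Detected: {exc}')  # Detected: checksum mismatch in block 0
```

With a redundant copy available (a mirror, for example), a file system could go one step further and repair the bad block from the good one.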
VFS — Virtual File System
VFS provides a common interface for all file systems. System calls like open(), read(), write() go through VFS, which dispatches to the specific file system’s implementation.
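The dispatch step can be sketched as follows. This is a simplified model, not the kernel's implementation: the FileSystem interface, the InMemoryFS class, and the mount table are assumptions made for illustration. The VFS picks the longest mount point that prefixes the path and forwards the call to whichever file system is mounted there.

```python
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Operations every concrete file system must provide (illustrative)."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class InMemoryFS(FileSystem):
    def __init__(self, label, files):
        self.label = label
        self.files = files  # path relative to the mount point -> bytes

    def read(self, path):
        return self.files[path]


class VFS:
    def __init__(self):
        self.mounts = {}  # mount point -> FileSystem

    def mount(self, mount_point, fs):
        self.mounts[mount_point.rstrip('/') or '/'] = fs

    def resolve(self, path):
        """Pick the longest mount point that is a prefix of the path."""
        best = None
        for mp in self.mounts:
            prefix = mp if mp.endswith('/') else mp + '/'
            if path == mp or path.startswith(prefix):
                if best is None or len(mp) > len(best):
                    best = mp
        if best is None:
            raise FileNotFoundError(path)
        relative = path[len(best):].lstrip('/')
        return self.mounts[best], relative

    def read(self, path):
        fs, relative = self.resolve(path)
        return fs.read(relative)  # dispatch to the concrete implementation


vfs = VFS()
vfs.mount('/', InMemoryFS('ext4', {'etc/hostname': b'demo-host'}))
vfs.mount('/mnt/usb', InMemoryFS('exfat', {'photo.jpg': b'\xff\xd8'}))

print(vfs.read('/etc/hostname'))                   # b'demo-host'
print(vfs.read('/mnt/usb/photo.jpg'))              # b'\xff\xd8'
print(vfs.resolve('/mnt/usb/photo.jpg')[0].label)  # exfat
```

The caller uses the same read() call for both paths and never needs to know which file system sits behind each mount point.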
class VFSNode:
    def __init__(self, name, is_directory=False):
        self.name = name
        self.is_directory = is_directory
        self.children = {}
        self.data = b''

    def open(self, path):
        # Skip empty components so that '/' resolves to this node itself
        parts = [p for p in path.strip('/').split('/') if p]
        node = self
        for part in parts:
            if part in node.children:
                node = node.children[part]
            else:
                raise FileNotFoundError(path)
        return node

    def read(self):
        if self.is_directory:
            return list(self.children.keys())
        return self.data

    def write(self, data):
        self.data = data
        return len(data)


# Simulate VFS
root = VFSNode('/', is_directory=True)
home = VFSNode('home', is_directory=True)
root.children['home'] = home
readme = VFSNode('readme.txt')
readme.write(b'Hello from VFS!')
home.children['readme.txt'] = readme
node = root.open('/home/readme.txt')
print(f'Opened: /home/{node.name}')
print(f'Content: {node.read()}')
node2 = root.open('/home')
print(f'Directory listing: {node2.read()}')

Expected output:
Opened: /home/readme.txt
Content: b'Hello from VFS!'
Directory listing: ['readme.txt']
Hard Links vs Symbolic Links
| Feature | Hard Link | Symbolic Link |
|---|---|---|
| Points to | Inode | Path |
| Across file systems | No | Yes |
| Directory links | No (usually) | Yes |
| If the original name is deleted | Data still accessible | Broken (dangling) link |
| Size | Same as the original (same inode) | Length of the stored path |
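You can observe these differences on a real file system with os.link and os.symlink. The sketch below assumes a POSIX system (on Windows, creating symlinks may need extra privileges) and works inside a temporary directory, so the file names are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'original.txt')
    hard = os.path.join(tmp, 'hardlink.txt')
    soft = os.path.join(tmp, 'symlink.txt')

    with open(original, 'w') as f:
        f.write('Hello!')

    os.link(original, hard)     # second directory entry for the same inode
    os.symlink(original, soft)  # separate file whose content is a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink
    print(f'same inode: {same_inode}, link count: {link_count}')

    os.remove(original)
    with open(hard) as f:
        hard_content = f.read()                # data is still reachable
    symlink_broken = not os.path.exists(soft)  # exists() follows the link
    print(f'hard link reads {hard_content!r}, symlink broken: {symlink_broken}')
```

Removing the original name only lowers the inode's link count from 2 to 1, so the data survives through the hard link, while the symbolic link is left pointing at a path that no longer exists.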
class SimulatedFS:
    def __init__(self):
        self.inodes = {}
        self.dir_entries = {}
        self.next_inode = 1  # never reuse a number, even after deletions

    def create_file(self, path, data):
        inode_num = self.next_inode
        self.next_inode += 1
        self.inodes[inode_num] = {'data': data, 'links': 1}
        self.dir_entries[path] = inode_num
        return inode_num

    def hard_link(self, src, dst):
        if src in self.dir_entries:
            inode = self.dir_entries[src]
            self.dir_entries[dst] = inode
            self.inodes[inode]['links'] += 1
            print(f'Hard link: {dst} → {src} (same inode {inode})')

    def sym_link(self, src, dst):
        # Symbolic link stores the path string
        self.dir_entries[dst] = f'SYMLINK→{src}'
        print(f'Sym link: {dst} → {src} (path: {src})')

    def delete(self, path):
        # Remove the directory entry; free the inode when its last link goes
        entry = self.dir_entries.pop(path)
        if isinstance(entry, int):
            self.inodes[entry]['links'] -= 1
            if self.inodes[entry]['links'] == 0:
                del self.inodes[entry]
                print(f'Inode {entry} freed')


fs = SimulatedFS()
fs.create_file('/original.txt', b'Hello!')
fs.hard_link('/original.txt', '/hardlink.txt')
fs.sym_link('/original.txt', '/symlink.txt')
print('\nDelete /original.txt')
fs.delete('/original.txt')
print(f'Hard link still works: inode {fs.dir_entries["/hardlink.txt"]}')
print(f'Sym link: {fs.dir_entries["/symlink.txt"]} (broken!)')
Common Mistakes
1. Confusing inodes and filenames
An inode stores metadata; the filename is in the directory entry. Multiple filenames (hard links) can point to the same inode.
2. Using FAT32 for files larger than 4GB
FAT32 has a 4GB maximum file size. Use exFAT or NTFS for large files on external drives.
3. Not considering file system in database performance
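Databases issue many small in-place random writes, which copy-on-write file systems (Btrfs, ZFS) turn into new block allocations and fragmentation. ext4 in ordered mode often performs better unless the CoW file system is tuned, for example by disabling CoW for database files on Btrfs (nodatacow) or matching the ZFS recordsize to the database page size.
4. Running out of inodes
A file system created with 1M inodes cannot hold more than 1M files, even if free space remains. ext4 fixes the inode count when the file system is created; the default is usually generous, but a volume that stores many tiny files can still run out.
5. Ignoring snapshots on CoW file systems
ZFS and Btrfs snapshots keep old blocks allocated until the snapshot is deleted. Running out of space even after deleting files is often caused by retained snapshots.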
6. Using hard links across file systems
Hard links can’t cross file system boundaries. Use symbolic links for cross-filesystem references.
Practice Questions
What’s stored in an ext4 inode vs a directory entry? An inode stores metadata (permissions, timestamps, block pointers). A directory entry maps a filename to an inode number.
How does NTFS MFT work? The MFT is a table of fixed-size file records. Each file has at least one record containing attributes (name, data, security, etc.). Small files store data directly in the MFT record, and directories index their entries with B-trees.
What is copy-on-write in APFS? When data is modified, APFS writes new data to a new block instead of overwriting. The old block is kept as long as a snapshot or clone still references it.
Why doesn’t FAT32 support files larger than 4GB? FAT32 uses 32-bit fields for file size, with a maximum value of 2³²-1 = 4,294,967,295 bytes (~4GB).
What’s the difference between VFS and a file system? VFS is the kernel abstraction layer that provides a common interface for all file systems. The file system is the specific implementation (ext4, NTFS).
Challenge
Implement a simple CoW file system in Python with snapshots. When a snapshot is taken, preserve all referenced data blocks. On file modification, allocate new blocks for changed data. Deleting a block should only free it when no snapshot references it.
Real-World Task
On Linux, run stat /etc/passwd to see an inode firsthand. Then use df -i / to check inode usage. Run mount | grep '^/' to see which file systems are mounted.
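If you prefer to check inode usage from a script, os.statvfs exposes roughly the same numbers that df -i reports. This is a minimal sketch for Unix-like systems only; some file systems report zero for the inode totals, so treat the output as indicative.

```python
import os

# os.statvfs is available on Unix-like systems only.
st = os.statvfs('/')

total_inodes = st.f_files   # inodes the file system reports in total
free_inodes = st.f_ffree    # inodes still unused
used_inodes = total_inodes - free_inodes

free_bytes = st.f_bavail * st.f_frsize  # space available to unprivileged users

print(f'inodes: {used_inodes} used of {total_inodes}')
print(f'free space: {free_bytes / 1024**3:.1f} GiB')
```

A volume where free_inodes is close to zero while free_bytes is still large is the situation described under "Running out of inodes".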
Mini Project: File System Simulator
Build a simulated file system that:
- Supports create, read, write, delete, hard link, symlink operations
- Uses inode-based metadata storage
- Implements a simple journal for crash recovery
- Reports space usage and fragmentation
Security angle: File system auditing (monitoring creates, deletes, and permission changes) is critical for intrusion detection. Durga Antivirus Pro uses file system filtering to scan files on creation and detect ransomware behaviour patterns.
What’s Next
Before moving on, you should understand:
- ext4 inode/extent structure and journaling modes
- NTFS MFT records and B-tree directory indexes
- APFS copy-on-write and snapshot semantics
- VFS abstraction layer
- Hard vs symbolic link differences
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.