File Systems Deep Dive — ext4, NTFS, APFS, Btrfs, ZFS & VFS

DodaTech Updated Jun 20, 2026 10 min read

A file system controls how data is stored, organised, and retrieved on disk. The choice of file system affects performance, reliability, data integrity, and maximum file size.

What You’ll Learn

In this tutorial, you’ll learn how modern file systems work: ext4’s inode structure and journaling, the NTFS Master File Table and B-tree indexing, APFS copy-on-write and snapshots, FAT32/exFAT for portability, Btrfs and ZFS advanced features, the Virtual File System (VFS) abstraction layer, hard links vs symbolic links, and disk partitioning.

Why It Matters

File system corruption means data loss. Understanding how your OS stores files helps you choose the right file system, diagnose disk problems, and recover data. When a database reports “disk full” while the disk still shows free space, the cause is often in the file system itself, such as exhausted inodes or space held by old snapshots.

Real-World Use

When you edit a file on an iPhone or Mac, APFS writes the change to new blocks rather than overwriting the old ones, which is what makes its snapshots nearly instant. On Windows, Transactional NTFS (TxF) was designed so that a group of file operations either completes fully or rolls back. Durga Antivirus Pro monitors file system events using inotify to scan new files as they’re created.

graph TD
  subgraph "VFS (Virtual File System)"
    VFS[System Call Interface]
  end
  subgraph "File System Implementations"
    EXT4[ext4]
    NTFS[NTFS]
    APFS[APFS]
    BTRFS[Btrfs]
    ZFS[ZFS]
    FAT[FAT32/exFAT]
  end
  subgraph "Block Layer"
    BLK[Block Device Layer]
  end
  subgraph "Storage"
    SSD[SATA SSD]
    HDD[HDD]
    NVMe[NVMe SSD]
  end
  VFS --> EXT4
  VFS --> NTFS
  VFS --> APFS
  VFS --> BTRFS
  VFS --> ZFS
  VFS --> FAT
  EXT4 --> BLK
  NTFS --> BLK
  APFS --> BLK
  BTRFS --> BLK
  ZFS --> BLK
  FAT --> BLK
  BLK --> SSD
  BLK --> HDD
  BLK --> NVMe

ext4 — Extended File System

ext4 is the default file system for most Linux distributions. It uses inodes to store file metadata and extents for block allocation.
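To make the idea of extents concrete, the sketch below collapses a list of physical block numbers into (start, length) runs. The function name blocks_to_extents and the tuple layout are assumptions chosen for illustration; they are not ext4’s on-disk extent format.

```python
def blocks_to_extents(blocks):
    """Collapse physical block numbers (in file order) into (start, length) runs."""
    extents = []
    for block in blocks:
        # Extend the last run if this block directly follows it
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((block, 1))
    return extents

contiguous = blocks_to_extents([100, 101, 102, 103])
fragmented = blocks_to_extents([100, 101, 250, 251, 252, 90])
print(contiguous)   # [(100, 4)]
print(fragmented)   # [(100, 2), (250, 3), (90, 1)]
```

A contiguous file needs one extent no matter how many blocks it spans, while a fragmented file needs one extent per run.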

Inodes

Each file has an inode containing metadata: permissions, timestamps, owner, size, and pointers to data blocks. The filename is stored separately in the directory entry.

import time

class Ext4Inode:
    def __init__(self, inode_num, mode, size, blocks):
        self.inode_num = inode_num
        self.mode = mode  # file type + permissions
        self.size = size
        self.blocks = blocks
        self.atime = time.time()
        self.mtime = time.time()
        self.ctime = time.time()
        self.links_count = 1

    def __repr__(self):
        return (f'Inode {self.inode_num}: '
                f'mode={oct(self.mode)} size={self.size}B '
                f'blocks={self.blocks}')

class Ext4Directory:
    def __init__(self):
        self.entries = {}  # name -> inode_num

    def create_file(self, name, inode_num, size=0):
        self.entries[name] = inode_num
        return Ext4Inode(inode_num, 0o100644, size, (size + 4095) // 4096)

    def ls(self):
        for name, inode in sorted(self.entries.items(),
                                  key=lambda x: x[0]):
            print(f'{name:20s} → inode {inode}')

# Simulate creating files
root = Ext4Directory()
inodes = {}
inodes[1] = root.create_file('readme.txt', 1, 2048)
inodes[2] = root.create_file('script.sh', 2, 512)
inodes[3] = root.create_file('data.csv', 3, 16384)
inodes[4] = Ext4Inode(4, 0o100755, 12345, 4)

root.ls()
print(f'\n{inodes[3]}')
print(f'Blocks on disk: {inodes[3].blocks}')

Expected output:

data.csv             → inode 3
readme.txt           → inode 1
script.sh            → inode 2

Inode 3: mode=0o100644 size=16384B blocks=4
Blocks on disk: 4

Journaling

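ext3 and ext4 use journaling to prevent corruption after crashes. Metadata changes are first written to the journal and only afterwards applied to their final location on disk. If a crash interrupts the process, the journal is replayed on the next mount, so each change is either fully applied or not applied at all.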

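Journal mode (data=journal): all data and metadata are journaled (slowest, safest)
Ordered mode (data=ordered, the default): only metadata is journaled, but data blocks are written to disk before the metadata that references them is committed
Writeback mode (data=writeback): only metadata is journaled, with no ordering guarantee for data (fastest, but files may contain stale data after a crash)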
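The replay behaviour can be sketched with a toy write-ahead journal. This is a simplified model under stated assumptions: the class JournaledStore and its method names are hypothetical, and the real ext4 journal (jbd2) works on disk blocks rather than Python dictionaries.

```python
class JournaledStore:
    """Toy metadata store with a write-ahead journal (illustrative only)."""

    def __init__(self):
        self.metadata = {}   # "on-disk" metadata: key -> value
        self.journal = []    # transactions: {'ops': [...], 'committed': bool}

    def begin(self, ops):
        """Step 1: record the intended changes in the journal."""
        txn = {'ops': list(ops), 'committed': False}
        self.journal.append(txn)
        return txn

    def commit(self, txn):
        """Step 2: write the commit record; the transaction is now durable."""
        txn['committed'] = True

    def checkpoint(self):
        """Step 3: apply committed transactions in place, then clear the journal."""
        for txn in self.journal:
            if txn['committed']:
                for key, value in txn['ops']:
                    self.metadata[key] = value
        self.journal.clear()

    def recover(self):
        """On mount after a crash: replay committed transactions, drop the rest."""
        replayed = sum(1 for txn in self.journal if txn['committed'])
        discarded = len(self.journal) - replayed
        self.checkpoint()
        return replayed, discarded

jstore = JournaledStore()
t1 = jstore.begin([('inode 12 size', 4096), ('dir /a.txt', 12)])
jstore.commit(t1)
t2 = jstore.begin([('inode 13 size', 8192)])   # crash happens before commit
# Simulated crash: nothing was checkpointed yet
replayed, discarded = jstore.recover()
print(f'replayed={replayed} discarded={discarded}')
print(jstore.metadata)
```

The committed transaction is applied in full and the uncommitted one leaves no trace, which is the all-or-nothing property journaling is meant to provide.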

NTFS — New Technology File System

NTFS is the primary file system for Windows. Its central structure is the Master File Table (MFT), a table of fixed-size file records. Directories are indexed with B-trees of file names for fast lookup.

MFT Structure

Every file and directory has one or more MFT entries (typically 1KB each). Small files fit entirely within the MFT record (resident data). Larger files use extents (non-resident).

class NTFSMFTEntry:
    def __init__(self, record_num, filename, is_directory=False):
        self.record_num = record_num
        self.filename = filename
        self.is_directory = is_directory
        self.attributes = {}
        self.resident_data = b''

    def add_attribute(self, attr_type, data, resident=True):
        self.attributes[attr_type] = {
            'type': attr_type,
            'resident': resident,
            'size': len(data) if isinstance(data, (bytes, str)) else data
        }
        if resident and isinstance(data, bytes):
            self.resident_data = data

    def __repr__(self):
        attrs = ', '.join(self.attributes.keys())
        return (f'MFT Entry {self.record_num}: '
                f'{self.filename} [{attrs}]')

# Simulate MFT entries
mft = {}
mft[0] = NTFSMFTEntry(0, '$MFT')
mft[0].add_attribute('STANDARD_INFORMATION', '...')
mft[0].add_attribute('FILE_NAME', '$MFT')

mft[5] = NTFSMFTEntry(5, 'document.docx')
mft[5].add_attribute('STANDARD_INFORMATION', '...')
mft[5].add_attribute('FILE_NAME', 'document.docx')
mft[5].add_attribute('DATA', 1024 * 1024, resident=False)  # 1MB

mft[6] = NTFSMFTEntry(6, 'notes.txt')
mft[6].add_attribute('STANDARD_INFORMATION', '...')
mft[6].add_attribute('FILE_NAME', 'notes.txt')
mft[6].add_attribute('DATA', b'Hello, NTFS!', resident=True)

for entry in mft.values():
    print(entry)

Expected output:

MFT Entry 0: $MFT [STANDARD_INFORMATION, FILE_NAME]
MFT Entry 5: document.docx [STANDARD_INFORMATION, FILE_NAME, DATA]
MFT Entry 6: notes.txt [STANDARD_INFORMATION, FILE_NAME, DATA]

APFS — Apple File System

APFS is the default on macOS and iOS. Key features:

Copy-on-write (CoW): when data is modified, the new data is written to a new block. The old block is freed only after all references are removed.
Snapshots: instant point-in-time read-only copies of the volume
Space sharing: multiple volumes share the same free space pool
Cloning: instant copies of files without duplicating data blocks

class APFSBlock:
    def __init__(self, block_id, data=b''):
        self.block_id = block_id
        self.data = data
        self.ref_count = 1

class APFSFile:
    def __init__(self, name, data_blocks):
        self.name = name
        self.data_blocks = data_blocks  # list of block IDs

    def modify(self, offset, new_data, block_allocator):
        """Copy-on-write: allocate new block, leave old block intact"""
        old_id = self.data_blocks[offset]
        new_id = block_allocator.allocate(new_data)
        self.data_blocks[offset] = new_id
        block_allocator.decrement_ref(old_id)
        return old_id  # old block is preserved if cloned

class APFSBlockAllocator:
    def __init__(self):
        self.blocks = {}
        self.next_id = 0

    def allocate(self, data):
        block = APFSBlock(self.next_id, data)
        self.blocks[self.next_id] = block
        self.next_id += 1
        return block.block_id

    def decrement_ref(self, block_id):
        self.blocks[block_id].ref_count -= 1
        if self.blocks[block_id].ref_count == 0:
            print(f'  Freed block {block_id} (no more references)')

# Simulate CoW
allocator = APFSBlockAllocator()
file = APFSFile('document.txt', [
    allocator.allocate(b'Hello World!'),
    allocator.allocate(b'Second block'),
])

print(f'Before modification: blocks {file.data_blocks}')
old_block = file.modify(0, b'Modified data', allocator)
print(f'After CoW modification: blocks {file.data_blocks}')
print(f'Old block {old_block} was not overwritten; a snapshot or clone would have kept it alive')

Expected output:

Before modification: blocks [0, 1]
  Freed block 0 (no more references)
After CoW modification: blocks [2, 1]
Old block 0 was not overwritten; a snapshot or clone would have kept it alive
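The simulation above has only one reference per block, so the old block is released as soon as the file moves on. A snapshot changes that by holding a second reference. The sketch below is a minimal model assuming a single file per volume; CowVolume and its methods are hypothetical names, not an APFS API.

```python
class CowVolume:
    """Toy copy-on-write volume with reference-counted blocks (illustrative only)."""

    def __init__(self):
        self.blocks = {}      # block_id -> bytes
        self.refs = {}        # block_id -> reference count
        self.next_id = 0
        self.live = []        # block IDs of the live file
        self.snapshots = {}   # snapshot name -> list of block IDs

    def _alloc(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.refs[block_id] = 1
        return block_id

    def _release(self, block_id):
        self.refs[block_id] -= 1
        if self.refs[block_id] == 0:
            del self.blocks[block_id], self.refs[block_id]

    def append(self, data):
        self.live.append(self._alloc(data))

    def snapshot(self, name):
        """Share every live block with the snapshot; no data is copied."""
        self.snapshots[name] = list(self.live)
        for block_id in self.live:
            self.refs[block_id] += 1

    def write(self, index, data):
        """Copy-on-write: point the live file at a new block, release the old one."""
        old_id = self.live[index]
        self.live[index] = self._alloc(data)
        self._release(old_id)

    def delete_snapshot(self, name):
        for block_id in self.snapshots.pop(name):
            self._release(block_id)

vol = CowVolume()
vol.append(b'Hello World!')
vol.append(b'Second block')
vol.snapshot('before-edit')
vol.write(0, b'Modified data')

live_data = [vol.blocks[i] for i in vol.live]
snap_data = [vol.blocks[i] for i in vol.snapshots['before-edit']]
print(live_data)   # [b'Modified data', b'Second block']
print(snap_data)   # [b'Hello World!', b'Second block']
print(f'blocks in use: {len(vol.blocks)}')   # 3

vol.delete_snapshot('before-edit')
print(f'blocks in use after deleting snapshot: {len(vol.blocks)}')   # 2
```

The unchanged second block is shared between the live file and the snapshot, and the original first block is only freed once the snapshot that references it is deleted.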

FAT32 and exFAT

FAT32 (File Allocation Table, 32-bit) is compatible with almost every OS, but a single file cannot exceed 4 GiB minus one byte.

exFAT extends FAT for large files and flash media (SD cards >32GB). No journaling, widely supported, recommended for USB drives shared across Windows/macOS/Linux.
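The FAT32 limit follows directly from its 32-bit file size field. As a small worked example, the helper fits_on_fat32 below (a hypothetical name) checks whether a file of a given size could be stored:

```python
FAT32_MAX_FILE_SIZE = 2**32 - 1   # largest value of a 32-bit unsigned size field

def fits_on_fat32(size_bytes):
    """Return True if a single file of this size can be stored on FAT32."""
    return 0 <= size_bytes <= FAT32_MAX_FILE_SIZE

GIB = 1024**3
print(FAT32_MAX_FILE_SIZE)          # 4294967295
print(fits_on_fat32(4 * GIB - 1))   # True
print(fits_on_fat32(4 * GIB))       # False
```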

Btrfs and ZFS

Both are advanced copy-on-write file systems with built-in volume management.

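Feature	Btrfs	ZFS
Copy-on-write	Yes	Yes
Snapshots	Yes (writable by default, read-only optional)	Yes (read-only; writable via clones)
Compression	lzo, zstd, zlib	lz4, gzip, zle, zstd (OpenZFS 2.0+)
RAID	0, 1, 10, 5/6 (5/6 not recommended for production)	stripe, mirror, RAID-Z1/Z2/Z3
Deduplication	Yes (out-of-band, via external tools)	Yes (inline)
Checksumming	CRC-32C (default)	Fletcher-4 (default), SHA-256
Max volume size	16 EiB	256 ZiB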
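Checksumming is what lets these file systems detect silent corruption on read. The sketch below shows the general idea with SHA-256 from Python’s hashlib. ChecksummedStore is a hypothetical toy class; real implementations keep the checksum in the parent block pointer or a dedicated tree rather than in a dictionary.

```python
import hashlib

class ChecksummedStore:
    """Toy block store that records a SHA-256 checksum for every block written."""

    def __init__(self):
        self.blocks = {}      # block_id -> bytearray (the data "on disk")
        self.checksums = {}   # block_id -> hex digest recorded at write time

    def write(self, block_id, data):
        self.blocks[block_id] = bytearray(data)
        self.checksums[block_id] = hashlib.sha256(data).hexdigest()

    def read(self, block_id):
        data = bytes(self.blocks[block_id])
        # Verify the data against the checksum stored at write time
        if hashlib.sha256(data).hexdigest() != self.checksums[block_id]:
            raise IOError(f'checksum mismatch in block {block_id}')
        return data

cstore = ChecksummedStore()
cstore.write(7, b'important data')
clean_read = cstore.read(7)
print(clean_read)   # b'important data'

cstore.blocks[7][0] ^= 0x01   # simulate a single flipped bit on disk
try:
    cstore.read(7)
    corruption_detected = False
except IOError as exc:
    corruption_detected = True
    print(exc)   # checksum mismatch in block 7
```

With redundancy (a mirror or parity), a file system could go one step further and repair the bad block from a good copy; this sketch only detects the error.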

VFS — Virtual File System

VFS provides a common interface for all file systems. System calls like open(), read(), write() go through VFS, which dispatches to the specific file system’s implementation.

class VFSNode:
    def __init__(self, name, is_directory=False):
        self.name = name
        self.is_directory = is_directory
        self.children = {}
        self.data = b''

    def open(self, path):
        parts = [p for p in path.split('/') if p]  # '/' resolves to the root node
        node = self
        for part in parts:
            if part in node.children:
                node = node.children[part]
            else:
                raise FileNotFoundError(path)
        return node

    def read(self):
        if self.is_directory:
            return list(self.children.keys())
        return self.data

    def write(self, data):
        self.data = data
        return len(data)

# Simulate VFS
root = VFSNode('/', is_directory=True)
home = VFSNode('home', is_directory=True)
root.children['home'] = home
readme = VFSNode('readme.txt')
readme.write(b'Hello from VFS!')
home.children['readme.txt'] = readme

node = root.open('/home/readme.txt')
print(f'Opened: /home/{node.name}')
print(f'Content: {node.read()}')

node2 = root.open('/home')
print(f'Directory listing: {node2.read()}')

Expected output:

Opened: /home/readme.txt
Content: b'Hello from VFS!'
Directory listing: ['readme.txt']
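The simulation above models path resolution inside one tree. The dispatch side of VFS, where a call is routed to whichever file system is mounted at that path, can be sketched as follows. All class names here (FileSystem, Ext4Driver, FatDriver, VFS) are hypothetical and greatly simplified compared with the kernel’s actual VFS structures.

```python
class FileSystem:
    """Interface every (hypothetical) file system driver implements."""
    def read(self, path):
        raise NotImplementedError

class Ext4Driver(FileSystem):
    def read(self, path):
        return f'ext4 read of {path}'

class FatDriver(FileSystem):
    def read(self, path):
        return f'FAT32 read of {path}'

class VFS:
    def __init__(self):
        self.mounts = {}   # mount point -> driver

    def mount(self, mount_point, driver):
        self.mounts[mount_point.rstrip('/') or '/'] = driver

    def resolve(self, path):
        """Pick the longest mount point that is a prefix of the path."""
        best = None
        for mount_point in self.mounts:
            prefix = mount_point if mount_point == '/' else mount_point + '/'
            if path == mount_point or path.startswith(prefix):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            raise FileNotFoundError(path)
        relative = path[len(best):].lstrip('/')
        return self.mounts[best], '/' + relative

    def read(self, path):
        # Same call for every caller; the mounted driver does the real work
        driver, relative = self.resolve(path)
        return driver.read(relative)

vfs = VFS()
vfs.mount('/', Ext4Driver())
vfs.mount('/mnt/usb', FatDriver())
print(vfs.read('/home/readme.txt'))     # ext4 read of /home/readme.txt
print(vfs.read('/mnt/usb/photo.jpg'))   # FAT32 read of /photo.jpg
```

The caller uses one read() call in both cases; which implementation runs depends only on where the path is mounted.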

Hard Links vs Symbolic Links

Feature	Hard Link	Symbolic Link
Points to	Inode	Path
Across file systems	No	Yes
Directory links	No (usually)	Yes
Orphan if target deleted	Data still accessible	Broken link
Size	Same as original	Path length

class SimulatedFS:
    def __init__(self):
        self.inodes = {}
        self.dir_entries = {}

    def create_file(self, path, data):
        inode_num = len(self.inodes) + 1
        self.inodes[inode_num] = {'data': data, 'links': 1}
        self.dir_entries[path] = inode_num
        return inode_num

    def hard_link(self, src, dst):
        if src in self.dir_entries:
            inode = self.dir_entries[src]
            self.dir_entries[dst] = inode
            self.inodes[inode]['links'] += 1
            print(f'Hard link: {dst} → {src} (same inode {inode})')

    def sym_link(self, src, dst):
        # Symbolic link stores the path string
        self.dir_entries[dst] = f'SYMLINK→{src}'
        print(f'Sym link: {dst} → {src} (path: {src})')

    def delete(self, path):
        entry = self.dir_entries.get(path)
        if isinstance(entry, int):
            self.inodes[entry]['links'] -= 1
            if self.inodes[entry]['links'] == 0:
                del self.inodes[entry]
                print(f'Inode {entry} freed')
        del self.dir_entries[path]

fs = SimulatedFS()
fs.create_file('/original.txt', b'Hello!')
fs.hard_link('/original.txt', '/hardlink.txt')
fs.sym_link('/original.txt', '/symlink.txt')

print(f'\nDelete /original.txt')
fs.delete('/original.txt')
print(f'Hard link still works: inode {fs.dir_entries["/hardlink.txt"]}')
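print(f'Sym link: {fs.dir_entries["/symlink.txt"]} (broken!)')

Expected output:

Hard link: /hardlink.txt → /original.txt (same inode 1)
Sym link: /symlink.txt → /original.txt (path: /original.txt)

Delete /original.txt
Hard link still works: inode 1
Sym link: SYMLINK→/original.txt (broken!)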
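The same behaviour can be observed on a real file system with os.link and os.symlink. This sketch assumes a POSIX system whose temporary directory supports both hard links and symbolic links; on Windows, creating symlinks may require extra privileges.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, 'original.txt')
    hard = os.path.join(tmp, 'hardlink.txt')
    soft = os.path.join(tmp, 'symlink.txt')

    with open(original, 'w') as f:
        f.write('Hello!')

    os.link(original, hard)      # second directory entry for the same inode
    os.symlink(original, soft)   # separate file whose content is the target path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink
    print(f'same inode: {same_inode}, link count: {link_count}')

    os.remove(original)          # removes one name; the inode still has a link
    with open(hard) as f:
        hard_content = f.read()
    # islink() inspects the link itself; exists() follows it to the missing target
    symlink_broken = os.path.islink(soft) and not os.path.exists(soft)
    print(f'hard link content: {hard_content!r}')
    print(f'symlink broken: {symlink_broken}')
```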

Common Mistakes

1. Confusing inodes and filenames

An inode stores metadata; the filename is in the directory entry. Multiple filenames (hard links) can point to the same inode.

2. Using FAT32 for files larger than 4GB

FAT32 has a 4GB maximum file size. Use exFAT or NTFS for large files on external drives.

3. Not considering file system in database performance

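Databases on ext4 in ordered mode may perform better than on CoW file systems (Btrfs/ZFS), because frequent in-place updates fragment files under copy-on-write. Tuning helps: for example, disabling CoW for database files on Btrfs (nodatacow or chattr +C) or matching the ZFS recordsize to the database page size.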

4. Running out of inodes

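A file system whose inodes are all in use can’t create new files, even when free blocks remain. ext4 fixes the inode count when the file system is created; the default ratio is enough for typical use, but a partition holding many tiny files can run out. Check usage with df -i.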

5. Ignoring snapshots on CoW file systems

ZFS and Btrfs snapshots keep holding the blocks they reference until the snapshot is deleted. If deleting files does not free any space, retained snapshots are a common cause.

6. Using hard links across file systems

Hard links can’t cross file system boundaries. Use symbolic links for cross-filesystem references.

Practice Questions

What’s stored in an ext4 inode vs a directory entry? An inode stores metadata (permissions, timestamps, block pointers). A directory entry maps a filename to an inode number.
How does NTFS MFT work? The MFT is a table of fixed-size file records. Each file has at least one record containing attributes (name, data, security, etc.). Small files store data directly in the MFT record, and directories use B-tree indexes for name lookup.
What is copy-on-write in APFS? When data is modified, APFS writes new data to a new block instead of overwriting. The old block is preserved for snapshots and cloning.
Why doesn’t FAT32 support files larger than 4GB? FAT32 uses 32-bit fields for file size, with a maximum value of 2³²-1 = 4,294,967,295 bytes (~4GB).
What’s the difference between VFS and a file system? VFS is the kernel abstraction layer that provides a common interface for all file systems. The file system is the specific implementation (ext4, NTFS).

Challenge

Implement a simple CoW file system in Python with snapshots. When a snapshot is taken, preserve all referenced data blocks. On file modification, allocate new blocks for changed data. Deleting a block should only free it when no snapshot references it.

Real-World Task

On Linux, run stat /etc/passwd to see an inode firsthand. Then use df -i / to check inode usage. Run mount | grep '^/' to see which file systems are mounted.
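The same information is available from Python through os.stat and os.statvfs. The sketch below uses a temporary file rather than /etc/passwd so that it runs anywhere with a Unix-like OS; os.statvfs is not available on Windows, and some file systems report an inode total of 0.

```python
import os
import stat
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    info = os.stat(tmp.name)            # roughly what `stat <file>` shows
    print(f'inode number : {info.st_ino}')
    print(f'link count   : {info.st_nlink}')
    print(f'mode         : {stat.filemode(info.st_mode)}')
    print(f'size (bytes) : {info.st_size}')

    fs_info = os.statvfs(tmp.name)      # roughly what `df -i` shows (Unix only)
    total_inodes = fs_info.f_files
    free_inodes = fs_info.f_ffree
    print(f'inodes total : {total_inodes}')
    print(f'inodes free  : {free_inodes}')
```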

FAQ

What is file system fragmentation?

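Fragmentation occurs when a file’s data blocks are scattered across the disk instead of being contiguous. ext4 reduces it with extents and delayed allocation. NTFS is more prone to fragmentation over time, which is why Windows schedules defragmentation for hard disks. Defragmentation reorganises data so it can be read sequentially; it matters far less on SSDs, which have no seek penalty.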

What is TRIM and why does it matter for SSDs?

TRIM tells the SSD which blocks are free, allowing the controller to garbage-collect them. Without TRIM, SSD write performance degrades over time. ext4 supports both online TRIM (the discard mount option) and batched TRIM (the fstrim command).

What is the maximum file size in ext4?

16 TiB for ext4 with 4KB blocks. The maximum volume size is 1 EiB.
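Both figures can be derived from the block size. The short calculation below assumes 4 KiB blocks, 32-bit logical block numbers within a file, and 48-bit physical block numbers on the volume:

```python
BLOCK_SIZE = 4096   # 4 KiB blocks
TIB = 2**40
EIB = 2**60

max_file = 2**32 * BLOCK_SIZE     # 32-bit logical block numbers within a file
max_volume = 2**48 * BLOCK_SIZE   # 48-bit physical block numbers

print(max_file // TIB, 'TiB')     # 16 TiB
print(max_volume // EIB, 'EiB')   # 1 EiB
```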

What is a mount point?

A directory where a file system is attached to the VFS tree. /home might be a separate partition mounted at boot. mount /dev/sda1 /mnt attaches a file system to /mnt.

Can I recover data from a formatted drive?

A quick format typically rewrites only the file system metadata, not the data blocks. Tools like TestDisk and PhotoRec can often recover data unless the drive was fully overwritten, securely wiped, or (on SSDs) trimmed.

Mini Project: File System Simulator

Build a simulated file system that:

Supports create, read, write, delete, hard link, symlink operations
Uses inode-based metadata storage
Implements a simple journal for crash recovery
Reports space usage and fragmentation

Security angle: File system auditing (monitoring creates, deletes, and permission changes) is critical for intrusion detection. Durga Antivirus Pro uses file system filtering to scan files on creation and detect ransomware behaviour patterns.

What’s Next

Interprocess Communication — Next Lesson

Virtualization & Containers

Review: Process Scheduling

Before moving on, you should understand:

ext4 inode/extent structure and journaling modes
NTFS MFT records and B-tree directory indexes
APFS copy-on-write and snapshot semantics
VFS abstraction layer
Hard vs symbolic link differences

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.