File Systems
A filesystem is the answer to the question: how do you store and retrieve named data on a device that only understands block reads and writes?
The Core Abstraction
A filesystem provides a hierarchical namespace (directories and files) over a flat sequence
of fixed-size blocks on a storage device. The kernel’s VFS (Virtual File System) layer
provides a uniform interface (open, read, write, stat) regardless of the underlying filesystem.
Inodes
Every file and directory is represented by an inode — a data structure storing metadata: permissions, owner, timestamps, size, and crucially, pointers to the data blocks.
The inode does not store the filename. Filenames live in directory entries (dentries), which map names → inode numbers. This is why hard links work: two dentries pointing to the same inode.
Directory entry: "main.c" → inode 42
Inode 42: mode=0644, size=1337, blocks=[block_104, block_105, ...]
Block 104: (first 4096 bytes of file data)
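The name-to-inode mapping can be observed from user space. The following is a minimal sketch, assuming a POSIX system and a temporary directory on a filesystem that supports hard links; the helper name `hard_link_demo` and the file names are illustrative only.

```python
import os
import tempfile


def hard_link_demo():
    """Create a file plus a hard link to it; return both inode numbers and the link count."""
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "main.c")
        alias = os.path.join(d, "alias.c")
        with open(original, "w") as f:
            f.write("int main(void) { return 0; }\n")
        # os.link adds a second directory entry that refers to the same inode.
        os.link(original, alias)
        st_original = os.stat(original)
        st_alias = os.stat(alias)
        return st_original.st_ino, st_alias.st_ino, st_original.st_nlink


if __name__ == "__main__":
    ino_a, ino_b, nlink = hard_link_demo()
    print("same inode:", ino_a == ino_b, "link count:", nlink)
```

Both names report the same inode number, and the inode's link count reflects the two directory entries that refer to it.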
Extent-Based vs Block-Map
- Block map (ext2): the inode contains direct, indirect, and double-indirect block pointers. Simple, but large files need extra reads through the indirect blocks.
- Extents (ext4): the inode contains extents, i.e. (start_block, length) pairs. Far fewer metadata lookups for large contiguous files.
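To make the difference concrete, here is a small sketch that collapses a per-block list into (start_block, length) pairs. The function name `blocks_to_extents` is made up for illustration, and the code models only the idea, not the on-disk ext4 extent format.

```python
def blocks_to_extents(blocks):
    """Collapse an ordered list of block numbers into (start_block, length) pairs."""
    extents = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block directly follows the last extent, so extend that extent.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # A gap (or the very first block) starts a new extent.
            extents.append((block, 1))
    return extents


if __name__ == "__main__":
    print(blocks_to_extents([104, 105, 106, 200, 201]))
    # prints [(104, 3), (200, 2)]
```

Five block pointers shrink to two extents here; for a fully contiguous file the block list collapses to a single pair regardless of file size.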
Journaling
Without journaling, a crash mid-write can leave the filesystem in an inconsistent state (e.g. inode updated but block bitmap not). A journal (write-ahead log) records intended changes before applying them. On recovery, transactions that were fully committed to the journal are replayed, and incomplete ones are discarded.
ext4 supports three journaling modes:
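- journal: both data and metadata are journaled (slowest, safest)
- ordered: only metadata is journaled, but data blocks are written out before the metadata that refers to them is committed (default)
- writeback: only metadata is journaled, with no ordering between data and metadata writes (fastest, least safe)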
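The recovery rule for a write-ahead log can be sketched with a toy model. The record layout below (`begin`, `write`, `commit` tuples) and the function name `recover` are assumptions made for this example; they do not reflect ext4's actual journal format.

```python
def recover(journal):
    """Replay a toy write-ahead log and return the recovered state.

    Each record is ("begin", txid), ("write", txid, key, value) or ("commit", txid).
    Only writes belonging to a committed transaction are applied.
    """
    committed = {record[1] for record in journal if record[0] == "commit"}
    state = {}
    for record in journal:
        if record[0] == "write" and record[1] in committed:
            _, _, key, value = record
            state[key] = value
    return state


if __name__ == "__main__":
    journal = [
        ("begin", 1),
        ("write", 1, "inode 42 size", 1337),
        ("write", 1, "bitmap block 104", "used"),
        ("commit", 1),
        ("begin", 2),
        ("write", 2, "inode 43 size", 10),
        # Simulated crash: transaction 2 never reached its commit record.
    ]
    print(recover(journal))
```

Transaction 1 is applied as a unit, so the inode and the bitmap stay consistent with each other, while the half-written transaction 2 leaves no trace.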
The VFS Layer
Linux’s VFS defines tables of function pointers (inode_operations, file_operations,
super_operations) that each filesystem fills in with its own implementations of the
operations it supports. The kernel calls through these tables, which is why read() works
identically on ext4, tmpfs, procfs, and network filesystems.
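The dispatch idea can be modeled in user space. This is only an analogy: the `FileOperations` class, the `MOUNTS` table, and the two read functions are hypothetical stand-ins, not kernel structures.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FileOperations:
    """Toy stand-in for a per-filesystem table of operations."""
    read: Callable[[str], bytes]


def disk_read(path: str) -> bytes:
    # A disk-backed filesystem would fetch blocks from a device here.
    return b"data from disk for " + path.encode()


def virtual_read(path: str) -> bytes:
    # A virtual filesystem would generate the content on demand here.
    return b"generated on demand for " + path.encode()


# Hypothetical mount table: path prefix -> operations of the filesystem mounted there.
MOUNTS = {
    "/home": FileOperations(read=disk_read),
    "/proc": FileOperations(read=virtual_read),
}


def vfs_read(path: str) -> bytes:
    """Find the mount that owns the path and call through its operations table."""
    for prefix, ops in MOUNTS.items():
        if path == prefix or path.startswith(prefix + "/"):
            return ops.read(path)
    raise FileNotFoundError(path)


if __name__ == "__main__":
    print(vfs_read("/home/user/main.c"))
    print(vfs_read("/proc/1/maps"))
```

The caller of `vfs_read` never learns which implementation ran, which mirrors how a single read() call serves every filesystem type.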
procfs and sysfs
Not all filesystems store data on disk. /proc and /sys are virtual filesystems:
they generate their content dynamically from kernel data structures.
/proc/PID/maps shows a process’s memory map; /sys/block/sda/stat shows disk I/O stats.
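Because these are ordinary files from the caller's point of view, plain file I/O is enough to read them. A minimal sketch, assuming a Linux system with procfs mounted at /proc; the helper name `read_own_memory_map` is illustrative, and it returns an empty list where the file cannot be read.

```python
def read_own_memory_map(limit=5):
    """Return up to `limit` lines of /proc/self/maps, or [] if it cannot be read."""
    try:
        with open("/proc/self/maps") as f:
            # zip stops after `limit` items, so only the first lines are taken.
            return [line.rstrip("\n") for _, line in zip(range(limit), f)]
    except OSError:
        # No procfs here (for example on a non-Linux system).
        return []


if __name__ == "__main__":
    for line in read_own_memory_map():
        print(line)
```

/proc/self refers to the calling process, so the output describes the memory map of the Python interpreter itself.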
Open Questions
- How does copy-on-write (Btrfs, ZFS) change the journaling model?
- What is fsync() actually doing, and why is it so expensive?
- How does mmap interact with the page cache for file-backed mappings?