A great talk by Dan Luu on files. Dan also has two great posts on his website, “Files are fraught with peril” and “Filesystem error handling”. My key takeaways:
Files are literally impossible to use safely and correctly on modern computers.
If you really care about data integrity, you should put the data in a database instead of writing files yourself. The database designers almost certainly put more thought and care into ensuring data integrity on disk than you have time to do. sqlite is particularly good.
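As a rough illustration of what handing the problem to sqlite can look like, here is a minimal sketch using Python's standard `sqlite3` module. The function names, the `settings` table, and its schema are made up for this example. The inserts run inside one transaction, so they are committed together or rolled back together.

```python
import sqlite3

def save_settings(db_path, settings):
    """Store key/value pairs; the inserts commit together or not at all."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS settings (key TEXT PRIMARY KEY, value TEXT)"
        )
        with conn:  # commits on success, rolls back if the block raises
            conn.executemany(
                "INSERT OR REPLACE INTO settings (key, value) VALUES (?, ?)",
                list(settings.items()),
            )
    finally:
        conn.close()

def load_settings(db_path):
    """Read every stored pair back into a dict."""
    conn = sqlite3.connect(db_path)
    try:
        return dict(conn.execute("SELECT key, value FROM settings"))
    finally:
        conn.close()
```

How durable a commit is still depends on sqlite's configuration and on the underlying storage, so treat this as a starting point rather than a guarantee.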
One big problem is fsync, the system call normally used to flush a file’s dirty pages to stable storage. It serves a dual purpose: it works as a fence that prevents reordering of writes, but it is also an instruction to clear all caches. Programmers often need to prevent reordering of writes, but rarely want to clear all caches. The result is that code often needs to liberally call fsync in order to be correct, but code that liberally calls fsync often has terrible performance.
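For concreteness, a call to fsync from Python might look like the sketch below. The helper name `write_durably` is hypothetical. Whether the data really reaches stable storage after `os.fsync` returns depends on the operating system, the filesystem, and the hardware, so this only shows the shape of the call, not a durability guarantee.

```python
import os

def write_durably(path, data):
    """Write bytes to path, then ask the OS to flush them to stable storage."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # move Python's userspace buffer into the kernel
        os.fsync(f.fileno())  # ask the kernel to flush this file's dirty pages
```

Calling a helper like this after every small write is exactly the pattern that tends to be correct but slow.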
Personally, I (Jon Shea, not Dan Luu) have found the sqlite “Technical and Design Documentation” to be exceptional. The doc for “Atomic Commit In SQLite” is the only in-depth tutorial I have ever found on designing practical, correct on-disk data structures, and “SQLite Query Optimizer Overview” is an outstanding introduction to database query optimizers.
Two misguided workarounds that are commonly suggested are (1) write your data in a new file and rename it into place, or (2) only append to existing files. Neither of these is safe. rename is usually atomic during normal operation when the source and destination reside on the same filesystem, but it offers no atomicity guarantee across sudden power loss, and on some platforms it can fail if the destination already exists. append neither prevents reordering nor guarantees atomicity on crashes.
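For reference, the first workaround is usually written along the lines of the sketch below. It is POSIX-oriented: opening a directory in order to fsync it does not work on Windows. The function name `replace_file` and the `.tmp` suffix are made up for this example. It shows what is commonly suggested, not a pattern that is safe: the fsync calls narrow the window for trouble, but what survives a crash or power loss still depends on the filesystem.

```python
import os

def replace_file(path, data):
    """The commonly suggested write-then-rename pattern (not crash-proof)."""
    tmp_path = path + ".tmp"  # hypothetical temp name, same directory as path
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())    # flush the new contents before renaming
    os.replace(tmp_path, path)  # rename over the destination
    # fsync the parent directory so the rename itself is flushed (POSIX only)
    dir_fd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```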