The System is The Solution
Alternative title: “My answer for the next decade of my digital life”.
I am, generally speaking, a data hoarder. I don't really have the money to invest in large ZFS or CephFS or whateverFS clusters of storage, so I try my best with my 1TB drive I use for my entire system (nowadays with a 256GB NVMe drive for the root partition, so another 100GB can be quickly eaten away by datahoard-ness).
If I recall correctly, this way of doing things started back in the 2010s, when broadband in the country was still modest (10Mbps down and less than 1Mbps up was common; nowadays, in 2022, I'm getting almost 300Mbps down and 40Mbps up for the same price as back then), and a girl needed her electronic music fix. Since my school didn't have a connection, I started hoarding YouTube MP3 files from a random youtube-to-mp3 service and copying them to my 2GB phone.
My laptop had way more than 2GB, and there's an infinite amount of music to listen to. What to do? Organizational systems, of course! The first thing I remember doing was numbering folders in increments of 1, where each folder had some 10 to 20 files. That way I could grow the library on my laptop and copy the most recent tunes to my phone very easily (delete all folders below 25, for example), without losing the old stuff, just in case. If you know my Soulseek, you can see that same system still living in 2022.
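A minimal sketch of that "delete all folders below 25" step, assuming the library is a directory whose subfolders are named with plain integers. The function names `numbered_folders` and `prune_below` are made up for illustration; this is not a script I actually used back then.

```python
from pathlib import Path
import shutil


def numbered_folders(library: Path) -> list[tuple[int, Path]]:
    """Return (number, path) pairs for subfolders whose names are plain integers."""
    found = []
    for entry in library.iterdir():
        if entry.is_dir() and entry.name.isdigit():
            found.append((int(entry.name), entry))
    return sorted(found)


def prune_below(library: Path, cutoff: int) -> list[int]:
    """Delete every numbered folder below `cutoff` and return the numbers removed."""
    removed = []
    for number, path in numbered_folders(library):
        if number < cutoff:
            shutil.rmtree(path)  # destructive: only run this on the phone copy
            removed.append(number)
    return removed
```

Running `prune_below(phone_music_dir, 25)` on the phone's copy would keep folders 25 and up while the laptop keeps everything. Folders with non-numeric names are left alone.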
As time passed, I started gathering different kinds of media (videos, papers, books), and I noticed a pattern: they don't lend themselves to sequential order. Say, if I wanted to find some scientific paper I had saved ages ago, and I know what it's about, I would have to either:
– Go through each folder sequentially.
– Hope that I remember the title.
If you're yelling “booru” at the top of your lungs, then yes: booru systems are the ideal solution for this kind of problem. I didn't know it at the time, even though I was a heavy user of booru software to search for new images of anime women holding hands. They are non-hierarchical tagging systems that have (digitally) lived with us for a decade now, but the ideas behind them are much older.
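To make "non-hierarchical tagging" concrete, here is a rough in-memory sketch: every file carries a set of tags, and a search is the intersection of the files behind each requested tag. `TagIndex` and its methods are hypothetical names for illustration, not taken from any booru software.

```python
from collections import defaultdict


class TagIndex:
    """Toy tag index: maps each tag to the set of file ids that carry it."""

    def __init__(self):
        self._files_by_tag = defaultdict(set)

    def add(self, file_id, tags):
        """Attach every tag in `tags` to `file_id`."""
        for tag in tags:
            self._files_by_tag[tag].add(file_id)

    def find(self, *tags):
        """Return the file ids that carry all of the given tags."""
        if not tags:
            return set()
        # .get() avoids creating empty entries for unknown tags.
        candidates = [self._files_by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*candidates)
```

With this, the paper from the example above could be found by what it is about (`index.find("paper", "compilers")`) instead of by which folder it landed in.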
I'm ashamed that it took this long for me to refer to it in writing, but Nayuki's blogpost on non-hierarchical systems is an eye opener to what I've been wanting to have for ages and I didn't know it. It's absurdly long, but it's orchestrated akin to a machine gun of ideas, and I love that kind of stuff.
There are many booru systems out there, but the one that comes closest to what I've been wanting is Hydrus. From the website:

> The hydrus network client is a desktop application written for Anonymous and other internet enthusiasts with large media collections. It organises your files into an internal database and browses them with tags instead of folders, a little like a booru on your desktop.

I have attempted to use it to organize my libraries more than once, since it would “technically” fit like a glove, but I had issues with it. That is not to say that Hydrus is bad, more that it doesn't align with my vision for such a system.
In Hydrus, you import your files into the system, you can add/remove tags, share metadata around in the “hydrus network”, etc. But the biggest dealbreaker for me is that once a file is added, its original location becomes meaningless. To be able to refer to that file in the future, you must use Hydrus to find it, because everything lives in an internal `client_files/` directory where the files are renamed to their hashes.
Everything still works at the filesystem level, sure, but if you lose access to Hydrus, you are left with a `client_files/` folder whose file paths tell you nothing. I perfectly understand where that design decision comes from, but it brings me pain: the sequential folder structure I just described cannot exist inside Hydrus unless I create hacks like symbolic links from `client_files/` into that folder structure.
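A rough sketch of what "renamed to their hashes" means in practice, assuming SHA-256 and a flat store directory. This is not Hydrus's actual code or exact on-disk layout; `import_file` is a hypothetical helper that only shows why the original path stops being recoverable from the store alone.

```python
import hashlib
import shutil
from pathlib import Path


def import_file(source: Path, store: Path) -> Path:
    """Copy `source` into `store`, named after the SHA-256 of its contents."""
    digest = hashlib.sha256()
    with source.open("rb") as handle:
        # Hash in 1 MiB chunks so large media files are not loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    store.mkdir(parents=True, exist_ok=True)
    target = store / (digest.hexdigest() + source.suffix)
    shutil.copyfile(source, target)
    return target
```

The returned path carries the hash and the extension, and nothing else: the folder number, the original filename, and any other context have to come from the database.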
So, if I wanted to make my own non-hierarchical system, it would have to operate as an overlay on top of an existing filesystem, keeping references to the original file paths. That becomes a problem really fast because of renames, and that's the main reason why Hydrus does what it does by design. If a file is renamed, your reference to it goes away. A system that does not take ownership of the file contents entirely would have to keep track of path renaming.
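As a sketch of the overlay idea, assuming a tiny SQLite index that stores paths and tags: the files stay where they are, and the index only has to be told when a path changes. The schema and the function names (`open_index`, `include`, `find`, `record_rename`) are invented for this example and are not the schema of any real tool.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS tags (
    file_id INTEGER NOT NULL REFERENCES files (id),
    tag TEXT NOT NULL,
    UNIQUE (file_id, tag)
);
"""


def open_index(location=":memory:"):
    """Open (or create) the index database."""
    db = sqlite3.connect(location)
    db.executescript(SCHEMA)
    return db


def include(db, path, tags):
    """Register `path` in the index and attach `tags` to it."""
    db.execute("INSERT OR IGNORE INTO files (path) VALUES (?)", (path,))
    (file_id,) = db.execute(
        "SELECT id FROM files WHERE path = ?", (path,)
    ).fetchone()
    db.executemany(
        "INSERT OR IGNORE INTO tags (file_id, tag) VALUES (?, ?)",
        [(file_id, tag) for tag in tags],
    )
    db.commit()


def find(db, tag):
    """Return the current paths of all files carrying `tag`, sorted."""
    rows = db.execute(
        "SELECT path FROM files JOIN tags ON tags.file_id = files.id "
        "WHERE tag = ? ORDER BY path",
        (tag,),
    )
    return [path for (path,) in rows]


def record_rename(db, old, new):
    """Update the index after `old` was renamed to `new` (file or directory)."""
    # Case 1: the renamed thing is an indexed file.
    db.execute("UPDATE files SET path = ? WHERE path = ?", (new, old))
    # Case 2: the renamed thing is a directory; rewrite the path prefix.
    old_prefix = old.rstrip("/") + "/"
    new_prefix = new.rstrip("/") + "/"
    db.execute(
        "UPDATE files SET path = ? || substr(path, ?) "
        "WHERE substr(path, 1, ?) = ?",
        (new_prefix, len(old_prefix) + 1, len(old_prefix), old_prefix),
    )
    db.commit()
```

The sketch ignores awkward cases (a rename onto a path that is already indexed, for instance). The hard part is not the `UPDATE` itself but learning that a rename happened at all, which is what the rest of this post is about.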
There are two approaches I have found for this:
– FUSE.
– syscall tracing.
FUSE is something I have never touched, and it could add high latency to FS operations as things go back and forth between kernel-space and user-space (in theory `io_uring` could help in this case, but I have no idea how it works).
Syscall tracing is possible thanks to the eBPF virtual machine that's in the Linux kernel. `bpftrace` is a CLI utility for Linux that is heavily inspired by DTrace, which works on the BSDs and illumos systems.
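Here is a sketch of how rename events could be pulled out of `bpftrace`, assuming the `tracepoint:syscalls:sys_enter_rename` tracepoint and a tab-separated output format I picked for this example. It is not the script any real tool ships. Real programs often go through `renameat` or `renameat2` instead, the paths can be relative to the calling process, and long paths may be cut off by bpftrace's string length limit, so treat this as the happy path only. `parse_event` and `watch_renames` are hypothetical names.

```python
import subprocess

# Raw string: the \t and \n escapes are meant for bpftrace's printf, not Python.
BPFTRACE_PROGRAM = r'''
tracepoint:syscalls:sys_enter_rename
{
    printf("%d\t%s\t%s\n", pid, str(args->oldname), str(args->newname));
}
'''


def parse_event(line):
    """Turn one 'pid<TAB>old<TAB>new' line into a tuple, or None if it is noise."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3 or not parts[0].isdigit():
        # Banner lines such as "Attaching 1 probe..." end up here, and so do
        # paths that themselves contain tabs (a known limitation of this format).
        return None
    return int(parts[0]), parts[1], parts[2]


def watch_renames():
    """Yield (pid, old_path, new_path) tuples; needs root and bpftrace installed."""
    command = ["bpftrace", "-e", BPFTRACE_PROGRAM]
    with subprocess.Popen(command, stdout=subprocess.PIPE, text=True) as process:
        for line in process.stdout:
            event = parse_event(line)
            if event is not None:
                yield event
```

Each event from `watch_renames()` could then be resolved to an absolute path and used to update whatever index holds the path references.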
So, well. I guess I made it.
Here it is: thanks to `bpftrace` integration, I am able to make my own “rename tracker” and update the index database file automatically. Because of all of that, I have a much harsher opinion on `bpftrace`, but that's answered by “it's not 1.0 yet, so don't complain”.
awtfdb is a collection of CLI tools (at the moment, Linux-only, macOS soon maybe?) that operate on a single index database file, powered mainly by Zig, SQLite, and my power to bother a colleague so hard they make a library. You can include files and their tags into the index with `ainclude`, search for tags with `afind`, remove files through `arm`, etc. Since `bpftrace` has to be run as root, `awtfdb-watcher` is provided so that only that dedicated process can have root access, while the others just operate under your normal user.
This goes back to the alternative title. If all goes well, this project is planned to stay around for a long time for me, and it is how I want to manage my media libraries far into the future. It isn't super stable right now, and there are design questions to answer, but I'm hopeful for that future.