Too many files: Reiser FS vs hashed paths
On my hard drive, some directories attract lint faster than my CPU fan. I wipe then clean but as soon as I blink they already contain a zillion files. And for some reason, having a zillion files in a directory on GNU/Linux is a really bad thing.
With Gazest, I decided to store only the files meta data in the database and to keep the content on disk. To prevent name clashes, the content filenames are the HMAC-SHA1 hashes of the data. Of course, that means that I have to expect a zillion files in the content directory and as soon as you mention the problem of to many files you hear "Reiser FS," loud and proud from the cube next to yours. But it is really the best solution?
I'll focus on a single problem: access time of a file stored in a directory with a lot of other files. I don't care about CPU usage, disk usage, fragmentation, paging, or recovery tools. When I read a file, how long will it take? Thats it. Surprisingly, I could not find good benchmark for that typical use case. Sure, there are many anecdotal tales of people who noticed a significant improvement with a new FS when they re-installed their mail server but they end up comparing completely different setups with a variable real world server load. Lack of measurements lead to counter productive faith based debates so here is my attempt to quantify the issue.



















