Too many files: Reiser FS vs hashed paths
On my hard drive, some directories attract lint faster than my CPU fan. I wipe then clean but as soon as I blink they already contain a zillion files. And for some reason, having a zillion files in a directory on GNU/Linux is a really bad thing.
With Gazest, I decided to store only the files meta data in the database and to keep the content on disk. To prevent name clashes, the content filenames are the HMAC-SHA1 hashes of the data. Of course, that means that I have to expect a zillion files in the content directory and as soon as you mention the problem of to many files you hear "Reiser FS," loud and proud from the cube next to yours. But it is really the best solution?
I'll focus on a single problem: access time of a file stored in a directory with a lot of other files. I don't care about CPU usage, disk usage, fragmentation, paging, or recovery tools. When I read a file, how long will it take? Thats it. Surprisingly, I could not find good benchmark for that typical use case. Sure, there are many anecdotal tales of people who noticed a significant improvement with a new FS when they re-installed their mail server but they end up comparing completely different setups with a variable real world server load. Lack of measurements lead to counter productive faith based debates so here is my attempt to quantify the issue.
First of all, is there a problem with having many files in the same directory? If you have enough RAM, not really. The Linux kernel caches files attributes when it finds spare RAM. That means that you could ask for a file and instantly know its inode number without hitting the platters. In the initial version of this benchmark, I didn't flush the file system cache and file access time was almost constant no matter how many files there was in a directory. But for most people, there isn't enough RAM so here is the improved benchmark.
For various file systems, I formated a partition then filled it with many 1 Kb files for three different definitions of "many": 216, 218, and 220. Then I did typical operations: create files, delete files, open files (seek only), and read files (seek+read), taking care of flushing the file system cache before each test. Each operation was repeated 16384 (214) times.
The default file system on GNU/Linux is ext3. There was a time when it implemented directory indices with a linked list and for some reason, this option is still available. Quite predictably, this flavor of ext3 have major problems when directories contain too many files.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 108.29s | 279.12s | 969.44s |
| delete | 26.10s | 146.82s | 424.64s |
| seek | 21.03s | 116.61s | 365.55s |
| seek+read | 150.53s | 280.87s | 504.70s |
ext3 without dir_index
Fortunately, modern ext3 implementations use B-trees for directory indices. This option is the default on virtually all non baroque GNU/Linux distribution so if you didn't twist any knobs at installation time, this is what you have.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 3.93s | 62.23s | 179.77s |
| delete | 17.75s | 113.42s | 272.55s |
| seek | 12.62s | 79.66s | 192.78s |
| seek+read | 144.61s | 241.12s | 355.39s |
ext3 with dir_index
With dir_index, ext3 is much faster at creating and deleting but it still slows down noticeably when we reach the million files mark. When looking for the state of the art in file systems, people often point to Reiser FS. The catch is that the really awesome file system is Reiser-4 while most distribution only ships with Reiser-3.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 16.87s | 43.58s | 119.04s |
| delete | 125.22s | 224.19s | 347.85s |
| seek | 74.10s | 124.59s | 190.53s |
| seek+read | 102.37s | 179.56s | 249.81s |
Reiser FS 3
Under a million files, ext3 is as good as Reiser-3. I'm not sure I'd advise someone to switch to Reiser-3 unless he is looking at some other metric like disk usage. I decided to also test XFS but I admit that this test isn't completely fair. XFS has many tunable parameters but I simply went with the defaults.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 203.04s | 181.30s | 350.25s |
| delete | 196.99s | 357.18s | 550.56s |
| seek | 18.06s | 89.35s | 231.65s |
| seek+read | 120.09s | 197.56s | 336.77s |
XFS with defaults
XFS behaves much like Reiser-3: there is a small speedup on file access but it's at the cost of a much slower index manipulation.
There is yet another option beside changing the file system: hashed
paths. The idea is to transparently distribute the files into
subdirectories to prevent too many files from being stored together.
By using a hash function, usually applied to the content of the
file, we ensures an even distribution of the file names. The hash
is then split into subdirectories. As an example, if a file hashes
to f6aa350811263c, it will be stored in
f/f6/f6aa350811263c. This effectively reduces the
number or files per directory by 256, on average. Obviously, there
is a non trivial overhead with hashed path and it involves hacking
the application code. Is it really worth it? I benchmarked the
implementation from Beaker, a
session framework for Python.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 44.22s | 136.79s | 257.82s |
| delete | 23.02s | 122.29s | 226.26s |
| seek | 19.23s | 76.48s | 149.32s |
| seek+read | 171.86s | 237.16s | 317.47s |
ext3 without dir_index, hashed paths
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 53.14s | 156.51s | 312.89s |
| delete | 27.91s | 154.68s | 302.12s |
| seek | 24.94s | 95.74s | 209.73s |
| seek+read | 176.87s | 259.22s | 376.76s |
ext3 with dir_index, hashed paths
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 29.55s | 80.72s | 161.47s |
| delete | 162.58s | 291.55s | 461.67s |
| seek | 100.00s | 167.54s | 271.79s |
| seek+read | 133.25s | 224.31s | 345.36s |
Reiser-3, hashed paths
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 206.31s | 374.71s | 608.08s |
| delete | 168.65s | 340.82s | 552.25s |
| seek | 24.52s | 93.04s | 202.19s |
| seek+read | 165.73s | 213.31s | 293.43s |
XFS, defaults, hashed paths
Unless you have ext3 without dir_index, and there is a much better solution for that, the overhead of hashed paths is slowing you down. Beaker's implementation is lazy: it creates the subdirectories as it needs them so it needs to do a few extra checks every time it looks for a file. An implementation that expects the full subdirectories tree to exists would save a few steps but this is missing the big picture: if you have problems with too many files, you need to stop hitting the disk every time you are looking for an inode number. Not flushing the FS cache, that is, simulating the effect of more RAM, beats any other solution.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 0.98s | 29.32s | 90.91s |
| delete | 0.83s | 3.46s | 18.91s |
| seek | 0.16s | 0.16s | 49.71s |
| seek+read | 0.56s | 2.58s | 144.03s |
ext3 with dir_index, with FS cache
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 1.26s | 32.54s | 109.83s |
| delete | 2.51s | 20.32s | 22.19s |
| seek | 0.40s | 0.41s | 56.93s |
| seek+read | 0.83s | 2.16s | 168.83s |
ext3 with dir_index, with FS cache, hached paths
It seems after all that ext3 is really good. So, if you end up
storing lots of files in the same directory, the only thing you have
to do it to ensure that your ext3 file system has dir_index
activated. You can verify that with tune2fs -l
/dev/foo. If you don't have dir_index, don't panic. You can
fix that from a live CD with
tune2fs -O dir_index /dev/foo
e2fsck -D /dev/foo
Don't take my word for it, run the benchmark yourself and hack it to measure the usage that worries you. And put more RAM in your boxen, that always works.
update: By popular demand, I benchmarked JFS. The index manipulation is really slow and the rest is somewhat comparable to XFS. It's said to excel with large file; definitely a specialized file system that should not be used for general usage. I won't pretend that a million files per directory is general usage but JFS noticeably slower even with 216 files. I'll try to benchmark Reiser 4 if I find a moment.
| 216 | 218 | 220 | |
|---|---|---|---|
| create | 92.26s | 273.34s | 491.12s |
| delete | 144.20s | 322.60s | 495.35s |
| seek | 62.87s | 183.33s | 272.39s |
| seek+read | 155.19s | 270.21s | 350.61s |
JFS
Comments
I'd be interested to see how JFS compares - it's remarkable faster than ext3 for large (>1GB) file creation and deletion, but I haven't done any benchmarking the "too many files" issue.
I just added the benchmark times for JFS.
What about ZFS performance? THAT's the hot fs of today.
In the hashed version, why use the first letter twice - that is,you have something "f6aa<something>" hashed to "f/f6/...". This seems odd as it requires two directory lookups, on information that is replicated. Notice that the "f6" level will contain exactly the same files as if there were no "f/" level. Why not eliminate one level of the directory and just use "f6/..." (and eliminate one lookup) or even (if there are enough files) use "f6/aa/..." - here you'd have two lookups, but the data subdirectory would be much smaller.
Having too many directories is not better but this is all configurable and a large site probably need to tune it a lot. As an example, you can pass the hash in base-64 instead of in hex; you get 4096 less files per directory instead of 256. But fetching a directory index is still a round trip to the platters so you don't want so many directories that you almost never see two files in a row in the same directory. If two files are in the same directory, you have the index in RAM so you can skip a physical lookup.
I did a little research on the small-file-performance problem myself recently.
My conclusions were:
- for every benchmark report, there's another one that conflicts with it
- for every operation that an FS does well, there's another one that it sucks at
My particular test was ext3 vs. Reiser3 vs. XFS. I timed how long it took to expand kernel source trees and Maildirs. Reiser3 won by an order of magnitude. It also sucked for a lot of other operations.
I'd try Reiser4, but it's not supported (easily) in Ubuntu, so it's automatically too hard.
ZFS is not particularly fast for any operations, especially under Linux with the FUSE layer. I'd be more worried about the ongoing memory leaks, to be honest - my fileserver gets rebooted every three days.
When you say "Each operation was repeated 16384 (2^14) times" do you mean that each operation was done on 2^14 files? Or do you mean that each operation was done on all the files in the directory in question 2^14 times?
I posted the code (last para) so you can verify what is measured exactly. Each operation was performed on 2^14 different files.
One big thing about reiserfs, did you enable or disable tails support? (eg notail in the mount options). Enabling tails makes lots of small files much more compact, but there's a definite performance penalty. I'd recommend you try both with and without to see the difference.
Also, none of this testing really actually helps in real world scenarios. You're starting with a clean filesystem and doing a single set of creates/writes/reads/deletes.
Also consider what happens to files over time. We run email servers, and that means that files are always being added/deleted continuously over time (about 5-10% turnover per-week by volume stored on average). That means your FS gets fragmented and tortured over time. We've found in that scenario that reiserfs with notails easily beats ext3, even with dir_indexes enabled.
Regarding ReiserFS 4, or installing any "beta" file system, I recommend running Rugg <http://rugg.sf.net> which really helps in exercising your hard drive + filesystem and see if your installation has a chance to fail or not. It can be reassuring for people in doubt.
Reiser4 is fast. You can even enable cryptcompress(because DATA = chr(0)*SIZE).
FreeDup from http://software.neuper.de/freedup/ uses similar ideas to replace redundant copies of files with hard links. I wouldn't recommend using it unless you understand how hard links work. That said, I've found it works suprisingly well on my OSX 10.5 laptop.
Hashed paths is quite an old idea, I've used it in the sense of giving each document an id number, and converting that into a base 51 number, using a as 0 and then a-zA-Z as the digits.
Format the number to a mimimum number of digits - like 51 = aaaaaaaZ, and reverse it Zaaaaaaa, shift one right, then place a / in between each digit, and add a document root /documents/prod/Z/a/a/a/a/a/a/51.doc.
Thus you can store your docs on up to 52 different partitions.. There are a limited number of levels to traverse, limited to 52 files (in a leaf) or directories (non-leaf) in each directory. You can have many many millions of files before your tree gets very full.

I highly recommend trying out reiser4. We did some informal benchmarking and it was significantly faster than anything else. It's ridiculous.
In terms of beta-ness. Its been in "beta" since about 2005, it's about as polished as any other filesystem. I've been using it on my laptop for over a year with various nasty-for-hard-drives habits like shutting it down abruptly, and it never lost a beat. And overall performance "feels" noticeably faster than plain ext3.
-- shazow (on #pylons occasionally)
I don't have a box with a spare partition that I can reboot at the moment. Could you run the benchmark for
mkfs.raiser4 -yandmkfs.ext3 -O dir_index? I'll normalize your results and post them with the others.reiser seems like a deadend to me -- I don't really see how it can continue with its namesake in jail and namesys running (ran?) out of money.
I mean, I know its open source and everything, but its never /really/ been open source in the sense of free community development -- its always been bankrolled by grant money from namesys...