Too many files: Reiser FS vs hashed paths

2007-12-26 (permalink tags: , , )

On my hard drive, some directories attract lint faster than my CPU fan. I wipe then clean but as soon as I blink they already contain a zillion files. And for some reason, having a zillion files in a directory on GNU/Linux is a really bad thing.

With Gazest, I decided to store only the files meta data in the database and to keep the content on disk. To prevent name clashes, the content filenames are the HMAC-SHA1 hashes of the data. Of course, that means that I have to expect a zillion files in the content directory and as soon as you mention the problem of to many files you hear "Reiser FS," loud and proud from the cube next to yours. But it is really the best solution?

I'll focus on a single problem: access time of a file stored in a directory with a lot of other files. I don't care about CPU usage, disk usage, fragmentation, paging, or recovery tools. When I read a file, how long will it take? Thats it. Surprisingly, I could not find good benchmark for that typical use case. Sure, there are many anecdotal tales of people who noticed a significant improvement with a new FS when they re-installed their mail server but they end up comparing completely different setups with a variable real world server load. Lack of measurements lead to counter productive faith based debates so here is my attempt to quantify the issue.

First of all, is there a problem with having many files in the same directory? If you have enough RAM, not really. The Linux kernel caches files attributes when it finds spare RAM. That means that you could ask for a file and instantly know its inode number without hitting the platters. In the initial version of this benchmark, I didn't flush the file system cache and file access time was almost constant no matter how many files there was in a directory. But for most people, there isn't enough RAM so here is the improved benchmark.

For various file systems, I formated a partition then filled it with many 1 Kb files for three different definitions of "many": 216, 218, and 220. Then I did typical operations: create files, delete files, open files (seek only), and read files (seek+read), taking care of flushing the file system cache before each test. Each operation was repeated 16384 (214) times.

The default file system on GNU/Linux is ext3. There was a time when it implemented directory indices with a linked list and for some reason, this option is still available. Quite predictably, this flavor of ext3 have major problems when directories contain too many files.

216 218 220
create 108.29s 279.12s 969.44s
delete 26.10s 146.82s 424.64s
seek 21.03s 116.61s 365.55s
seek+read 150.53s 280.87s 504.70s

ext3 without dir_index

Fortunately, modern ext3 implementations use B-trees for directory indices. This option is the default on virtually all non baroque GNU/Linux distribution so if you didn't twist any knobs at installation time, this is what you have.

216 218 220
create 3.93s 62.23s 179.77s
delete 17.75s 113.42s 272.55s
seek 12.62s 79.66s 192.78s
seek+read 144.61s 241.12s 355.39s

ext3 with dir_index

With dir_index, ext3 is much faster at creating and deleting but it still slows down noticeably when we reach the million files mark. When looking for the state of the art in file systems, people often point to Reiser FS. The catch is that the really awesome file system is Reiser-4 while most distribution only ships with Reiser-3.

216 218 220
create 16.87s 43.58s 119.04s
delete 125.22s 224.19s 347.85s
seek 74.10s 124.59s 190.53s
seek+read 102.37s 179.56s 249.81s

Reiser FS 3

Under a million files, ext3 is as good as Reiser-3. I'm not sure I'd advise someone to switch to Reiser-3 unless he is looking at some other metric like disk usage. I decided to also test XFS but I admit that this test isn't completely fair. XFS has many tunable parameters but I simply went with the defaults.

216 218 220
create 203.04s 181.30s 350.25s
delete 196.99s 357.18s 550.56s
seek 18.06s 89.35s 231.65s
seek+read 120.09s 197.56s 336.77s

XFS with defaults

XFS behaves much like Reiser-3: there is a small speedup on file access but it's at the cost of a much slower index manipulation.

There is yet another option beside changing the file system: hashed paths. The idea is to transparently distribute the files into subdirectories to prevent too many files from being stored together. By using a hash function, usually applied to the content of the file, we ensures an even distribution of the file names. The hash is then split into subdirectories. As an example, if a file hashes to f6aa350811263c, it will be stored in f/f6/f6aa350811263c. This effectively reduces the number or files per directory by 256, on average. Obviously, there is a non trivial overhead with hashed path and it involves hacking the application code. Is it really worth it? I benchmarked the implementation from Beaker, a session framework for Python.

216 218 220
create 44.22s 136.79s 257.82s
delete 23.02s 122.29s 226.26s
seek 19.23s 76.48s 149.32s
seek+read 171.86s 237.16s 317.47s

ext3 without dir_index, hashed paths
216 218 220
create 53.14s 156.51s 312.89s
delete 27.91s 154.68s 302.12s
seek 24.94s 95.74s 209.73s
seek+read 176.87s 259.22s 376.76s

ext3 with dir_index, hashed paths
216 218 220
create 29.55s 80.72s 161.47s
delete 162.58s 291.55s 461.67s
seek 100.00s 167.54s 271.79s
seek+read 133.25s 224.31s 345.36s

Reiser-3, hashed paths
216 218 220
create 206.31s 374.71s 608.08s
delete 168.65s 340.82s 552.25s
seek 24.52s 93.04s 202.19s
seek+read 165.73s 213.31s 293.43s

XFS, defaults, hashed paths

Unless you have ext3 without dir_index, and there is a much better solution for that, the overhead of hashed paths is slowing you down. Beaker's implementation is lazy: it creates the subdirectories as it needs them so it needs to do a few extra checks every time it looks for a file. An implementation that expects the full subdirectories tree to exists would save a few steps but this is missing the big picture: if you have problems with too many files, you need to stop hitting the disk every time you are looking for an inode number. Not flushing the FS cache, that is, simulating the effect of more RAM, beats any other solution.

216 218 220
create 0.98s 29.32s 90.91s
delete 0.83s 3.46s 18.91s
seek 0.16s 0.16s 49.71s
seek+read 0.56s 2.58s 144.03s

ext3 with dir_index, with FS cache
216 218 220
create 1.26s 32.54s 109.83s
delete 2.51s 20.32s 22.19s
seek 0.40s 0.41s 56.93s
seek+read 0.83s 2.16s 168.83s

ext3 with dir_index, with FS cache, hached paths

It seems after all that ext3 is really good. So, if you end up storing lots of files in the same directory, the only thing you have to do it to ensure that your ext3 file system has dir_index activated. You can verify that with tune2fs -l /dev/foo. If you don't have dir_index, don't panic. You can fix that from a live CD with

    tune2fs -O dir_index /dev/foo
    e2fsck -D /dev/foo

Don't take my word for it, run the benchmark yourself and hack it to measure the usage that worries you. And put more RAM in your boxen, that always works.

update: By popular demand, I benchmarked JFS. The index manipulation is really slow and the rest is somewhat comparable to XFS. It's said to excel with large file; definitely a specialized file system that should not be used for general usage. I won't pretend that a million files per directory is general usage but JFS noticeably slower even with 216 files. I'll try to benchmark Reiser 4 if I find a moment.

216 218 220
create 92.26s 273.34s 491.12s
delete 144.20s 322.60s 495.35s
seek 62.87s 183.33s 272.39s
seek+read 155.19s 270.21s 350.61s

JFS

Comments

2007-12-26 02:34:46 by shazow (direct link | reply)

I highly recommend trying out reiser4. We did some informal benchmarking and it was significantly faster than anything else. It's ridiculous.

In terms of beta-ness. Its been in "beta" since about 2005, it's about as polished as any other filesystem. I've been using it on my laptop for over a year with various nasty-for-hard-drives habits like shutting it down abruptly, and it never lost a beat. And overall performance "feels" noticeably faster than plain ext3.

-- shazow (on #pylons occasionally)

2007-12-26 09:18:51 by Yannick (direct link | reply)

I don't have a box with a spare partition that I can reboot at the moment. Could you run the benchmark for mkfs.raiser4 -y and mkfs.ext3 -O dir_index? I'll normalize your results and post them with the others.

2007-12-27 14:58:32 by ted (direct link | reply)

reiser seems like a deadend to me -- I don't really see how it can continue with its namesake in jail and namesys running (ran?) out of money.

I mean, I know its open source and everything, but its never /really/ been open source in the sense of free community development -- its always been bankrolled by grant money from namesys...

2007-12-27 10:20:53 by Alex (direct link | reply)

I'd be interested to see how JFS compares - it's remarkable faster than ext3 for large (>1GB) file creation and deletion, but I haven't done any benchmarking the "too many files" issue.

2007-12-27 15:12:07 by Yannick (direct link | reply)

I just added the benchmark times for JFS.

2007-12-27 11:40:14 by Ponny (direct link | reply)

What about ZFS performance? THAT's the hot fs of today.

2007-12-27 15:50:40 by jeff (direct link | reply)

In the hashed version, why use the first letter twice - that is,you have something "f6aa<something>" hashed to "f/f6/...". This seems odd as it requires two directory lookups, on information that is replicated. Notice that the "f6" level will contain exactly the same files as if there were no "f/" level. Why not eliminate one level of the directory and just use "f6/..." (and eliminate one lookup) or even (if there are enough files) use "f6/aa/..." - here you'd have two lookups, but the data subdirectory would be much smaller.

2007-12-27 16:55:24 by Yannick (direct link | reply)

Having too many directories is not better but this is all configurable and a large site probably need to tune it a lot. As an example, you can pass the hash in base-64 instead of in hex; you get 4096 less files per directory instead of 256. But fetching a directory index is still a round trip to the platters so you don't want so many directories that you almost never see two files in a row in the same directory. If two files are in the same directory, you have the index in RAM so you can skip a physical lookup.

2007-12-27 18:05:47 by Ian (direct link | reply)

I did a little research on the small-file-performance problem myself recently.

My conclusions were:

  • for every benchmark report, there's another one that conflicts with it
  • for every operation that an FS does well, there's another one that it sucks at

My particular test was ext3 vs. Reiser3 vs. XFS. I timed how long it took to expand kernel source trees and Maildirs. Reiser3 won by an order of magnitude. It also sucked for a lot of other operations.

I'd try Reiser4, but it's not supported (easily) in Ubuntu, so it's automatically too hard.

ZFS is not particularly fast for any operations, especially under Linux with the FUSE layer. I'd be more worried about the ongoing memory leaks, to be honest - my fileserver gets rebooted every three days.

2007-12-27 19:10:22 by Ethan Herdrick (direct link | reply)

When you say "Each operation was repeated 16384 (2^14) times" do you mean that each operation was done on 2^14 files? Or do you mean that each operation was done on all the files in the directory in question 2^14 times?

2007-12-27 19:35:36 by Yannick (direct link | reply)

I posted the code (last para) so you can verify what is measured exactly. Each operation was performed on 2^14 different files.

2007-12-27 22:52:01 by Rob Mueller (direct link | reply)

One big thing about reiserfs, did you enable or disable tails support? (eg notail in the mount options). Enabling tails makes lots of small files much more compact, but there's a definite performance penalty. I'd recommend you try both with and without to see the difference.

Also, none of this testing really actually helps in real world scenarios. You're starting with a clean filesystem and doing a single set of creates/writes/reads/deletes.

Also consider what happens to files over time. We run email servers, and that means that files are always being added/deleted continuously over time (about 5-10% turnover per-week by volume stored on average). That means your FS gets fragmented and tortured over time. We've found in that scenario that reiserfs with notails easily beats ext3, even with dir_indexes enabled.

2007-12-28 03:18:53 by Sébastien Pierre (direct link | reply)

Regarding ReiserFS 4, or installing any "beta" file system, I recommend running Rugg <http://rugg.sf.net> which really helps in exercising your hard drive + filesystem and see if your installation has a chance to fail or not. It can be reassuring for people in doubt.

2007-12-28 05:30:22 by fcicq (direct link | reply)

Reiser4 is fast. You can even enable cryptcompress(because DATA = chr(0)*SIZE).

2007-12-28 13:42:50 by bayareaguy (direct link | reply)

FreeDup from http://software.neuper.de/freedup/ uses similar ideas to replace redundant copies of files with hard links. I wouldn't recommend using it unless you understand how hard links work. That said, I've found it works suprisingly well on my OSX 10.5 laptop.

2008-01-01 15:09:14 by James Richardson (direct link | reply)

Hashed paths is quite an old idea, I've used it in the sense of giving each document an id number, and converting that into a base 51 number, using a as 0 and then a-zA-Z as the digits.

Format the number to a mimimum number of digits - like 51 = aaaaaaaZ, and reverse it Zaaaaaaa, shift one right, then place a / in between each digit, and add a document root /documents/prod/Z/a/a/a/a/a/a/51.doc.

Thus you can store your docs on up to 52 different partitions.. There are a limited number of levels to traverse, limited to 52 files (in a leaf) or directories (non-leaf) in each directory. You can have many many millions of files before your tree gets very full.

Leave a comment