Strategy for mass storage of small files

What is a good strategy for mass storage of millions of small files (~50 KB on average), with auto-pruning of files older than 20 minutes? I need to write and access them from the web server.

I am currently using ext4, and during deletes (scheduled in cron) HDD usage spikes up to 100%, with [flush-8:0] showing up as the process that creates the load. This load interferes with other applications on the server. When there are no deletes, maximum HDD utilisation is 0-5%. The situation is the same with nested and non-nested directory structures. The worst part is that mass removal during peak load seems to be slower than the rate of insertion, so the number of files that need to be removed grows larger and larger.

I have tried changing I/O schedulers (deadline, cfq, noop); it didn't help. I have also tried running the removal script under ionice, but it didn't help either.
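
For reference, the pruning logic is roughly along the lines of this sketch (the storage path is illustrative; in practice it runs from cron, optionally wrapped in ionice -c3):

    #!/usr/bin/env python
    # Sketch of a cron-driven pruning pass: delete files older than 20 minutes.
    # STORAGE_DIR is illustrative; the script can be wrapped in "ionice -c3".
    import os
    import time

    STORAGE_DIR = "/var/www/storage"   # illustrative path
    MAX_AGE = 20 * 60                  # 20 minutes in seconds

    now = time.time()
    for root, dirs, files in os.walk(STORAGE_DIR):
        for name in files:
            path = os.path.join(root, name)
            try:
                if now - os.path.getmtime(path) > MAX_AGE:
                    os.remove(path)
            except OSError:
                pass  # the file may already be gone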

I have tried GridFS with MongoDB 2.4.3: it performs nicely in general, but horribly during mass deletes of old files. I have tried running MongoDB with journaling turned off (nojournal) and without write confirmation for both deletes and inserts (w=0), and it didn't help. It only works fast and smoothly when there are no deletes going on.
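
Roughly, the GridFS usage was along these lines (connection string and database name are illustrative):

    # Sketch of the GridFS attempt: unacknowledged writes/deletes (w=0)
    # and pruning by uploadDate. Connection details are illustrative.
    import datetime
    import gridfs
    from pymongo import MongoClient

    db = MongoClient("mongodb://localhost:27017", w=0).filestore
    fs = gridfs.GridFS(db)

    def store(data):
        return fs.put(data)           # returns the new file's ObjectId

    def prune():
        cutoff = datetime.datetime.utcnow() - datetime.timedelta(minutes=20)
        for doc in db.fs.files.find({"uploadDate": {"$lt": cutoff}}, {"_id": 1}):
            fs.delete(doc["_id"])     # the step that causes the heavy load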

I have also tried storing the data in MySQL 5.5, in a BLOB column, in an InnoDB table, with InnoDB configured with innodb_buffer_pool_size=2G, innodb_log_file_size=1G, innodb_flush_log_at_trx_commit=2, but performance was worse: HDD load was always at 80%-100% (expected, but I had to try). The table only had a BLOB column, a DATETIME column and a CHAR(32) latin1_bin UUID column, with indexes on the UUID and DATETIME columns, so there was no room for optimization and all queries used indexes.
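
For reference, the table was roughly equivalent to this sketch (column names, database name and credentials are illustrative):

    # Sketch of the InnoDB table described above; names and credentials
    # are illustrative only.
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="filestore")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE IF NOT EXISTS blobs (
            uuid       CHAR(32) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL,
            created_at DATETIME NOT NULL,
            body       BLOB NOT NULL,
            PRIMARY KEY (uuid),
            KEY idx_created_at (created_at)
        ) ENGINE=InnoDB
    """)
    conn.commit()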

I have looked into pdflush settings (the Linux flush process that creates the load during mass removal), but changing the values didn't help, so I reverted to the defaults.

It doesn't matter how often I run the auto-pruning script: every second, every minute, every 5 minutes, every 30 minutes, it disrupts the server significantly either way.

I have also tried storing each file's inode number and, when removing, deleting the old files sequentially after sorting them by inode number, but it didn't help.
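
Roughly, that inode-ordered removal looked like this sketch (the path is illustrative):

    # Sketch: collect expired files, then remove them in ascending inode
    # order so the metadata updates are as sequential as possible.
    import os
    import time

    STORAGE_DIR = "/var/www/storage"   # illustrative path
    MAX_AGE = 20 * 60

    expired = []
    now = time.time()
    for root, dirs, files in os.walk(STORAGE_DIR):
        for name in files:
            path = os.path.join(root, name)
            try:
                st = os.stat(path)
            except OSError:
                continue
            if now - st.st_mtime > MAX_AGE:
                expired.append((st.st_ino, path))

    for _, path in sorted(expired):    # ascending inode order
        try:
            os.remove(path)
        except OSError:
            pass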

I am using CentOS 6. The storage is an SSD RAID 1 array.

What would be a good and sensible solution for my task that solves the auto-pruning performance problem?

asked by Atm

2 Answers

If mass-removing millions of files results in a performance problem, you can resolve it by "removing" all files at once. Instead of using any per-file filesystem operation (like "remove" or "truncate"), you can simply create a new (empty) filesystem in place of the old one.

To implement this idea you need to split your drive into two (or more) partitions. After one partition is full (or after 20 minutes), you start writing to the second partition while using the first one for reading only. After another 20 minutes, you unmount the first partition, create an empty filesystem on it, mount it again, and then start writing to the first partition while using the second one for reading only.
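
A minimal sketch of the rotation step, assuming ext4, two partitions and mount points /data/a and /data/b (device names and mount points are assumptions; requires root):

    # Sketch: "delete" a whole 20-minute window by recreating the filesystem
    # on the partition whose files have all expired, then write there again.
    # Device paths and mount points are assumptions.
    import subprocess

    def recreate(device, mountpoint):
        subprocess.check_call(["umount", mountpoint])
        subprocess.check_call(["mkfs.ext4", "-F", device])   # near-instant "mass delete"
        subprocess.check_call(["mount", device, mountpoint])

    # Rotate: /data/a becomes the fresh write target while /data/b keeps
    # serving reads of the previous window.
    recreate("/dev/sdb1", "/data/a")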

The simplest solution is to use just two partitions. But this way you don't use disk space very efficiently: you can store only half as many files on the same drive. With more partitions you can increase space efficiency.

If for some reason you need all your files in one place, use tmpfs to store links to the files on each partition. This still requires mass-removing millions of links from tmpfs, but it alleviates the performance problem because only the links need to be removed, not the file contents; also, these links are removed only from RAM, not from the SSD.
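
A minimal sketch of such a link index, assuming tmpfs is mounted at /data/index (the paths are assumptions):

    # Sketch: a flat tmpfs directory of symlinks pointing at the real files,
    # whichever partition they currently live on. Paths are assumptions.
    import os

    INDEX_DIR = "/data/index"   # e.g. mounted with: mount -t tmpfs tmpfs /data/index

    def publish(name, real_path):
        # Expose the file under a stable name regardless of its partition.
        os.symlink(real_path, os.path.join(INDEX_DIR, name))

    def drop_links(names):
        # Removing links only touches RAM; the data itself disappears when
        # the underlying partition is re-formatted.
        for name in names:
            try:
                os.unlink(os.path.join(INDEX_DIR, name))
            except OSError:
                pass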

answered by Evgeny Kluev


Deletions are kind of a performance nuisance because both the data and the metadata need to get destroyed on disk.

Do they really need to be separate files? Do the old files really need to get deleted, or is it OK if they get overwritten?

If the answer is "no" to the second of these questions, try this:

  • Keep a list of files that's roughly sorted by age. Maybe chunk it by file size.
  • When you want to write a new file, find an old file, preferably one that's bigger than what you're replacing it with. Instead of blowing away the old file, truncate() it to the appropriate length and then overwrite its contents (see the sketch after this list). Make sure you update your old-files list.
  • Once in a while, explicitly clean up the really old stuff that hasn't been replaced.
  • It might be advantageous to have an index into these files. Try using a tmpfs full of symbolic links to the real file system.
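
A minimal sketch of the truncate-and-overwrite idea from the list above; the bookkeeping around the old-files list is an assumption:

    # Sketch: reuse an expired file's inode instead of unlinking it.
    # old_path comes from a roughly age-sorted list of existing files.
    import os

    def reuse_old_file(old_path, new_path, data):
        with open(old_path, "r+b") as f:
            f.truncate(len(data))        # shrink the allocation to the new length
            f.seek(0)
            f.write(data)                # overwrite contents in place
        os.rename(old_path, new_path)    # give it the new file's name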

You might or might not get a performance advantage in this scheme by chunking the files into manageably-sized subdirectories.

If you're OK with multiple things being in the same file:

  • Keep files of similar sizes together by storing each one at an offset into a container file of similarly-sized slots. If every file is 32k or 64k, keep one file full of 32k chunks and one full of 64k chunks. If files are of arbitrary sizes, round up to the next power of two.
  • You can do lazy deletes here by keeping track of how stale each slot is. If you're trying to write and something's stale, overwrite it instead of appending to the end of the file (see the sketch below).
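
A minimal sketch of one such size class (a single container file of fixed 64 KB slots) with lazy deletes; the slot bookkeeping and paths are assumptions:

    # Sketch: one size class stored as fixed 64 KB slots in a container file;
    # "deleting" a blob just marks its slot stale so a later write reuses it.
    import os
    import time

    SLOT_SIZE = 64 * 1024
    CONTAINER = "/var/www/storage/slots_64k.dat"   # illustrative path
    MAX_AGE = 20 * 60

    live = {}         # slot index -> timestamp of last write
    free_slots = []   # indices whose contents have expired

    def reclaim():
        now = time.time()
        for idx, ts in list(live.items()):
            if now - ts > MAX_AGE:
                del live[idx]
                free_slots.append(idx)   # lazy delete: no disk I/O at all

    def write_blob(data):
        assert len(data) <= SLOT_SIZE
        reclaim()
        idx = free_slots.pop() if free_slots else len(live)
        mode = "r+b" if os.path.exists(CONTAINER) else "w+b"
        with open(CONTAINER, mode) as f:
            f.seek(idx * SLOT_SIZE)
            f.write(data.ljust(SLOT_SIZE, b"\0"))
        live[idx] = time.time()
        return idx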

Another thought: do you get a performance advantage by truncate()ing all of the files to length 0 in inode order and then unlink()ing them? I don't know whether this actually helps, but it seems like it would keep the data zeroing grouped together and the metadata writes similarly grouped.

Yet another thought: XFS has a weaker write ordering model than ext4 with data=ordered. Is it fast enough on XFS?

answered by tmyklebu