Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is it important to delete files in-order to remove them faster?

Some time ago I learned that rsync deletes files much faster that many other tools.

A few days ago I came across this wonderful answer on Serverfault which explains why rsync is so good at deleting files.

Quotation from that answer:

I revisited this today, because most filesystems store their directory structures in a btree format, the order of which you delete files is also important. One needs to avoid rebalancing the btree when you perform the unlink. As such I added a sort before deletes occur.

Could you explain how does removing files in-order prevents or reduces the number of btree rebalancings?


I expect the answer to show how deleting in order increase deletion speed, with details of what happens at btree level. People, who wrote rsync and another programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding to be able to write better soft.

like image 764
ovgolovin Avatar asked Jul 30 '13 19:07

ovgolovin


People also ask

Why is it important to delete files?

Examples of reasons for deleting files are: Freeing the disk space. Removing duplicate or unnecessary data to avoid confusion. Making sensitive information unavailable to others.

Why is deleting files faster?

When deleting a file, most Operating Systems will mark the file as deleted, but not actually remove the data from the hard drive. This allows for a fast delete as the OS just has to set one flag and not touch any of the data. When copying a file, data actually has to be duplicated in order to copy a file.

Will deleting files speed up a computer?

Delete temporary files.Deleting them frees up valuable space on your hard disk and speeds up your computer.

What happens when you delete files?

When you first delete a file, it is moved to the computer's Recycle Bin, Trash, or something similar depending on your operating system. When something is sent to the Recycle Bin or Trash, the icon changes to indicate it contains files and if needed lets you recover a deleted file.


1 Answers

It is not important, nor b-tree issue. It is just a coincidence.

First of all, this is very much implementation dependent and very much ext3 specific. That's why I said it's not important (for general use). Otherwise, put the ext3 tag or edit the summary line.

Second of all, ext3 does not use b-tree for the directory entry index. It uses Htree. The Htree is similar to b-tree but different and does not require balancing. Search "htree" in fs/ext3/dir.c.

Because of the htree based index, a) ext3 has a faster lookup compare to ext2, but b) readdir() returns entries in hash value order. The hash value order is random relative to file creation time or physical layout of data. As we all know random access is much slower than sequential access on a rotating media.

A paper on ext3 published for OLS 2005 by Mingming Cao, et al. suggests (emphasis mine):

to sort the directory entries returned by readdir() by inode number.

Now, onto rsync. Rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().

I did not test the following hypothesis because I do not have the data sets from which @MIfe got 43 seconds. but I assume that sorted-by-name was much closer to the optimal order compare to the random order returned by readdir(). That was why you saw much faster result with rsync on ext3. What if you generate 1000000 files with random file names then delete them with rsync? Do you see the same result?

like image 175
Yasushi Shoji Avatar answered Oct 14 '22 18:10

Yasushi Shoji