Some time ago I learned that rsync
deletes files much faster that many other tools.
A few days ago I came across this wonderful answer on Serverfault which explains why rsync
is so good at deleting files.
Quotation from that answer:
I revisited this today, because most filesystems store their directory structures in a btree format, the order of which you delete files is also important. One needs to avoid rebalancing the btree when you perform the unlink. As such I added a sort before deletes occur.
Could you explain how does removing files in-order prevents or reduces the number of btree rebalancings?
I expect the answer to show how deleting in order increase deletion speed, with details of what happens at btree
level. People, who wrote rsync
and another programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding to be able to write better soft.
Examples of reasons for deleting files are: Freeing the disk space. Removing duplicate or unnecessary data to avoid confusion. Making sensitive information unavailable to others.
When deleting a file, most Operating Systems will mark the file as deleted, but not actually remove the data from the hard drive. This allows for a fast delete as the OS just has to set one flag and not touch any of the data. When copying a file, data actually has to be duplicated in order to copy a file.
Delete temporary files.Deleting them frees up valuable space on your hard disk and speeds up your computer.
When you first delete a file, it is moved to the computer's Recycle Bin, Trash, or something similar depending on your operating system. When something is sent to the Recycle Bin or Trash, the icon changes to indicate it contains files and if needed lets you recover a deleted file.
It is not important, nor b-tree issue. It is just a coincidence.
First of all, this is very much implementation dependent and very much ext3 specific. That's why I said it's not important (for general use). Otherwise, put the ext3 tag or edit the summary line.
Second of all, ext3 does not use b-tree for the directory entry index. It uses Htree. The Htree is similar to b-tree but different and does not require balancing. Search "htree" in fs/ext3/dir.c.
Because of the htree based index, a) ext3 has a faster lookup compare to ext2, but b) readdir()
returns entries in hash value order. The hash value order is random relative to file creation time or physical layout of data. As we all know random access is much slower than sequential access on a rotating media.
A paper on ext3 published for OLS 2005 by Mingming Cao, et al. suggests (emphasis mine):
to sort the directory entries returned by readdir() by inode number.
Now, onto rsync. Rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().
I did not test the following hypothesis because I do not have the data sets from which @MIfe got 43 seconds. but I assume that sorted-by-name was much closer to the optimal order compare to the random order returned by readdir(). That was why you saw much faster result with rsync on ext3. What if you generate 1000000 files with random file names then delete them with rsync? Do you see the same result?
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With