Why is it important to delete files in-order to remove them faster?

Tags: algorithm, delete-file, filesystems, b-tree

Some time ago I learned that rsync deletes files much faster than many other tools.

A few days ago I came across this wonderful answer on Serverfault which explains why rsync is so good at deleting files.

Quotation from that answer:

I revisited this today, because most filesystems store their directory structures in a btree format, the order of which you delete files is also important. One needs to avoid rebalancing the btree when you perform the unlink. As such I added a sort before deletes occur.

Could you explain how removing files in order prevents or reduces the number of btree rebalancings?

I expect the answer to show how deleting in order increases deletion speed, with details of what happens at the btree level. The people who wrote rsync and other programs (see links in the question) used this knowledge to create better programs. I think it's important for other programmers to have this understanding so they can write better software.

764

asked Jul 30 '13 19:07

ovgolovin

1 Answer

It is not important, nor is it a b-tree issue. It is just a coincidence.

First of all, this is very much implementation dependent and very much ext3 specific. That's why I said it's not important (for general use). Otherwise, put the ext3 tag or edit the summary line.

Second of all, ext3 does not use a b-tree for the directory entry index. It uses an Htree. The Htree is similar to a b-tree but different, and it does not require balancing. Search for "htree" in fs/ext3/dir.c.

Because of the htree-based index, a) ext3 has faster lookup compared to ext2, but b) readdir() returns entries in hash value order. The hash value order is random relative to file creation time or the physical layout of the data. As we all know, random access is much slower than sequential access on rotating media.

A paper on ext3 published for OLS 2005 by Mingming Cao, et al. suggests (emphasis mine):

to sort the directory entries returned by readdir() by inode number.
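A minimal sketch of that suggestion in Python, assuming a flat directory of regular files. `os.scandir()` exposes each entry's inode number through `DirEntry.inode()`, so the entries can be sorted by inode before unlinking. The function name `delete_sorted_by_inode` is hypothetical, and whether this is actually faster depends on the filesystem and the storage medium.

```python
import os


def delete_sorted_by_inode(directory):
    """Unlink every regular file in `directory`, visiting them in inode order.

    Returns the number of files removed. Subdirectories and symlinks are
    left untouched.
    """
    with os.scandir(directory) as it:
        # Collect (inode, path) pairs first; the listing itself arrives in
        # whatever order the filesystem's readdir() produces.
        entries = [
            (entry.inode(), entry.path)
            for entry in it
            if entry.is_file(follow_symlinks=False)
        ]

    # Sort by inode number, as the paper suggests, instead of readdir() order.
    entries.sort()

    for _inode, path in entries:
        os.unlink(path)
    return len(entries)
```

Collecting the whole listing before deleting also avoids modifying the directory while it is still being iterated.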

Now, onto rsync. Rsync sorts files by file name. See flist.c::fsort(), flist.c::file_compare(), and flist.c::f_name_cmp().

I did not test the following hypothesis because I do not have the data sets from which @MIfe got 43 seconds, but I assume that the sorted-by-name order was much closer to the optimal order than the random order returned by readdir(). That is why you saw a much faster result with rsync on ext3. What if you generate 1000000 files with random file names and then delete them with rsync? Do you see the same result?
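The experiment could be sketched roughly as follows in Python. This is an illustration under stated assumptions, not a measurement: the helper names (`make_files`, `timed_delete`, `compare_orders`) are hypothetical, a plain string sort stands in for rsync's own name comparison, and on a warm page cache or on non-rotating storage the three orderings may show little difference.

```python
import os
import tempfile
import time
import uuid


def make_files(directory, count):
    """Create `count` empty files with random (UUID-based) names."""
    for _ in range(count):
        open(os.path.join(directory, uuid.uuid4().hex), "w").close()


def timed_delete(directory, order):
    """Delete all files in `directory` in the given order; return seconds taken.

    order: "readdir" (as listed), "name" (sorted by file name), or
    "inode" (sorted by inode number).
    """
    with os.scandir(directory) as it:
        entries = [(e.name, e.inode(), e.path) for e in it]
    if order == "name":
        entries.sort(key=lambda e: e[0])
    elif order == "inode":
        entries.sort(key=lambda e: e[1])
    elif order != "readdir":
        raise ValueError(f"unknown order: {order}")

    # Only the unlink loop is timed, not the listing or the sort.
    start = time.perf_counter()
    for _name, _inode, path in entries:
        os.unlink(path)
    return time.perf_counter() - start


def compare_orders(count):
    """Run the experiment once per ordering, each in a fresh temp directory."""
    results = {}
    for order in ("readdir", "name", "inode"):
        with tempfile.TemporaryDirectory() as directory:
            make_files(directory, count)
            results[order] = timed_delete(directory, order)
    return results
```

Calling `compare_orders(1_000_000)` would mirror the file count proposed above. For the result to say anything about ext3, the temporary directory would have to live on an ext3 filesystem on rotating media, with caches dropped between file creation and deletion.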

175

answered Oct 14 '22 18:10

Yasushi Shoji
