 

How to quickly delete millions of files

Tags:

c++

boost

We use Boost 1.63's boost::filesystem::remove_all(dir_to_remove) to remove a folder that contains millions of files (each about 1 MB). The folder "dir_to_remove" has sub-folders, and each sub-folder contains no more than 1000 files. It takes more than 10 minutes to delete all the files. We use CentOS 6.5.

After checking operations.cpp, we realized that Boost actually uses the Linux rmdir and unlink system calls:

#   define BOOST_REMOVE_DIRECTORY(P) (::rmdir(P) == 0)
#   define BOOST_DELETE_FILE(P)      (::unlink(P) == 0)

This article lists several ways to delete files more efficiently on Linux, and it recommends using rsync.

How can we delete millions of files quickly with C++?

asked Apr 25 '18 by werk



3 Answers

If you want to free up a required location, the fastest way is to move (or rename) the directory to another location on the same partition. Your program can then continue working with the required location while another thread removes the previously moved directory recursively in the background. That thread can even run at lower priority, so removing a specific directory looks like an instantaneous filesystem operation.

answered Oct 17 '22 by 273K

Yeah, std::filesystem::directory_iterator is pretty borked. I'm looking to replace that facility entirely in the upcoming P1031 Low level file i/o (note: it won't be live on WG21 until June 2018) with something which scales well to input like this, so we're on it.

In the meantime, I'd suggest that you use https://ned14.github.io/afio/, which is the reference implementation for P1031, specifically directory_handle::enumerate(). This library handles directories with millions, even tens of millions, of files with ease. Once you have your list of entries to delete, you need to follow a B+-tree-friendly deletion pattern, i.e. sort them into either alphabetical or inode order, then do one of:

  1. Unlink from the first entry going forwards.
  2. Unlink from the last entry going backwards.
  3. Unlink from the first entry, then the last entry, moving towards the centre.

I'd benchmark all six approaches (the three unlink orders times the two sort keys) for your particular filing system, and choose whichever is the fastest. Some filesystems key their B+ trees on inode number, some on leafname; it varies. But basically you want to avoid excessive tree rebalancing and deep O(log N) lookups of the leafname, hence the ordered unlinks.

answered Oct 17 '22 by Niall Douglas


The article you link to takes the shell's perspective. That is critically important: the shell starts programs for many tasks, and while starting one program is very cheap, starting a million programs is expensive. That's why rsync is so effective: a single invocation does all the work.

The same advantage already applies to your program: you start it once, so the cost is simply all the syscalls you're making.

I checked the syscall list; there's no syscall that does a bulk removal, so you're limited to one syscall per file to remove.

answered Oct 17 '22 by MSalters