We use Boost 1.63's boost::filesystem::remove_all(dir_to_remove) to remove a folder containing millions of files (each file is about 1 MB). The folder "dir_to_remove" has sub-folders, and each sub-folder holds no more than 1000 files. Deleting all the files takes more than 10 minutes. We are on CentOS 6.5.
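A minimal reproduction of the slow call, with a timing wrapper (the directory path below is a placeholder for our actual tree):

```cpp
#include <boost/filesystem.hpp>
#include <chrono>
#include <iostream>

int main()
{
    // Placeholder path; in our case the tree holds millions of ~1 MB files.
    const boost::filesystem::path dir_to_remove("/data/dir_to_remove");

    const auto start = std::chrono::steady_clock::now();
    const boost::uintmax_t n = boost::filesystem::remove_all(dir_to_remove);
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;

    std::cout << "removed " << n << " entries in " << elapsed.count() << " s\n";
}
```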
After checking operations.cpp, we realized Boost actually uses the Linux rmdir and unlink calls:
```cpp
# define BOOST_REMOVE_DIRECTORY(P) (::rmdir(P) == 0)
# define BOOST_DELETE_FILE(P)      (::unlink(P) == 0)
```
This article lists several ways to delete files more efficiently on Linux, and it recommends using rsync.
How can we delete millions of files quickly with C++?
If you need to free up a particular location, the fastest way is to move (or rename) the directory to another location on the same partition. Your program can then continue working with the original location while another thread removes the previously moved directory recursively in the background. That thread can even run at lower priority, so removing a given directory looks like an instant filesystem operation.
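A minimal sketch of that idea, using std::filesystem for brevity (boost::filesystem offers the same rename/remove_all calls); the trash path is an assumption and must live on the same partition, or the rename will fail:

```cpp
#include <filesystem>
#include <system_error>
#include <thread>

namespace fs = std::filesystem;

// Free up `dir` almost instantly: rename it (a cheap metadata-only
// operation, provided `trash` is on the same partition), then delete
// the renamed tree in a detached background thread.
void remove_all_in_background(const fs::path& dir, const fs::path& trash)
{
    fs::rename(dir, trash);  // throws if the rename fails

    std::thread([trash] {
        std::error_code ec;         // swallow errors in the background
        fs::remove_all(trash, ec);  // the slow part, off the hot path
    }).detach();
}
```

Once rename() returns, the original path can be recreated and used immediately.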
Yeah, std::filesystem::directory_iterator is pretty borked. I'm looking to replace that facility entirely in the upcoming P1031 Low level file i/o (note it won't be live on WG21 until June 2018) with something which scales well to input, so we're on it.
In the meantime, I'd suggest that you use https://ned14.github.io/afio/, which is the reference implementation for P1031, specifically directory_handle::enumerate(). This library handles directories with millions, even tens of millions, of files with ease. Once you have your list of entries to delete, you need to follow a B+-tree-friendly deletion pattern, i.e. sort them into either alphabetical or inode order, then do one of:
I'd benchmark all six approaches for your particular filing system and choose whichever is fastest. Some filesystems key their B+ trees on inode number, some on leafname; it varies. But basically you want to avoid excessive tree rebalancing and deep O(log N) lookups of the leafname, hence the ordered unlinks. One such ordering is sketched below.
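As a rough illustration of one such ordering, expressed with plain POSIX calls rather than AFIO's API, for a single flat sub-folder: enumerate with readdir(), sort alphabetically, then unlink in sorted order:

```cpp
#include <algorithm>
#include <string>
#include <vector>

#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

// Delete every entry in one flat directory, unlinking in sorted
// (alphabetical) order to play nicely with B+-tree directory indexes.
// Returns true only if every unlink succeeded.
bool unlink_sorted(const char* dirpath)
{
    DIR* d = ::opendir(dirpath);
    if (!d)
        return false;

    std::vector<std::string> names;
    while (const dirent* e = ::readdir(d))
    {
        const std::string name = e->d_name;
        if (name != "." && name != "..")
            names.push_back(name);
    }
    std::sort(names.begin(), names.end());

    const int dfd = ::dirfd(d);  // unlink relative to the open directory
    bool ok = true;
    for (const std::string& name : names)
        ok = (::unlinkat(dfd, name.c_str(), 0) == 0) && ok;

    ::closedir(d);
    return ok;
}
```

Sorting by inode order instead would just mean capturing e->d_ino alongside each name and sorting on that.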
The article you link to talks about the shell perspective. That is critically important: the shell starts programs for many tasks, and while starting a program is very cheap, it can be expensive when you need to start a million programs. That's why rsync is so effective; a single invocation can do all the work.
The same already applies to your program. You start your program once; the cost is simply all the syscalls you're making.
I checked the syscall list; there is no syscall that does bulk removal, so you're limited to one removal syscall per file.
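To make that floor concrete, here is roughly what any recursive deleter (rm -rf, remove_all(), or your own code) boils down to: a depth-first walk issuing one unlink() or rmdir() per entry. The path is a placeholder:

```cpp
#include <ftw.h>
#include <sys/stat.h>
#include <unistd.h>

// One removal syscall per entry: unlink() for files, rmdir() for
// directories. FTW_DEPTH makes nftw() visit children before their
// parent, so every directory is empty by the time rmdir() runs.
static int remove_entry(const char* path, const struct stat*,
                        int typeflag, struct FTW*)
{
    return typeflag == FTW_DP ? ::rmdir(path) : ::unlink(path);
}

int main()
{
    // Placeholder path; FTW_PHYS avoids following symlinks.
    return ::nftw("/data/dir_to_remove", remove_entry, 64,
                  FTW_DEPTH | FTW_PHYS);
}
```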