 

How to quickly delete millions of files

Tags:

c++

boost

We use Boost 1.63's boost::filesystem::remove_all(dir_to_remove) to remove a folder that contains millions of files (each about 1 MB). The folder "dir_to_remove" has sub-folders, and each sub-folder contains no more than 1000 files. It takes more than 10 minutes to delete all the files. We use CentOS 6.5.

After checking operations.cpp, we realized that Boost actually uses the Linux rmdir and unlink system calls:

#   define BOOST_REMOVE_DIRECTORY(P) (::rmdir(P) == 0)
#   define BOOST_DELETE_FILE(P)      (::unlink(P) == 0)

This article lists several ways to delete files more efficiently on Linux, and it recommends using rsync.

How can we delete millions of files quickly with C++?

asked Apr 25 '18 by werk



3 Answers

If you want to free up a required location, the fastest way is to move (or rename) the directory to another location on the same partition. Your program can then continue working with the required location while another thread removes the previously moved directory recursively in the background. That thread can even run at lower priority, so removing a specific directory looks like an instantaneous filesystem operation.

answered Oct 17 '22 by 273K

Yeah, std::filesystem::directory_iterator is pretty borked. I'm looking to replace that facility entirely in the upcoming P1031 Low level file i/o (note: it won't be live on WG21 until June 2018) with something which scales well to input like this, so we're on it.

In the meantime, I'd suggest that you use https://ned14.github.io/afio/, which is the reference implementation for P1031, specifically directory_handle::enumerate(). This library handles directories with millions, even tens of millions, of files with ease. Once you have your list of entries to delete, you need to follow a B+-tree-friendly deletion pattern, i.e. sort them into either alphabetical or inode order, then do one of:

  1. Unlink from the first entry going forwards.
  2. Unlink from the last entry going backwards.
  3. Unlink from the first entry, then the last entry, moving towards the centre.

I'd benchmark all six approaches (the three unlink orders times the two sort keys) for your particular filing system, and choose whichever is the fastest. Some filesystems key their B+ trees on inode number, some on leafname; it varies. But basically you want to avoid excessive tree rebalancing and deep O(log N) lookups of the leafname, hence the ordered unlinks.

answered Oct 17 '22 by Niall Douglas


The article you link to takes the shell's perspective. That is critically important: the shell starts programs for many tasks, and while starting one program is very cheap, starting a million programs is expensive. That's why rsync is so effective: a single invocation does all the work.

The same advantage already applies to your program: you start it once, so the cost is simply all the syscalls you're making.

I checked the syscall list; there's no syscall that does a bulk removal, so you're limited to one syscall per file to remove.

answered Oct 17 '22 by MSalters