 

What is the best and fastest way to delete a large directory containing thousands of files (in Ubuntu)?

Tags: linux, bash

As far as I know, commands like

find <dir> -type f -exec rm {} \;

are not the best option for removing a large number of files (total files, including those in subfolders). They work fine with a small number of files, but with 10+ million files spread across subfolders they can hang a server.

Does anyone know of any specific Linux commands to solve this problem?

asked Jul 05 '12 by itereshchenkov



2 Answers

It may seem strange, but:

$ rm -rf <dir>
answered Nov 10 '22 by Igor Chubin



Here's an example bash script:

#!/bin/bash

LOCKFILE=/tmp/rmHugeNumberOfFiles.lock   # 'local' only works inside functions, so use a plain variable

# this process gets ultra-low priority
ionice -c2 -n7 -p $$ > /dev/null
if [ $? -ne 0 ]; then
    echo "Could not set disk IO priority. Exiting..."
    exit
fi
renice +19 -p $$ > /dev/null
if [ $? -ne 0 ]; then
    echo "Could not renice process. Exiting..."
    exit
fi

# check if there's an instance running already. If so--exit
if [ -e "${LOCKFILE}" ] && kill -0 "$(cat "${LOCKFILE}")" 2>/dev/null; then
    echo "An instance of this script is already running."
    exit
fi

# make sure the lockfile is removed when we exit. Then: claim the lock
trap "command rm -f -- '$LOCKFILE'; exit" INT TERM EXIT
echo $$ > "$LOCKFILE"

# also create a tempfile; extend the trap so the tempfile AND the lockfile
# are both cleaned up on exit (a new trap replaces the previous one, so it
# must list everything that needs removing)
tmp=$(tempfile) || exit
trap "command rm -f -- '$tmp' '$LOCKFILE'; exit" INT TERM EXIT



# ----------------------------------------
# option 1
# ----------------------------------------
# find your specific files
find "$1" -type f [INSERT SPECIFIC SEARCH PATTERN HERE] > "$tmp"
xargs rm -- < "$tmp"    # rm does not read file names from stdin; xargs passes them as arguments

# ----------------------------------------
# option 2
# ----------------------------------------
command rm -r "$1"



# remove the lockfile, tempfile
command rm -f -- "$tmp" "$LOCKFILE"

This script starts by setting its own process priority and disk I/O priority to very low values, so that other running processes are affected as little as possible.
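If you only need the low-priority behaviour without the rest of the script, a similar effect can be had by wrapping the delete command in ionice and nice directly. This is just a sketch; the directory path is a placeholder:

# run the deletion with minimal CPU and disk I/O priority
ionice -c2 -n7 nice -n 19 rm -rf /path/to/huge/dir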

Then it makes sure that it is the ONLY such process running.

The core of the script is really up to your preference. You can use rm -r if you are sure that the whole directory can be deleted indiscriminately (option 2), or you can use find for more selective file deletion (option 1, possibly using command-line arguments "$2" and onward for convenience).
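For illustration, assuming the script above is saved as rm_huge_dir.sh (the file name is arbitrary), it would be invoked with the target directory as its first argument:

$ bash rm_huge_dir.sh /path/to/huge/dir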

In the implementation above, option 1 (find) first writes everything to a tempfile, so that rm is invoked only a handful of times on large batches of file names instead of once per file found by find. When the number of files is indeed huge, this can amount to a significant time saving. On the downside, the size of the tempfile may become an issue, but that is only likely if you are deleting literally billions of files. Also, because the disk I/O has such a low priority, using a tempfile followed by a single rm pass may in total be slower than the find (...) -exec rm {} \; approach. As always, experiment a bit to see what best fits your needs.

EDIT: As suggested by user946850, you can also skip the tempfile entirely and use find (...) -print0 | xargs -0 rm, shown below. This has a larger memory footprint, since the full paths of all matching files are held in RAM until the find command has completely finished. On the upside, there is no additional file I/O from writing to the tempfile. Which one to choose depends on your use case.
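For reference, a minimal sketch of that tempfile-free variant (the path is a placeholder; add your own search pattern before -print0 as needed):

find /path/to/huge/dir -type f -print0 | xargs -0 rm --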

answered Nov 10 '22 by Rody Oldenhuis
