
Improve performance of moving a growing, large number of files in a mounted folder

This is my situation:

I have a Windows network share, which I mount with mount -t cifs -o username=username,password=password,rw,nounix,iocharset=utf8,file_mode=0777,dir_mode=0777 //192.168.1.120/storage /mnt/storage

This folder contains a very rapidly growing number of files of various sizes (a few bytes up to ~20MB). If the files are not moved or deleted, the number of files in this directory may exceed 10 million.

I am required to move batches (in move_script) of files with a specific name (*.fext) from this directory to another directory (currently a subfolder in the directory /mnt/storage/in_progress).

Then the script starts another script (process_script), which processes the files in /mnt/storage/in_progress. After process_script has finished, the files are moved again by move_script to another subdirectory (/mnt/storage/done). The move-process-move cycle continues until the source folder (/mnt/storage) contains no more files.

Additional information of the process:

  • the current bottleneck is the moving of the files (the files are moved a little faster than the files are created in the directory)

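    # Assumes: import os; from shutil import move; batch_size is defined elsewhere.
    if len(os.listdir("/mnt/storage")) >= batch_size:
        i = 0
        for f in os.listdir("/mnt/storage"):
            if f.endswith(".fext"):
                # Move one matching file into the processing folder.
                move("/mnt/storage/" + f, "/mnt/storage/in_progress")
                i += 1
            if i == batch_size:
                break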
    
  • the script that moves the files and starts the processing waits for the processing to finish

  • processing of the files in /mnt/storage/in_progress is fastest with batches of 1k-2k files.

  • I have tried growing the batch size dynamically: first move 1k files, then, if the number of files in the source directory keeps growing, double the number of files moved per batch. This slows down the processing of the files in process_script, but helps to keep up with the "file-generator".

  • I considered simply renaming the subdirectory /mnt/storage/in_progress to "/mnt/storage/done"+i_counter after process_script has finished, and then creating a new /mnt/storage/in_progress. I assume this would halve the move time in the script.
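The loop shown in the first bullet lists /mnt/storage twice per batch, and with millions of entries each listing is itself expensive on a network mount. Here is a minimal sketch that combines a single lazy directory pass with the directory-rotation idea from the last bullet. It assumes Python 3.6+ (for os.scandir as a context manager) and that source and destination are on the same mount, so os.rename does not have to copy data. The names pick_batch, move_batch and rotate_in_progress, and the done_<n> naming, are hypothetical.

```python
import os


def pick_batch(src_dir, suffix, batch_size):
    """Collect up to batch_size file names ending in suffix from src_dir."""
    names = []
    # os.scandir yields entries lazily, so we can stop after one batch
    # instead of building a list of the whole directory first.
    with os.scandir(src_dir) as entries:
        for entry in entries:
            if entry.name.endswith(suffix):
                names.append(entry.name)
                if len(names) == batch_size:
                    break
    return names


def move_batch(src_dir, dst_dir, suffix=".fext", batch_size=1000):
    """Rename one batch of files into dst_dir; return how many were moved."""
    names = pick_batch(src_dir, suffix, batch_size)
    for name in names:
        os.rename(os.path.join(src_dir, name), os.path.join(dst_dir, name))
    return len(names)


def rotate_in_progress(base_dir, counter):
    """Rename in_progress to done_<counter> and create a fresh in_progress."""
    in_progress = os.path.join(base_dir, "in_progress")
    done = os.path.join(base_dir, "done_{}".format(counter))
    os.rename(in_progress, done)  # one rename instead of one per file
    os.mkdir(in_progress)
    return done
```

With this layout, the second per-file move (in_progress to done) is replaced by a single directory rename, so only the first move still costs one operation per file.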

I need to speed up the process in order to keep up with the file-generator. How can I increase the performance of this move operation?

I'm open to any suggestions and willing to completely change my current approach.

edit: The scripts run on Debian wheezy, so in theory I could use a subprocess issuing mv, but I have no idea how reasonable that would be.
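If mv is called from a subprocess, starting one process per file adds process-start overhead to every single move. Here is a hedged sketch that passes many files to each mv invocation instead, assuming a POSIX mv on the PATH. The name mv_in_chunks and the chunk_size default are hypothetical, and the chunk size is only a guess meant to stay below the system's argument-length limit.

```python
import os
import subprocess


def mv_in_chunks(src_dir, dst_dir, names, chunk_size=500):
    """Move the named files with one mv process per chunk.

    Returns the number of mv processes that were started.
    """
    chunks = 0
    for start in range(0, len(names), chunk_size):
        chunk = names[start:start + chunk_size]
        paths = [os.path.join(src_dir, n) for n in chunk]
        # Argument list without a shell: spaces or wildcards in file names
        # are passed through literally. "--" ends option parsing.
        subprocess.check_call(["mv", "--"] + paths + [dst_dir])
        chunks += 1
    return chunks
```

Whether this beats os.rename on the CIFS mount is something only a measurement on the real share can tell; mv still has to issue one rename per file.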

==========================================

edit2: I wrote a script to test the speed differences between the various methods of moving. I first tested with 1x5GB (dd if=/dev/urandom of=/mnt/storage/source/test.file bs=100M count=50), then with 100x5MB (for i in {1..100}; do dd if=/dev/urandom of=/mnt/storage/source/file$i bs=1M count=5; done) and finally with 10000x5kB (for i in {1..100000}; do dd if=/dev/urandom of=/mnt/storage/source/file$i bs=1k count=5; done)

from shutil import move, rmtree  # rmtree is used further down
from os import rename
from datetime import datetime
import subprocess
import os

print("Subprocess mv: for every file in directory..")
s = datetime.now()
for f in os.listdir("/mnt/storage/source/"):
    try:
        subprocess.call("mv /mnt/storage/source/"+str(f)+" /mnt/storage/mv", shell=True)
    except Exception as e:
        print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")

print("Subprocessmv : directory/*..")
s = datetime.now()
try:
    subprocess.call("mv /mnt/storage/mv/* /mnt/storage/mvf", shell=True)
except Exception as e:
    print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")


print("shutil.move: for every file file in directory..")
s = datetime.now()
for f in os.listdir("/mnt/storage/mvf/"):
    try:    
        move("/mnt/storage/mvf/"+str(f),"/mnt/storage/move")
    except Exception as e:
        print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")

print("os.rename: for every file in directory..")
s = datetime.now()
for f in os.listdir("/mnt/storage/move/"):
    try:
        rename("/mnt/storage/move/"+str(f),"/mnt/storage/rename/"+str(f))
    except Exception as e:
        print(str(e))
e = datetime.now()
print("took {}".format(e-s)+"\n")


if os.path.isdir("/mnt/storage/rename_new"):
    rmtree('/mnt/storage/rename_new')
print("os.rename & os.mkdir: rename source dir to destination & make new source dir..")
s = datetime.now()
rename("/mnt/storage/rename/","/mnt/storage/rename_new")
os.mkdir("/mnt/storage/rename/")
e = datetime.now()
print("took {}".format(e-s)+"\n")

This revealed that there is not much of a difference between the methods. The 5GB file was moved really fast, which tells me that moving by altering the file table works. Here are the results for the 10000*5kB files. The results seemed to depend on the current network workload: e.g. the first mv test took 2m 28s, then later with the same files 3m 22s. os.rename() was also the fastest method most of the time.

Subprocess mv: for every file in directory..
took 0:02:47.665174

Subprocessmv : directory/*..
took 0:01:40.087872

shutil.move: for every file file in directory..
took 0:01:48.454184

os.rename: for every file in directory..
rename took 0:02:05.597933

os.rename & os.mkdir: rename source dir to destination & make new source dir..
took 0:00:00.005704
Daedalus Mythos asked Jan 17 '26 02:01



1 Answer

You can simplify the code by using the glob module to list the files. But most likely the limiting factor is the network: the files probably end up being copied over the network instead of simply being moved. Otherwise the process would be very fast.

Try using os.rename() to move the files. It may not work on a cifs filesystem, but it's worth a try. That should do an actual rename, not a copy. If it doesn't work, you may need to mount that filesystem some other way. Or run the moving process on the machine where the filesystem exists.
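A minimal sketch of both suggestions, using glob.iglob to select the files and os.rename to move them. It assumes source and destination live on the same mounted share, and the helper name rename_matching is hypothetical. Note that shutil.move already tries os.rename first and only copies when the rename fails, which would be consistent with the similar timings measured in the question.

```python
import glob
import itertools
import os


def rename_matching(src_dir, dst_dir, pattern="*.fext", limit=None):
    """Rename files matching pattern from src_dir into dst_dir.

    Returns the number of files moved. os.rename raises OSError when the
    rename is not possible (for example across filesystems) rather than
    falling back to a copy.
    """
    matches = glob.iglob(os.path.join(src_dir, pattern))
    # Take a snapshot of at most `limit` paths before renaming anything,
    # so the directory is not modified while it is being listed.
    paths = list(itertools.islice(matches, limit))
    for path in paths:
        os.rename(path, os.path.join(dst_dir, os.path.basename(path)))
    return len(paths)
```

If os.rename raises OSError here, that is a strong hint that the move cannot be done as a pure rename on this mount and that data is being copied instead.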

Lennart Regebro answered Jan 19 '26 16:01



