 

Python multiprocess/multithreading to speed up file copying

I have a program which copies large numbers of files from one location to another - I'm talking 100,000+ files (I'm copying 314 GB of image sequences at this moment). Both locations are on huge, VERY fast network storage, RAID'd in the extreme. I'm using shutil to copy the files over sequentially and it is taking some time, so I'm trying to find the best way to optimize this. I've noticed some software I use effectively multi-threads reading files off the network, with huge gains in load times, so I'd like to try doing this in Python.

I have no experience with multithreaded/multiprocess programming - does this seem like the right area to pursue? If so, what's the best way to do it? I've looked at a few other SO posts about threaded file copying in Python, and they all seemed to say that you get no speed gain, but I don't think that will be the case given my hardware. I'm nowhere near my IO cap at the moment and resource usage is sitting around 1% (I have 40 cores and 64 GB of RAM locally).

EDIT

Been getting some up-votes on this question (now a few years old), so I thought I'd point out one more thing to speed up file copies. In addition to the fact that you can easily get 8x-10x copy speeds using some of the answers below (seriously!), I have also since found that shutil.copy2 is excruciatingly slow for no good reason. Yes, even in Python 3+. It's beyond the scope of this question, so I won't dive into it here (it's also highly OS- and hardware/network-dependent), beyond mentioning that by tweaking the copy buffer size in the copy2 function you can increase copy speeds by yet another factor of 10! (Note, however, that you will start running into bandwidth limits, and the gains are not linear when multi-threading AND tweaking buffer sizes; at some point it does flatten out.)
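To make the buffer-size tweak concrete, here is a minimal sketch (my own, not the asker's code; the 16 MB buffer is an arbitrary assumption you should tune for your storage). It copies the file contents with shutil.copyfileobj using a larger chunk size, then restores metadata with shutil.copystat, roughly what copy2 does:

import os
import shutil

def copy_with_big_buffer(src, dst, buffer_size=16 * 1024 * 1024):
    # if dst is a directory, copy into it under the source file's name
    if os.path.isdir(dst):
        dst = os.path.join(dst, os.path.basename(src))
    # stream the contents in large chunks instead of shutil's small default
    with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
        shutil.copyfileobj(fsrc, fdst, buffer_size)
    shutil.copystat(src, dst)  # preserve timestamps/permissions like copy2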

Asked Jun 02 '17 by Spencer

People also ask

Does multithreading speed up Python?

Both multithreading and multiprocessing allow Python code to run concurrently. Only multiprocessing will allow your code to be truly parallel. However, if your code is IO-heavy (like HTTP requests), then multithreading will still probably speed up your code.
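To make that IO-bound case concrete, here is a minimal sketch (the URL list below is a placeholder assumption, not anything from the original post): a small thread pool overlaps blocking HTTP requests, so the total time approaches that of the slowest request rather than the sum of them all.

from concurrent.futures import ThreadPoolExecutor
import urllib.request

urls = ["https://example.com"] * 8  # placeholder URLs

def fetch(url):
    # the blocking socket read releases the GIL while waiting on the network
    with urllib.request.urlopen(url) as resp:
        return len(resp.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(fetch, urls)))  # sizes of the fetched bodies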

How can multithreading improve performance of the program in Python?

Multithreading in Python streamlines the efficient utilization of resources, as the threads share the same memory and data space. It also allows multiple tasks to run concurrently, which reduces response time and improves performance.

Is multi threading good in Python?

To recap, threading in Python allows multiple threads to be created within a single process, but due to the GIL, no two of them will ever run Python code at the exact same time. Threading is still a very good option when it comes to running multiple I/O bound tasks concurrently.

Which is faster multithreading or multiprocessing?

Threads are faster to start than processes and also faster at task-switching. All threads of a process share one memory pool, which is very beneficial, and it takes less time to create a new thread in an existing process than to spawn a new process.
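If you want to check that claim on your own machine, a quick micro-benchmark sketch (the worker count of 50 is an arbitrary assumption) is to start and join a batch of no-op threads versus no-op processes and compare the wall-clock times:

import multiprocessing
import threading
import time

def noop():
    pass

def time_starts(factory, n=50):
    # start and join n workers, returning the elapsed wall-clock time
    start = time.perf_counter()
    workers = [factory(target=noop) for _ in range(n)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - start

if __name__ == '__main__':  # guard required by multiprocessing on spawn platforms
    print("threads:   %.3fs" % time_starts(threading.Thread))
    print("processes: %.3fs" % time_starts(multiprocessing.Process))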


1 Answer

UPDATE:

I never did get Gevent working (first answer) because I couldn't install the module without an internet connection, which I don't have on my workstation. However, I was able to decrease file copy times by a factor of 8 just using the built-in threading in Python (which I have since learned how to use), and I wanted to post it as an additional answer for anyone interested! Here's my code below; it's probably important to note that my 8x speed-up will most likely differ from environment to environment due to your hardware/network set-up.

import os
import queue  # "Queue" in Python 2; this version targets Python 3
import shutil
import threading

fileQueue = queue.Queue()
destPath = 'path/to/cop'

class ThreadedCopy:
    totalFiles = 0
    copyCount = 0
    lock = threading.Lock()  # guards copyCount across worker threads

    def __init__(self):
        with open("filelist.txt", "r") as txt:  # txt with one file path per line
            fileList = txt.read().splitlines()

        if not os.path.exists(destPath):
            os.mkdir(destPath)

        self.totalFiles = len(fileList)

        print(str(self.totalFiles) + " files to copy.")
        self.threadWorkerCopy(fileList)

    def CopyWorker(self):
        # each worker pulls file names off the shared queue until the program exits
        while True:
            fileName = fileQueue.get()
            shutil.copy(fileName, destPath)
            fileQueue.task_done()
            with self.lock:
                self.copyCount += 1
                percent = (self.copyCount * 100) // self.totalFiles
                print(str(percent) + " percent copied.")

    def threadWorkerCopy(self, fileNameList):
        # 16 daemon threads; tune the count for your own hardware/network
        for i in range(16):
            t = threading.Thread(target=self.CopyWorker)
            t.daemon = True
            t.start()
        for fileName in fileNameList:
            fileQueue.put(fileName)
        fileQueue.join()  # block until every queued copy has been processed

ThreadedCopy()
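As a side note for readers on current Python, here is a shorter sketch of the same pattern (assuming the same filelist.txt and destPath conventions as above) that lets concurrent.futures manage the worker threads and the final join for you:

import shutil
from concurrent.futures import ThreadPoolExecutor

destPath = 'path/to/cop'

with open("filelist.txt", "r") as txt:  # one source path per line
    fileList = txt.read().splitlines()

def copy_one(fileName):
    shutil.copy(fileName, destPath)  # shutil.copy accepts a destination directory

with ThreadPoolExecutor(max_workers=16) as pool:
    # the with-block waits for all copies to finish before exiting
    list(pool.map(copy_one, fileList))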
Answered Sep 23 '22 by Spencer