I am trying to write a multithreaded program in Python to accelerate the copying of (under 1000) .csv files. The multithreaded code runs even slower than the sequential approach. I timed the code with the standard profile module. I am sure I must be doing something wrong, but I'm not sure what.
The Environment:
The Approach:
I put all the file paths in a Queue and create 4-8 worker threads that pull file paths from the queue and copy the designated file. In no case is the multithreaded code faster. I assume this is an I/O-bound task, so multithreading should speed the operation up.
The Code:
import Queue
import threading
import shutil
import glob     # expand file wildcards, e.g. glob.glob('*.csv')
import profile  # profile the run

fileQueue = Queue.Queue()  # global work queue
srcPath = 'C:\\temp'
destPath = 'D:\\temp'

def CopyWorker():
    while True:
        fileName = fileQueue.get()
        shutil.copy(fileName, destPath)
        fileQueue.task_done()  # signal completion only after the copy is done

def threadWorkerCopy(fileNameList):
    print 'threadWorkerCopy: ', len(fileNameList)
    for i in range(4):
        t = threading.Thread(target=CopyWorker)
        t.daemon = True
        t.start()
    for fileName in fileNameList:
        fileQueue.put(fileName)
    fileQueue.join()  # block until every queued copy has finished

def sequentialCopy(fileNameList):
    # around 160.446 seconds, 152 seconds
    print 'sequentialCopy: ', len(fileNameList)
    cnt = 0
    ctotal = len(fileNameList)
    for fileName in fileNameList:
        shutil.copy(fileName, destPath)
        cnt += 1
        print 'copied: ', cnt, ' of ', ctotal

def main():
    fileList = glob.glob(srcPath + '\\' + '*.csv')
    #sequentialCopy(fileList)
    threadWorkerCopy(fileList)

if __name__ == '__main__':
    profile.run('main()')
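For what it's worth, the same worker-pool pattern can be written more compactly on Python 3 with concurrent.futures. This is a sketch under that assumption, not part of the original code; the `threaded_copy` name is mine:

```python
import shutil
from concurrent.futures import ThreadPoolExecutor

def threaded_copy(file_names, dest_path, workers=4):
    """Copy every file in file_names into dest_path using a pool of threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() consumes the iterator; the `with` block waits for all copies
        # to finish, playing the role of fileQueue.join() in the code above.
        list(pool.map(lambda f: shutil.copy(f, dest_path), file_names))
```

Because the executor is joined when the `with` block exits, there is no need for daemon threads, sentinels, or task_done bookkeeping.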
Of course it's slower. The hard drive has to seek back and forth between the files constantly. Your belief that multithreading would make this task faster is unjustified: the limiting factor is how fast you can read data from or write data to the disk, and every seek from one file to another is time lost that could have been spent transferring data.
I think I can verify that it is a disk I/O situation. I ran a similar test on my machine, copying from an extremely fast network server back onto itself, and I saw a substantial speedup just using your code above (4 threads). My test copied 4137 files totaling 16.5 GB:
Sequential copy was 572.033 seconds.
Threaded (4) copy was 180.093 seconds.
Threaded (10) copy was 110.155 seconds.
Threaded (20) copy was 86.745 seconds.
Threaded (40) copy was 87.761 seconds.
As you can see there is a bit of a "falloff" as you get into higher and higher thread counts, but at 4 threads I had a huge speed increase. I'm on a VERY fast computer with a very fast network connection, so I think I can safely assume that you are hitting an I/O limit.
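That falloff is easy to measure on your own data. A minimal Python 3 timing sketch (the `copy_with_pool` and `benchmark` helpers are mine, and each run writes into a throw-away temp directory so the runs don't interfere):

```python
import shutil
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

def copy_with_pool(file_names, dest_path, workers):
    """Copy file_names into dest_path with `workers` threads; return elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda f: shutil.copy(f, dest_path), file_names))
    return time.perf_counter() - start

def benchmark(file_names, worker_counts=(1, 4, 10, 20, 40)):
    """Time the same copy job at several pool sizes, one fresh dest dir per run."""
    results = {}
    for n in worker_counts:
        dest = tempfile.mkdtemp()
        results[n] = copy_with_pool(file_names, dest, n)
        shutil.rmtree(dest)  # discard the copies between runs
    return results
```

Plotting the returned dict against thread count should reproduce the knee seen above once the disk or network link saturates.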
That said, check out the response I got here: Python multiprocess/multithreading to speed up file copying. I haven't had a chance to try this code out yet, but it is possible that gevent could be faster.