What is the best way to load multiple files into memory in parallel using python 3.6?

Question

I have 6 large files which each of them contains a dictionary object that I saved in a hard disk using pickle function. It takes about 600 seconds to load all of them in sequential order. I want to start loading all them at the same time to speed up the process. Suppose all of them have the same size, I hope to load them in 100 seconds instead. I used multiprocessing and apply_async to load each of them separately but it runs like sequential. This is the code I used and it doesn't work. The code is for 3 of these files but it would be the same for six of them. I put the 3rd file in another hard disk to make sure the IO is not limited.

def loadMaps():    
    start = timeit.default_timer()
    procs = []
    pool = Pool(3)
    pool.apply_async(load1(),)
    pool.apply_async(load2(),)
    pool.apply_async(load3(),)
    pool.close()
    pool.join()
    stop = timeit.default_timer()
    print('loadFiles takes in %.1f seconds' % (stop - start))

user4815162342 · Accepted Answer

If your code is primarily limited by IO and the files are on multiple disks, you might be able to speed it up using threads:

import concurrent.futures
import pickle

def read_one(fname):
    with open(fname, 'rb') as f:
        return pickle.load(f)

def read_parallel(file_names):
    with concurrent.futures.ThreadPoolExecutor() as executor:
        futures = [executor.submit(read_one, f) for f in file_names]
        return [fut.result() for fut in futures]

The GIL will not force IO operations to run serialized because Python consistently releases it when doing IO.

Several remarks on alternatives:

multiprocessing is unlikely to help because, while it guarantees to do its work in multiple processes (and therefore free of the GIL), it also requires the content to be transferred between the subprocess and the main process, which takes additional time.
asyncio will not help you at all because it doesn't natively support asynchronous file system access (and neither do the popular OS'es). While it can emulate it with threads, the effect is the same as the code above, only with much more ceremony.
Neither option will speed up loading the six files by a factor of six. Consider that at least some of the time is spent creating the dictionaries, which will be serialized by the GIL. If you want to really speed up startup, a better approach is not to create the whole dictionary upfront and switch to an in-file database, possibly using the dictionary to cache access to its content.

What is the best way to load multiple files into memory in parallel using python 3.6?

Tags:

python

python-3.x

parallel-processing

python-multiprocessing

python-asyncio

eSadr

1 Answers

user4815162342

Recent Activity

Donate For Us

What is the best way to load multiple files into memory in parallel using python 3.6?

Tags:

python

python-3.x

parallel-processing

python-multiprocessing

python-asyncio

eSadr

1 Answers

user4815162342

Related questions

Recent Activity

Donate For Us