I am currently trying to read a large file (80 million lines), where I need to perform a computationally intensive matrix multiplication for each entry. After calculating this, I want to insert the result into a database. Because of the time-intensive nature of this process, I want to split the file across multiple cores to speed things up.
After some research I found this promising approach, which splits a file into n blocks:
def file_block(fp, number_of_blocks, block):
    '''
    A generator that splits a file into blocks and iterates
    over the lines of one of the blocks.
    '''
    assert 0 <= block < number_of_blocks

    fp.seek(0, 2)                # seek to the end of the file
    file_size = fp.tell()

    # Byte offsets of this block (floor division keeps them integers)
    ini = file_size * block // number_of_blocks
    end = file_size * (1 + block) // number_of_blocks

    if ini <= 0:
        fp.seek(0)
    else:
        # Back up one byte and discard the partial line at the block start;
        # the previous block yields that line instead.
        fp.seek(ini - 1)
        fp.readline()

    while fp.tell() < end:
        yield fp.readline()
You can call the function iteratively like this:
if __name__ == '__main__':
    fp = open(filename)
    number_of_chunks = 4
    for chunk_number in range(number_of_chunks):
        print(chunk_number, 100 * '=')
        for line in file_block(fp, number_of_chunks, chunk_number):
            process(line)
While this works sequentially, I run into problems when parallelizing it with multiprocessing:
from multiprocessing import Pool, cpu_count

fp = open(filename)
number_of_chunks = 4
li = [file_block(fp, number_of_chunks, chunk_number) for chunk_number in range(number_of_chunks)]

p = Pool(cpu_count() - 1)
p.map(processChunk, li)
The error is that generators cannot be pickled.
While I understand the error, iterating over the whole file first just to put all lines into a list is too expensive.
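You can reproduce the limitation in isolation: Pool.map pickles each argument in order to send it to a worker process, and a generator carries live execution state that pickle cannot serialize.

    import pickle

    gen = (x * x for x in range(3))
    pickle.dumps(gen)  # TypeError: cannot pickle 'generator' object
                       # (exact message wording varies by Python version)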
Moreover, I want each core to process a block of lines per iteration, because inserting multiple lines into the database at once is more efficient than inserting them one by one (as the typical map-over-lines approach would).
Thanks for your help.
Instead of creating the generators up front and passing them into each worker process, leave that to the worker code. Each worker receives only a picklable tuple of plain values and builds its own generator locally:
from multiprocessing import Pool, cpu_count

def processChunk(params):
    # Each worker opens the file itself, so nothing unpicklable
    # ever crosses the process boundary.
    filename, chunk_number, number_of_chunks = params
    with open(filename, 'r') as fp:
        for line in file_block(fp, number_of_chunks, chunk_number):
            process(line)

if __name__ == '__main__':
    li = [(filename, i, number_of_chunks) for i in range(number_of_chunks)]
    p = Pool(cpu_count() - 1)
    p.map(processChunk, li)