
How to read / process large files in parallel with Python


I have a large file, almost 20 GB with more than 20 million lines, where each line is a separate serialized JSON object.

Reading the file line by line in a regular loop and manipulating each line's data takes a lot of time.

Is there any state-of-the-art approach or best practice for reading large files in parallel, in smaller chunks, to make processing faster?

I'm using Python 3.6.X
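
For reference, this is roughly the sequential pattern described above; `data.jsonl` and `process_record()` are placeholders, not names from my actual code:

```python
import json

def process_record(record):
    # placeholder for the per-line manipulation
    pass

with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:          # one serialized JSON object per line
        record = json.loads(line)
        process_record(record)
```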

asked Jun 01 '18 by Nodirbek Shamsiev


1 Answer

Unfortunately, no. Reading the file and operating on the lines read (such as JSON parsing or computation) is CPU-bound work, so there are no clever asyncio tactics to speed it up. In theory one could use multiprocessing and multiple cores to read and process in parallel, but having multiple threads read the same file is bound to cause major problems. And because your file is so large, loading it all into memory and then parallelizing the computation is also going to be difficult.
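
If parallelism is attempted anyway, one pattern that keeps a single reader on the file and farms out only the CPU-bound parsing is a multiprocessing.Pool fed with batches of raw lines. A minimal sketch, where handle_batch() and the batch size of 10,000 lines are illustrative assumptions, not a recommendation specific to your data:

```python
import json
from itertools import islice
from multiprocessing import Pool

def handle_batch(lines):
    """Parse and process one batch of raw lines in a worker process."""
    results = []
    for line in lines:
        record = json.loads(line)
        results.append(len(record))  # placeholder for real per-record work
    return results

def batches(f, size=10_000):
    """Yield lists of up to `size` raw lines from an open file."""
    while True:
        batch = list(islice(f, size))
        if not batch:
            return
        yield batch

if __name__ == "__main__":
    with open("data.jsonl", "r", encoding="utf-8") as f, Pool() as pool:
        # One process reads; the workers do the CPU-bound parsing in parallel.
        for result in pool.imap_unordered(handle_batch, batches(f)):
            pass  # aggregate results here
```

The batch size trades IPC overhead against memory use: larger batches mean fewer round trips between the reader and the workers, but more lines held in flight at once.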

Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which would open the door to safer parallelism across multiple cores. Sorry there isn't a better answer, AFAIK.
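
As a sketch of that partitioning idea, assuming the 20 GB file has already been split into smaller newline-delimited pieces named part_* (for example with the Unix `split -l` command), each worker can then own one file outright:

```python
import glob
import json
from multiprocessing import Pool

def process_file(path):
    """Parse every JSON line in one partition; return a simple summary."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            count += 1  # placeholder for real per-record work
    return path, count

if __name__ == "__main__":
    with Pool() as pool:
        for path, count in pool.map(process_file, sorted(glob.glob("part_*"))):
            print(path, count)
```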

answered Oct 11 '22 by BowlingHawk95