I have a large file, almost 20 GB with more than 20 million lines, and each line is a separate serialized JSON object. Reading the file line by line in a regular loop and manipulating each line's data takes a lot of time. Is there a state-of-the-art approach or best practice for reading large files in parallel, in smaller chunks, to make the processing faster?

I'm using Python 3.6.x.
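For context, the current approach is roughly the following (a simplified sketch; `process_record` is just a placeholder for the actual per-line work):

```python
import json

def process_record(record):
    # Placeholder for the actual manipulation done on each parsed line.
    ...

with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:                 # one serialized JSON object per line
        record = json.loads(line)  # parse the line
        process_record(record)     # do the per-record work
```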
Unfortunately, no. The work done on each line that is read (such as JSON parsing or computation) is CPU-bound, so there are no clever asyncio tactics to speed it up. In theory one could use multiprocessing and multiple cores to read and process in parallel, but having multiple workers read the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
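To make the multiprocessing idea concrete, here is a minimal sketch of one way it is sometimes done: a single parent process reads the file and fans the lines out to a pool of worker processes, so the workers never touch the file themselves. `process_line` is a hypothetical stand-in for your per-line parsing and computation, and `data.jsonl` is an assumed filename:

```python
import json
from multiprocessing import Pool

def process_line(line):
    # Hypothetical CPU-bound work: parse the JSON and compute something from it.
    record = json.loads(line)
    return len(record)  # stand-in result

if __name__ == "__main__":
    with open("data.jsonl", "r", encoding="utf-8") as f, Pool() as pool:
        # Only the parent process reads the file; workers receive the raw lines,
        # so there is no contention over the file handle.
        for result in pool.imap_unordered(process_line, f, chunksize=1000):
            pass  # aggregate results here
```

Whether this actually beats the plain loop depends on how heavy the per-line work is, since shipping 20 million lines to worker processes has pickling overhead of its own.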
Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which opens a much safer door to parallelism across multiple cores. Sorry there isn't a better answer, AFAIK.
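If the data can be partitioned as suggested, a rough sketch of processing the pieces across cores could look like this. It assumes the big file has already been split into hypothetical `part-*.jsonl` files (for example with the Unix `split -l` command):

```python
import glob
import json
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    # Each worker owns exactly one file, so there is no shared-file contention.
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # parse each serialized JSON line
            count += 1                 # replace with the real per-record work
    return path, count

if __name__ == "__main__":
    paths = sorted(glob.glob("part-*.jsonl"))
    with ProcessPoolExecutor() as executor:
        for path, count in executor.map(process_file, paths):
            print(path, count)
```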