I have a large file, almost 20 GB with more than 20 million lines, and each line is a separate serialized JSON object. Reading the file line by line in a regular loop and manipulating each line's data takes a lot of time. Is there a state-of-the-art approach or best practice for reading large files in parallel, in smaller chunks, to make the processing faster?

I'm using Python 3.6.x.
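For context, the current approach is roughly the following (a simplified sketch; `process_record` is just a placeholder for the actual per-line work):

```python
import json

def process_record(record):
    # Placeholder for the actual manipulation done on each parsed line.
    ...

with open("data.jsonl", "r", encoding="utf-8") as f:
    for line in f:                 # one serialized JSON object per line
        record = json.loads(line)  # parse the line
        process_record(record)     # do the per-record work
```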
Unfortunately, no. The work done on each line that is read (such as JSON parsing or computation) is CPU-bound, so there are no clever asyncio tactics to speed it up. In theory one could use multiprocessing and multiple cores to read and process in parallel, but having multiple workers read the same file is bound to cause major problems. Because your file is so large, storing it all in memory and then parallelizing the computation is also going to be difficult.
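To make the multiprocessing idea concrete, here is a minimal sketch of one way it is sometimes done: a single parent process reads the file and fans the lines out to a pool of worker processes, so the workers never touch the file themselves. `process_line` is a hypothetical stand-in for your per-line parsing and computation, and `data.jsonl` is an assumed filename:

```python
import json
from multiprocessing import Pool

def process_line(line):
    # Hypothetical CPU-bound work: parse the JSON and compute something from it.
    record = json.loads(line)
    return len(record)  # stand-in result

if __name__ == "__main__":
    with open("data.jsonl", "r", encoding="utf-8") as f, Pool() as pool:
        # Only the parent process reads the file; workers receive the raw lines,
        # so there is no contention over the file handle.
        for result in pool.imap_unordered(process_line, f, chunksize=1000):
            pass  # aggregate results here
```

Whether this actually beats the plain loop depends on how heavy the per-line work is, since shipping 20 million lines to worker processes has pickling overhead of its own.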
Your best bet would be to head this problem off at the pass by partitioning the data (if possible) into multiple files, which opens a much safer door to parallelism across multiple cores. Sorry there isn't a better answer, AFAIK.
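If the data can be partitioned as suggested, a rough sketch of processing the pieces across cores could look like this. It assumes the big file has already been split into hypothetical `part-*.jsonl` files (for example with the Unix `split -l` command):

```python
import glob
import json
from concurrent.futures import ProcessPoolExecutor

def process_file(path):
    # Each worker owns exactly one file, so there is no shared-file contention.
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)  # parse each serialized JSON line
            count += 1                 # replace with the real per-record work
    return path, count

if __name__ == "__main__":
    paths = sorted(glob.glob("part-*.jsonl"))
    with ProcessPoolExecutor() as executor:
        for path, count in executor.map(process_file, paths):
            print(path, count)
```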