
python analysing two large files simultaneously line by line

Tags:

python

I'm trying to analyse two ±6 GB files. I need to analyse them simultaneously, because I need two lines at the same time (one from each file). I tried to do something like this:

with open(fileOne, "r") as First_file:
    for index, line in enumerate(First_file):

        # Do some stuff here

    with open(fileTwo, "r") as Second_file:
        for index, line in enumerate(Second_file):

            # Do stuff here as well

The problem is that the second "with open" loop starts at the beginning of the file for every line of the first file, so the analysis would take far too long. I also tried this:

with open(fileOne, "r") as f1, open(fileTwo, "r") as f2:
    for index, (line_R1, line_R2) in enumerate(zip(f1, f2)):

The problem is that both files are loaded directly into memory. I need the same line from each file. The correct line is:

number_line%4 == 1

With the 0-based index from enumerate, this gives lines 2, 6, 10, 14, etc. I need those lines from both files.
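For illustration, the selection rule can be checked with a small sketch; the io.StringIO handles are hypothetical stand-ins for the two large files (the record layout here is made up, FASTQ-style):

```python
import io

# Hypothetical stand-ins for the two large files (4-line records,
# with the interesting line second in each record).
f1 = io.StringIO("@id1\nSEQ_A\n+\nQUAL\n@id2\nSEQ_B\n+\nQUAL\n")
f2 = io.StringIO("@id1\nseq_a\n+\nqual\n@id2\nseq_b\n+\nqual\n")

pairs = []
for index, (line_R1, line_R2) in enumerate(zip(f1, f2)):
    if index % 4 == 1:  # 0-based indices 1, 5, 9, ... = lines 2, 6, 10, ...
        pairs.append((line_R1.strip(), line_R2.strip()))

print(pairs)  # [('SEQ_A', 'seq_a'), ('SEQ_B', 'seq_b')]
```

In Python 3 both file iteration and zip() are lazy, so this keeps only one line per file in memory at a time; in Python 2 the same filter works with itertools.izip() in place of zip().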

Is there a faster way and more memory-efficient way to do this?

TheBumpper asked Feb 12 '23 23:02


1 Answer

In Python 2, use itertools.izip() to prevent the files being loaded into memory:

from itertools import izip

with open(fileOne, "r") as f1, open(fileTwo, "r") as f2:
    for index, (line_R1, line_R2) in enumerate(izip(f1, f2)):
        # process line_R1 and line_R2 here

In Python 2, the built-in zip() function will indeed read both file objects into memory in their entirety; izip() retrieves lines one at a time. (In Python 3, zip() is already lazy, so no import is needed.)
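Combined with the index % 4 == 1 selection from the question, the pairing and the filtering can be done in one lazy pass with itertools.islice; this Python 3 sketch uses io.StringIO handles as stand-ins for the real files:

```python
from itertools import islice
import io

# Stand-ins for the real file handles; file iteration and zip() are
# both lazy in Python 3, so only one line per file is held in memory.
f1 = io.StringIO("h\nAAA\n+\nq\nh\nCCC\n+\nq\n")
f2 = io.StringIO("h\nTTT\n+\nq\nh\nGGG\n+\nq\n")

# islice(pairs, 1, None, 4) keeps the pairs at 0-based indices
# 1, 5, 9, ... — exactly the index % 4 == 1 lines.
wanted = [(a.strip(), b.strip())
          for a, b in islice(zip(f1, f2), 1, None, 4)]
print(wanted)  # [('AAA', 'TTT'), ('CCC', 'GGG')]
```

Under Python 2, substituting itertools.izip for zip gives the same streaming behaviour.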

Martijn Pieters answered Mar 07 '23 12:03