
zip() alternative for iterating through two iterables

Tags:

python

I have two large (~100 GB) text files that must be iterated through simultaneously.

zip() works well for smaller files, but I found out that it actually builds a list of line pairs from my two files, so every line ends up stored in memory. I don't need to use any line more than once.

handle1 = open('filea', 'r')
handle2 = open('fileb', 'r')

for i, j in zip(handle1, handle2):
    # do something with i and j
    # write to an output file
    # no need to do anything with i and j after this
    pass

Is there an alternative to zip() that acts as a generator, so that I can iterate through these two files without using more than 200 GB of RAM?

Austin Richardson asked Feb 24 '10 03:02




2 Answers

You can use izip_longest like this to pad the shorter file with empty lines.

In Python 2.6:

from itertools import izip_longest
with open('filea', 'r') as handle1:
    with open('fileb', 'r') as handle2:
        for i, j in izip_longest(handle1, handle2, fillvalue=""):
            ...

Or in Python 3+:

from itertools import zip_longest
with open('filea', 'r') as handle1, open('fileb', 'r') as handle2:
    for i, j in zip_longest(handle1, handle2, fillvalue=""):
        ...
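
To make the padding behaviour concrete, here is a minimal Python 3 sketch. The helper name paired_lines and the in-memory io.StringIO objects are assumptions for illustration only; they stand in for the real file handles, and the file contents are made up.

```python
import io
from itertools import zip_longest


def paired_lines(file_a, file_b, fill=""):
    """Yield (line_a, line_b) pairs lazily, padding the shorter input with `fill`."""
    for a, b in zip_longest(file_a, file_b, fillvalue=fill):
        # Strip the trailing newline so the pairs are easy to compare.
        yield a.rstrip("\n"), b.rstrip("\n")


# In-memory stand-ins for 'filea' and 'fileb' (hypothetical contents).
file_a = io.StringIO("a1\na2\na3\n")
file_b = io.StringIO("b1\n")

# list() is used here only to display the result of this tiny demo;
# with real 100 GB files you would loop over the generator instead.
pairs = list(paired_lines(file_a, file_b))
print(pairs)
```

The shorter input is padded with empty strings, so the loop runs for as many lines as the longer file has.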
John La Rooy answered Sep 29 '22 06:09


In Python 2, itertools has a function izip that does this; it yields pairs lazily instead of building a list:

from itertools import izip
for i, j in izip(handle1, handle2):
    ...

If the files have different numbers of lines, you can use izip_longest instead, because izip stops as soon as the shorter file is exhausted.
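
For readers on Python 3, where itertools.izip no longer exists, the built-in zip() already returns a lazy iterator, so the same loop can be written without any import. The sketch below assumes a hypothetical helper merge_files and uses tab-joining as a placeholder for "do something with i and j"; the temporary files are tiny stand-ins for the real inputs.

```python
import os
import tempfile


def merge_files(path_a, path_b, path_out):
    """Write tab-joined line pairs to path_out, holding one pair in memory at a time."""
    count = 0
    with open(path_a) as fa, open(path_b) as fb, open(path_out, "w") as out:
        # In Python 3, zip() yields pairs on demand and stops at the shorter file.
        for a, b in zip(fa, fb):
            out.write(a.rstrip("\n") + "\t" + b.rstrip("\n") + "\n")
            count += 1
    return count


# Small temporary files standing in for 'filea' and 'fileb' (made-up contents).
tmp_dir = tempfile.mkdtemp()
path_a = os.path.join(tmp_dir, "filea")
path_b = os.path.join(tmp_dir, "fileb")
path_out = os.path.join(tmp_dir, "merged")

with open(path_a, "w") as f:
    f.write("1\n2\n3\n")
with open(path_b, "w") as f:
    f.write("x\ny\n")

n = merge_files(path_a, path_b, path_out)
with open(path_out) as f:
    merged = f.read()
print(n, repr(merged))
```

Because zip() stops at the shorter input, the third line of the first file is dropped here, which matches the behaviour described for izip.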

Anurag Uniyal answered Sep 29 '22 04:09