How can I find the intersection of two large files efficiently using Python?

I have two large files. Their contents look like this:

134430513
125296589
151963957
125296589

Each file contains an unsorted list of ids. Some ids may appear more than once in a single file.

Now I want to find the intersection of the two files, that is, the ids that appear in both files.

I just read the two files into two sets, s1 and s2, and get the intersection with s1.intersection(s2). But this consumes a lot of memory and seems slow.

So is there a better or more Pythonic way to do this? And if a file contains so many ids that they cannot be read into a set with the limited memory, what can I do?

EDIT: I read the files into the two sets using a generator:

def id_gen(path):
    # Yield each id in the file as an int, one line at a time.
    with open(path) as f:
        for line in f:
            yield int(line.split()[0])

c1 = id_gen(path)
s1 = set(c1)

All of the ids are numeric, and the maximum id may be 5000000000. If I use a bitarray, it will consume even more memory.

asked Sep 07 '11 by amazingjxq


2 Answers

Others have shown the more idiomatic ways of doing this in Python, but if the data really is too big, you can use the system utilities to sort and eliminate duplicates, and then exploit the fact that a file object is an iterator that returns one line at a time, doing something like:

import os

# Sort each file numerically and drop duplicates using the system sort.
os.system('sort -u -n s1.num > s1.ns')
os.system('sort -u -n s2.num > s2.ns')

i1 = open('s1.ns', 'r')
i2 = open('s2.ns', 'r')
try:
    # Merge-join the two sorted streams, holding one line per file in memory.
    # Convert to int so comparisons are numeric, matching the sort -n order.
    d1 = int(next(i1))
    d2 = int(next(i2))
    while True:
        if d1 < d2:
            d1 = int(next(i1))
        elif d2 < d1:
            d2 = int(next(i2))
        else:
            print(d1)
            d1 = int(next(i1))
            d2 = int(next(i2))
except StopIteration:
    pass

This avoids holding more than one line per file in memory at a time (and the system sort should be faster than anything Python can do, since it is optimized for this one task).
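As a side note (not part of the original answer), if you sort the files lexicographically (plain sort -u, without -n), the POSIX comm utility can do the whole merge for you. A minimal sketch, assuming sort and comm are available and reusing the s1.num/s2.num file names from above:

import subprocess

# Lexicographic sort (no -n) so comm's line-by-line comparison matches the sort order.
subprocess.run('sort -u s1.num > s1.ns', shell=True, check=True)
subprocess.run('sort -u s2.num > s2.ns', shell=True, check=True)

# comm -12 suppresses the lines unique to each file, printing only the common lines.
with open('common.num', 'w') as out:
    subprocess.run(['comm', '-12', 's1.ns', 's2.ns'], stdout=out, check=True)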

answered Nov 16 '22 by James Kanze


set(open(file1)) & set(open(file2))

which is equivalent to using intersection, is the most Pythonic way. You might be able to speed it up by doing

set(int(x) for x in open(file1)) & set(int(x) for x in open(file2))

since then you'll be storing and comparing integers rather than strings. This only works if all ids are numeric, of course.

If it's still not fast enough, you can turn to a slightly more imperative style:

# heuristic: set smaller_file and larger_file by checking the file size
a = set(int(x) for x in open(smaller_file))
# note: we're storing strings in r
r = set(x for x in open(larger_file) if int(x) in a)
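For illustration (not from the original answer), the smaller_file/larger_file choice mentioned in the heuristic comment could be automated by comparing file sizes. A minimal sketch with placeholder file names:

import os

# Placeholder file names; pick the smaller file by its size on disk.
file1, file2 = 's1.num', 's2.num'
smaller_file, larger_file = sorted((file1, file2), key=os.path.getsize)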

If both files are guaranteed not to contain duplicates, you can also use a list to speed things up:

a = set(int(x) for x in open(smaller_file))
r = [x for x in open(larger_file) if int(x) in a]

Be sure to measure the various solutions, and check that you aren't actually just waiting on disk or network I/O.
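For example, a rough wall-clock measurement with time.perf_counter could look like the sketch below (not part of the original answer; file names are placeholders):

import time

def timed(label, fn):
    # Wall-clock timing: appropriate when I/O may dominate, not just CPU time.
    start = time.perf_counter()
    result = fn()
    print(f'{label}: {time.perf_counter() - start:.2f}s')
    return result

# Placeholder file names.
common = timed('set intersection',
               lambda: set(int(x) for x in open('s1.num')) &
                       set(int(x) for x in open('s2.num')))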

answered Nov 16 '22 by Fred Foo