Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to improve performance of this counting program?

Given a file looks like this:

1440927 1
1727557 3
1440927 2
9917156 4

The first field is an ID which is in range(0, 200000000). The second field represents a type , which is in range(1, 5). And type 1 and type 2 belong to a common category S1, while type 3 and type 4 belong to S2. One single ID may have several records with different type. The file is about 200MB in size.

The problem is to count the number of IDs which has a record of type 1 or 2, and the number of IDs which has a record of type 3 or 4.

My code:

def gen(path):
    line_count = 0
    for line in open(path):
        tmp = line.split()
        id = int(tmp[0])
        yield id, int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True

print S1.count(), S2.count()

Although it gives the answer, I think it runs a little slowly. What should I do to make it run faster?

EDIT: There are duplicated records in the file. And I only need to distinguish between S1(type 1 and type 2) and S2(type 3 and type 4). For example, 1440927 1 and 1440927 2 are counted only once but not twice because they belong to S1. So I have to store the IDs.

like image 479
amazingjxq Avatar asked Dec 06 '11 09:12

amazingjxq


1 Answers

You're using a iterator over the file, this means that you only buffer a few lines at the time. Every time the buffer is empty the disk needs to seek and your program has to wait.

200MB easily fits into your memory, so getting all lines will speed things up:

def gen(path):
    # load all the lines, 
    lines = open(path).readlines() 
    split = (line.split() for line in lines)
    return ((int(x), int(y)) for x,y in split)
like image 126
Jochen Ritzel Avatar answered Sep 21 '22 15:09

Jochen Ritzel