Given a file that looks like this:
1440927 1
1727557 3
1440927 2
9917156 4
The first field is an ID in range(0, 200000000). The second field represents a type, which is in range(1, 5). Type 1 and type 2 belong to a common category S1, while type 3 and type 4 belong to S2. A single ID may have several records with different types. The file is about 200MB in size.
The problem is to count the number of IDs that have a record of type 1 or 2, and the number of IDs that have a record of type 3 or 4.
My code:
import bitarray

def gen(path):
    for line in open(path):
        tmp = line.split()
        id = int(tmp[0])
        yield id, int(tmp[1])

max_id = 200000000
S1 = bitarray.bitarray(max_id)
S2 = bitarray.bitarray(max_id)
for id, type in gen(path):
    if type != 3 and type != 4:
        S1[id] = True
    else:
        S2[id] = True
print S1.count(), S2.count()
Although it gives the answer, I think it runs a little slowly. What should I do to make it run faster?
EDIT:
There are duplicate records in the file, and I only need to distinguish between S1 (type 1 and type 2) and S2 (type 3 and type 4). For example, 1440927 1 and 1440927 2 are counted only once, not twice, because they both belong to S1. So I have to store the IDs.
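The deduplication requirement can be illustrated with a minimal set-based sketch (my own illustration, not the bitarray approach from the question; the records and expected counts come from the sample file at the top):

```python
# Each ID is counted at most once per category, matching the dedup rule:
# 1440927 appears with type 1 and type 2, but contributes 1 to S1, not 2.
records = [(1440927, 1), (1727557, 3), (1440927, 2), (9917156, 4)]

s1_ids = set()  # IDs seen with type 1 or 2
s2_ids = set()  # IDs seen with type 3 or 4

for rec_id, rec_type in records:
    if rec_type in (1, 2):
        s1_ids.add(rec_id)
    else:
        s2_ids.add(rec_id)

print(len(s1_ids), len(s2_ids))  # 1 2
```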
You're using an iterator over the file, which means you only buffer a few lines at a time. Every time the buffer is empty, the disk needs to seek and your program has to wait.
200MB easily fits into memory, so reading all the lines at once will speed things up:
def gen(path):
    # load all the lines at once
    lines = open(path).readlines()
    split = (line.split() for line in lines)
    return ((int(x), int(y)) for x, y in split)
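One way to consume such a batched generator end to end might look like the sketch below. Note that `count_categories`, the lines-iterable variant of `gen`, and the use of a plain `bytearray` of flags in place of the third-party bitarray package are my own illustration (standard library only), not part of the original answer:

```python
def gen(lines):
    # same shape as the answer's gen(), but takes any iterable of lines
    split = (line.split() for line in lines)
    return ((int(x), int(y)) for x, y in split)

def count_categories(lines, max_id=200000000):
    s1 = bytearray(max_id)  # flag per ID: seen with type 1 or 2
    s2 = bytearray(max_id)  # flag per ID: seen with type 3 or 4
    for rec_id, rec_type in gen(lines):
        if rec_type in (1, 2):
            s1[rec_id] = 1
        else:
            s2[rec_id] = 1
    # count(1) runs at C speed, like bitarray's count()
    return s1.count(1), s2.count(1)

sample = ["1440927 1\n", "1727557 3\n", "1440927 2\n", "9917156 4\n"]
print(count_categories(sample, max_id=10000000))  # (1, 2)
```

Setting a flag twice for the same ID is a no-op, so duplicates are handled for free, exactly as with the bitarray version in the question.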