We have a file with about 46 million records in CSV format. Each record has about 18 fields, and one of them is a 64-byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be yanked out. So we have written a Python program that reads the 167K IDs into an array and processes the 46-million-record file, checking whether the ID exists in each of those records. Here is a snippet of the code:
import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
We tested the program on a small set of data; here are the processing times:
lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec
We started running the program on the 46-million-record file (which is about 13 GB in size) last night at ~3:00 AM EST; it is now around 10 AM EST and it is still processing!
Question: is there a better alternative to this check?
if fieldAry[CUSTOMER_ID] not in idArray:
Thank you!
Update: this is being processed on an EC2 instance with an attached EBS volume.
You should use a set instead of a list; before the for loop, do:
idArray = set(idArray)
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1
And see the unbelievable speed-up; you're spending days' worth of unnecessary processing time just because you chose the wrong data structure.
The in operator with a set has O(1) time complexity, whereas with a list it has O(n) time complexity. This might sound like "not a big deal", but it is actually the bottleneck in your script, even though a set has a somewhat higher constant factor hidden in that O(1). So your code is spending something like 30,000 times longer on that single in operation than is necessary. If the optimal version required 30 seconds, you would now spend 10 days on that single line alone.
See the following test: I generate 1 million IDs and take 10,000 of them aside into another list, to_remove. I then do a for loop like you do, performing the in operation for each record:
import random
import timeit

all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)

def test_set():
    to_remove_set = set(to_remove)
    for i in all_ids:
        if i in to_remove_set:
            pass

def test_list():
    for i in all_ids:
        if i in to_remove:
            pass

print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))
And the results:
testing list 227.91903045598883
testing set 0.14897623099386692
It took 149 milliseconds for the set version; the list version required 228 seconds. Now these were small numbers: in your case you have 50 million input records against my 1 million, so you need to multiply the "testing set" time by 50: with your data set it would take about 7.5 seconds.
The list version, on the other hand, needs to be multiplied by 50 * 17 - not only are there 50 times more input records, but 17 times more IDs to match against. Thus we get 227 * 50 * 17 = 192,950 seconds.
So your algorithm spends 2.2 days doing something that, by using the correct data structure, can be done in 7.5 seconds. Of course this does not mean you can scan the whole 13 GB document in 7.5 seconds, but it probably won't take more than 2.2 days either. So we changed roughly from:
      2 days                              2.2 days
|reading and writing the files||------- doing id in list ------|

to something like

      2 days                        7.5 seconds (doing id in set)
|reading and writing the files||
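Putting it all together, a minimal sketch of the whole filter using a set might look like this (the file names, the one-ID-per-line format of the ID file, and CUSTOMER_ID = 0 are assumptions for illustration, not details from the original post):

import csv

CUSTOMER_ID = 0  # assumed column index; adjust to match the real layout

# Load the ~167K IDs into a set once, so each membership test is O(1).
with open('ids.txt') as idFile:
    idSet = {line.strip() for line in idFile}

lineCounts = {'orig': 0, 'mod': 0}
with open('input.csv', newline='') as inputFile, \
     open('output.csv', 'w', newline='') as outputFile:
    csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
    csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
    for fieldAry in csvReadHandler:
        lineCounts['orig'] += 1
        if fieldAry[CUSTOMER_ID] not in idSet:   # O(1) set lookup
            csvWriteHandler.writerow(fieldAry)
            lineCounts['mod'] += 1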
Disclaimer: Do not downvote without explaining why, because the OP didn't include his entire code base or hardware/infrastructure design. But if I've made critical errors in my code or logic, explain them and downvote accordingly.
Let's start off by defining the bottlenecks you'll encounter (some obvious, some not).
As for CSV readers: they're fine, but maybe not the most efficient option.
Since you're going to read through both files anyway, I would consider the following:
Read the 13 GB file line by line and save the ID in a dictionary without checking whether the key already exists (see the small sketch below). Why? Because checking whether the value exists is slower than just overwriting it, and dictionaries have the added bonus that keys are unique, so duplicates are weeded out. Alternatively, add it to a set(), as described by many others.
Then read the smaller file line by line and check whether your dict or set contains the ID.
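For illustration, here is a tiny sketch of that insert-without-checking idiom (the IDs are made up):

ids = ['a1', 'b2', 'a1', 'c3']  # made-up IDs with one duplicate

# dict: overwrite unconditionally; duplicate keys simply collapse
seen_dict = {}
for i in ids:
    seen_dict[i] = None          # no "if i in seen_dict" check needed

# set: same idea with add()
seen_set = set()
for i in ids:
    seen_set.add(i)

print(len(seen_dict), len(seen_set))  # 3 3 - duplicates weeded out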
dict() vs set() vs list()
Here's a comparison between the three data types set(), list() and dict():
code used: test.py
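The linked test.py is not reproduced above, so the following is only a rough sketch of how such an insert-and-lookup comparison could be timed; it is an assumption about the shape of the benchmark, not the author's actual script:

import random
import time

ids = [str(random.randint(1, 2**63)) for _ in range(1000000)]

# dict inserts, then dict lookups
start = time.time()
d = {}
for i in ids:
    d[i] = None
print(time.time() - start, len(d), 'Using dict() for inserts')

start = time.time()
for i in ids:
    _ = i in d
print(time.time() - start, 'dict() time spent looking through the data')

# set inserts, then set lookups
start = time.time()
s = set()
for i in ids:
    s.add(i)
print(time.time() - start, len(s), 'Using set() for inserts')

start = time.time()
for i in ids:
    _ = i in s
print(time.time() - start, 'set() time spent looking through the data')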
(11.719028949737549, 999950, 'Using dict() for inserts')
(1.8462610244750977, 'dict() time spent looking through the data')
(11.793760061264038, 999961, 'Using set() for inserts')
(7.019757986068726, 'set() time spent looking through the data')
(15min+, 'Using list()') # Honestly, I never let it run to its finish... it's too slow.
As you can see, dict is marginally quicker than set, while list() just falls behind completely (see Antti's description of why). I should point out that my data is a bit skewed, since this is a quick and dirty demo of the speed difference, but the general idea should still come across.
So if you do not have access to a database version of the source data, and you need to use Python, I'd go with something along the lines of:
delimiter = b'\x01'

big_source = {}
with open('13_gig_source.log', 'rb') as fh:
    for line in fh:
        csv = line.split(delimiter)
        big_source[csv[4]] = None  # 5th column, change to match CUSTOMER_ID

output = open('result.log', 'wb')
with open('smaller_file.log', 'rb') as fh:
    for line in fh:
        csv = line.split(delimiter)
        if csv[4] in big_source:
            output.write(csv[4] + b'\n')
output.close()
Since I didn't know which column the data exists in, I didn't optimize the split(). For instance, if it's the last column you're trying to fetch, do line.rsplit(delimiter, 1)[-1] instead; or if it's the 3rd column, do line.split(delimiter, 3)[2], because split() will then stop looking for further occurrences of the delimiter.
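A small illustration of that maxsplit trick (the sample line is made up):

delimiter = b'\x01'
line = b'a\x01b\x01c\x01d\x01e\n'      # made-up record with five fields

# Full split walks the entire line:
print(line.split(delimiter))            # [b'a', b'b', b'c', b'd', b'e\n']

# Last column: split only once, from the right.
print(line.rsplit(delimiter, 1)[-1])    # b'e\n'

# 3rd column: stop after three splits; the rest of the line stays unsplit.
print(line.split(delimiter, 3)[2])      # b'c'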
Yes, some tools might be better suited for this, such as awk, because it's a specific tool written in C to perform a very specific task. Even though Python is implemented in C, it still has a lot of abstraction layers on top of that C code and will for the most part be slower than the counterpart C tools written for specific tasks.
Now, I have no data to test this with, nor am I a PRO With Nix-Commands (or PWN-C for short), so I'll leave it to someone else to give you examples of that, but I found this, and it might be helpful.