
Efficiently processing ~50 million record file in python

We have a file with about 46 million records in CSV format. Each record has about 18 fields, and one of them is a 64-byte ID. We have another file with about 167K unique IDs in it. The records corresponding to those IDs need to be yanked out. So we have written a Python program that reads the 167K IDs into an array and processes the 46-million-record file, checking whether the ID exists in each of those records. Here is a snippet of the code:

import csv
...
csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1

We tested the program on a small set of data; here are the processing times:

lines: 117929 process time: 236.388447046 sec
lines: 145390 process time: 277.075321913 sec

We started running the program on the 46-million-record file (which is about 13 GB in size) last night around 3:00 AM EST; it is now around 10 AM EST and it is still processing!

Questions:

  1. Is there a better way to process these records to improve on processing times?
  2. Is python the right choice? Would awk or some other tool help?
  3. I am guessing the 64-byte ID look-up against the 167K-entry array in the following statement is the culprit:
    if fieldAry[CUSTOMER_ID] not in idArray:

Is there a better alternative?

Thank you!

Update: This is processed on an EC2 instance with EBS attached volume.

asked Mar 11 '16 by S.A


2 Answers

You should use a set instead of a list; before the for loop, do:

idArray = set(idArray)

csvReadHandler = csv.reader(inputFile, delimiter=chr(1))
csvWriteHandler = csv.writer(outputFile, delimiter=chr(1), lineterminator='\n')
for fieldAry in csvReadHandler:
    lineCounts['orig'] += 1
    if fieldAry[CUSTOMER_ID] not in idArray:
        csvWriteHandler.writerow(fieldAry)
        lineCounts['mod'] += 1

And see the unbelievable speed-up; you're wasting days' worth of processing time just because you chose the wrong data structure.


The in operator has O(1) average time complexity with a set, but O(n) with a list. This might sound like "not a big deal", but it is actually the bottleneck in your script, even though a set has somewhat higher constant factors hidden inside that O. A linear scan of a 167K-entry list touches roughly 83K entries on average for every record, so your code spends something like 30,000 times longer on that single in operation than is necessary. If the optimal version would need 30 seconds, you now spend 10 days on that single line alone.

See the following test: I generate 1 million IDs and set 10,000 of them aside into another list, to_remove. I then run a for loop like yours, doing the in operation for each record:

import random
import timeit

all_ids = [random.randint(1, 2**63) for i in range(1000000)]
to_remove = all_ids[:10000]
random.shuffle(to_remove)
random.shuffle(all_ids)


def test_set():
    to_remove_set = set(to_remove)
    for i in all_ids:
        if i in to_remove_set:
            pass

def test_list():
    for i in all_ids:
        if i in to_remove:
            pass


print('starting')
print('testing list', timeit.timeit(test_list, number=1))
print('testing set', timeit.timeit(test_set, number=1))

And the results:

testing list 227.91903045598883
testing set 0.14897623099386692

The set version took 149 milliseconds; the list version required 228 seconds. Now, these were small numbers: in your case you have 50 million input records against my 1 million, so you need to multiply the set time by 50: with your data set it would take about 7.5 seconds.

The list version, on the other hand, needs to be multiplied by 50 * 17: not only are there 50 times more input records, there are also 17 times more IDs to match against (167K versus my 10K). Thus we get 227 * 50 * 17 = 192,950 seconds.
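Converting those scaled figures into days, just to spell the arithmetic out:

set_time = 0.149 * 50        # about 7.5 seconds
list_time = 227 * 50 * 17    # 192950 seconds
print(list_time / 86400.0)   # roughly 2.2 days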

So your algorithm spends 2.2 days doing something that, with the correct data structure, can be done in about 7.5 seconds. Of course this does not mean you can scan the whole 13 GB file in 7.5 seconds, but the total probably won't be much more than the roughly 2 days of reading and writing either. So we changed from:

             2 days                           2.2 days 
 |reading and writing the files||------- doing id in list ------|

to something like

             2 days            7.5 seconds (doing id in set)
 |reading and writing the files||

Disclaimer: Please do not downvote without explaining why, because the OP didn't include their entire code base or hardware/infrastructure design. But if I've made critical errors in my code or logic, explain them and downvote accordingly.

Let's start off by defining the bottlenecks you'll encounter (some obvious, some not).

  • Hard drives - they're slow and don't cache large quantities of data
  • Re-reading the same data multiple times
  • Memory - you can't hold a 13 GB file in RAM, or maybe you can, and then that's an option?

To solve these, you could go multiple routes.

  1. One clearly beneficial route would be to read the large data into a database (PostgreSQL or MariaDB, for instance); a rough sketch of that route follows below. But I'll assume this isn't at all possible for now.
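Purely as an illustration, here is a rough sketch of what that database route could look like. It uses Python's built-in sqlite3 module only to keep the example self-contained (the suggestion above is PostgreSQL or MariaDB), and the file names and ID column index are assumptions, not details taken from the question.

import csv
import sqlite3

DELIM = chr(1)
CUSTOMER_ID = 4  # assumed position of the 64-byte ID column

conn = sqlite3.connect('records.db')
conn.execute('CREATE TABLE records (customer_id TEXT, line TEXT)')
conn.execute('CREATE TABLE wanted (customer_id TEXT PRIMARY KEY)')

# Load the big CSV once, storing each raw line next to its ID.
with open('big_file.csv', newline='') as f:
    reader = csv.reader(f, delimiter=DELIM)
    conn.executemany('INSERT INTO records VALUES (?, ?)',
                     ((row[CUSTOMER_ID], DELIM.join(row)) for row in reader))

# Load the 167K IDs that should be yanked out.
with open('ids.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO wanted VALUES (?)',
                     ((line.strip(),) for line in f))

conn.execute('CREATE INDEX idx_records_id ON records (customer_id)')
conn.commit()

# Keep only the records whose ID is NOT in the wanted list, as in the question.
with open('filtered.csv', 'w') as out:
    rows = conn.execute('SELECT line FROM records WHERE customer_id NOT IN '
                        '(SELECT customer_id FROM wanted)')
    for (line,) in rows:
        out.write(line + '\n')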

As for CSV readers, they're fine but maybe not that efficient.
Since you're going to read through both files anyway, I would consider the following:

  1. Read the 13 GB file line by line and save the ID in a dictionary without checking whether the key already exists. (Why? Because checking whether a value exists is slower than just overwriting it, and dictionaries have the added bonus that keys are unique, so duplicates are weeded out.) Or add it to a set(), as described by many others.

    Then read the smaller file line by line and check whether your dict or set contains the ID.

dict() vs set() vs list()

Here's a comparison between the three data types set(), list() and dict():
code used: test.py

(11.719028949737549, 999950, 'Using dict() for inserts')
(1.8462610244750977, 'dict() time spent looking through the data')

(11.793760061264038, 999961, 'Using set() for inserts')
(7.019757986068726, 'set() time spent looking through the data')

(15min+, 'Using list()')  # Honestly, I never let it run to its finish... it's too slow.

As you can see, dict came out somewhat quicker than set in this test, while list() just falls behind completely (see Antti's answer for a description of why). I should point out that my data is a bit skewed, since this is a quick and dirty demo of the speed difference, but the general idea should still come across.
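The test.py behind those numbers isn't included here, but a minimal sketch of that kind of build-plus-lookup comparison could look like the following; the absolute numbers will differ from the ones quoted, and the list case is deliberately cut down so it finishes at all.

import random
import time

ids = [str(random.randint(1, 2**63)) for _ in range(1000000)]
probes = random.sample(ids, 10000)

def time_lookups(label, container):
    start = time.time()
    hits = sum(1 for p in probes if p in container)
    print(label, 'lookups:', round(time.time() - start, 3), 'sec,', hits, 'hits')

start = time.time()
as_dict = {i: None for i in ids}
print('dict build:', round(time.time() - start, 3), 'sec')

start = time.time()
as_set = set(ids)
print('set build:', round(time.time() - start, 3), 'sec')

time_lookups('dict', as_dict)
time_lookups('set', as_set)
time_lookups('list', ids[:10000])  # the full 1-million-entry list would take far too long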

Using dict() as a solution

So if you do not have access to a database version of the source data and you need to use Python, I'd go with something along the lines of:

delimiter = b'\x01'
big_source = {}

# First pass: collect every ID from the big file. The values don't matter;
# the dict is only used for fast membership checks.
with open('13_gig_source.log', 'rb') as fh:
    for line in fh:
        csv = line.split(delimiter)
        big_source[csv[4]] = None  # 5th column, change to match CUSTOMER_ID

# Second pass: write out the IDs from the smaller file that were seen above.
output = open('result.log', 'wb')
with open('smaller_file.log', 'rb') as fh:
    for line in fh:
        csv = line.split(delimiter)
        if csv[4] in big_source:
            output.write(csv[4] + b'\n')
output.close()

Since I didn't know which column the ID is in, I didn't optimize the split().
For instance, if it's the last column you're trying to fetch, do line.rsplit(delimiter, 1)[-1] instead; or if it's the 3rd column, do line.split(delimiter, 3)[2], because split() then stops scanning for further delimiter positions once it has performed the requested number of splits.
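To make that concrete, here is a quick illustration of those two shortcuts (the sample line and column positions are made up for the example):

delimiter = b'\x01'
line = delimiter.join([b'a', b'b', b'c', b'd', b'e', b'f'])

# Last column: split from the right, at most once.
print(line.rsplit(delimiter, 1)[-1])   # b'f'

# 3rd column: stop splitting after the first three delimiters.
print(line.split(delimiter, 3)[2])     # b'c'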

Using Linux tools

Yes, some tools might be better suited for this, such as awk, because it's a specific tool written in C to perform a very specific task. Even though Python is implemented in C, it still has a lot of abstraction layers on top of that C code and will for the most part be slower than the counterpart C tools written for specific tasks.

Now, I have no data to test this with, nor am I a PRO With Nix-Commands (or PWN-C for short). So I'll leave it to someone else to give you examples, but I found this:

  • merge two csv files according to matching rows and add new columns in linux

And it might be helpful.

answered Sep 21 '22 by Torxed