Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python loop optimization

I am currently working with a file with more than 2 million lines. I've separated the lines into lists of elements (ex: [a,b,c,d] = 1 line, words separated).

I'm trying to use the following code to go through all the lines:

for a in aud:
    for esps in final:
        if a[0] in final[esps]:
            a[0] = esps

In the first for loop I'm referring to the 2 million+ lines. In the second for loop it goes through a dictionary with 2010 keys, each key with probably at least 50 corresponding values. I want to find the a[0] element in the lines that are equal to the values in the dictionary. If they match, I change the a[0] element in the selected line to the value of the key of the dictionary.

The problem is that this code takes ages to run and I don't understand much (nothing) about optimization and how to run this much faster. I would thank a lot if anyone could tell me how to do something like this faster.

like image 806
Targaryel Avatar asked May 07 '17 02:05

Targaryel


1 Answers

When you have "big" things to run through, like this, the key to get things going fast is to "reduce algorithmic complexity" - that is, avoid any operations that depend on the size of either data set if possible.

In the example you gave, you perform, for each of your millions of lines a 50 x 2000 linear search - that is a lot! The problem being that if each of your final[esps] is a list, Python performs a linear search in these 50 values - with the operator in.

Since you mention you are reading your values from a file, I have to assume that both a[0] and the elements in the lines of final are strings - but this would also work for numbers.

A first, very simple optimization, is to simply change your final dictionary rows from lists into sets - with a set the match from the in operator changes from being linear to be in constant time (from O(m) to O(1) ) - so, you basically cut your search time by a factor of 50 if before running the code in your example you do:

for key in final:
   final[key] = set(final[key])

But you are still performing a linear search in each of the 2010 keys of final. The way to change that into a constant search is to create a reversed dictionary - in which each of the 50 values in a row of final point to the key esp instead. Then you just use a[0] as the key in this reversed dictionary - and you are replacing a linear search in 100000 items (2000 x 50) for a search in constant time in a dictionary;

That is easy to accomplish - just change your code to:

rfinal = {}
for esp, values in final.items():
   for value in values:
       rfinal[value] = esp


for a in aud:
    if a[0] in rfinal:
       a[0] = rfinal[a[0]]
    else:
       # code for when there is no match for a[0]
       ...
like image 167
jsbueno Avatar answered Oct 05 '22 10:10

jsbueno