I am currently working with a file with more than 2 million lines. I've separated the lines into lists of elements (ex: [a,b,c,d]
= 1 line, words separated).
I'm trying to use the following code to go through all the lines:
for a in aud:
for esps in final:
if a[0] in final[esps]:
a[0] = esps
In the first for loop I'm referring to the 2 million+ lines. In the second for loop it goes through a dictionary with 2010 keys, each key with probably at least 50 corresponding values. I want to find the a[0]
element in the lines that are equal to the values in the dictionary. If they match, I change the a[0]
element in the selected line to the value of the key of the dictionary.
The problem is that this code takes ages to run and I don't understand much (nothing) about optimization and how to run this much faster. I would thank a lot if anyone could tell me how to do something like this faster.
When you have "big" things to run through, like this, the key to get things going fast is to "reduce algorithmic complexity" - that is, avoid any operations that depend on the size of either data set if possible.
In the example you gave, you perform, for each of your millions of lines a 50 x 2000 linear search - that is a lot! The problem being that if each of your final[esps]
is a list, Python performs a linear search in these 50 values - with the operator in
.
Since you mention you are reading your values from a file, I have to assume that both a[0] and the elements in the lines of final
are strings - but this would also work for numbers.
A first, very simple optimization, is to simply change your final
dictionary rows from lists into set
s - with a set
the match from the in
operator changes from being linear to be in constant time (from O(m) to O(1) ) - so, you basically cut your search time by a factor of 50 if before running the code in your example you do:
for key in final:
final[key] = set(final[key])
But you are still performing a linear search in each of the 2010 keys of final
. The way to change that into a constant search is to create a reversed dictionary - in which each of the 50 values in a row of final
point to the key esp
instead. Then you just use a[0] as the key in this reversed dictionary - and you are replacing a linear search in 100000 items (2000 x 50) for a search in constant time in a dictionary;
That is easy to accomplish - just change your code to:
rfinal = {}
for esp, values in final.items():
for value in values:
rfinal[value] = esp
for a in aud:
if a[0] in rfinal:
a[0] = rfinal[a[0]]
else:
# code for when there is no match for a[0]
...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With