def fancymatching(fname1, fname2):
#This function will do much smarter and fancy kinds of compares
if (fname1 == fname2):
return 1
else:
return 0
personlist = [
{
'pid':'1',
'fname':'john',
'mname':'a',
'lname':'smyth',
},{
'pid':'2',
'fname':'john',
'mnane':'a',
'lname':'smith',
},{
'pid':'3',
'fname':'bob',
'mname':'b',
'lname':'nope',
}
]
for person1 in personlist:
for person2 in personlist:
if person1['pid'] >= person2['pid']:
#don't check yourself, or ones that have been
continue
if fancymatching(person1['fname'], person2['fname']):
print (person1['pid'] + " matched " + person2['pid'])
I'm trying to improve on the idea of the above code. It works, but if personlist
becomes very large (say millions) I feel there must be something faster than 2 for loops.
What the code is doing is taking a list of dictionaries and running a fancy fuzzy matching function on the values of each dictionary against each other dictionary. So it's not as simple as just comparing all the dictionaries to the other ones. I'd like a way to run a function on each dictionary, maybe 2 for loops is the right way to do this? Any suggestions would be helpful!
There is no limitation on the chaining of loops. In the nested loop, the number of iterations will be equal to the number of iterations in the outer loop multiplied by the iterations in the inner loop. In each iteration of the outer loop inner loop execute all its iteration.
A faster way to loop in Python is using built-in functions. In our example, we could replace the for loop with the sum function. This function will sum the values inside the range of numbers.
HOWEVER, the conclusion is CONSISTENT: The nested loop is much FASTER. When the iteration time is 100^5, the difference is significant: 321.49 vs 210.05. There is about 1.53-time gap between them.
Effectively, yes, it's slow because it's a nested for loop, because of the size. Python's element in list operation works by just searching the entire list, element by element, for the one it wants.
You can use itertools.combinations
which is essentially the same double loop but it iterates faster because it's written in C (that only reduces the constant factor, you still have the O(n**2)
runtime behaviour) and you don't need the if person1['pid'] >= person2['pid']: continue
anymore (that's built into the combinations
function already).
from itertools import combinations
for person1, person2 in combinations(personlist, 2):
print(person1['fname'], person2['fname'])
which prints:
('john', 'john')
('john', 'bob')
('john', 'bob')
However if your fancymatching
allows it then you could also group (O(n)
runtime) your values. For example in your case you only match identical 'fname'
-values.
>>> matches = {}
>>> for person in personlist:
... matches.setdefault(person['fname'], []).append(person)
>>> matches
{'bob': [{'fname': 'bob', 'lname': 'nope', 'mname': 'b', 'pid': '3'}],
'john': [{'fname': 'john', 'lname': 'smyth', 'mname': 'a', 'pid': '1'},
{'fname': 'john', 'lname': 'smith', 'mnane': 'a', 'pid': '2'}]}
But that's only possible if your fancymatching
allows such a grouping. Which is True for your case but if it's more complicated it might not be.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With