Python - something faster than 2 nested for loops

Tags: performance, python, dictionary, list, for-loop

def fancymatching(fname1, fname2):
    # This function will do much smarter and fancier kinds of comparisons
    if fname1 == fname2:
        return 1
    else:
        return 0

personlist = [
    {'pid': '1', 'fname': 'john', 'mname': 'a', 'lname': 'smyth'},
    {'pid': '2', 'fname': 'john', 'mnane': 'a', 'lname': 'smith'},
    {'pid': '3', 'fname': 'bob', 'mname': 'b', 'lname': 'nope'},
]

for person1 in personlist:
    for person2 in personlist:
        if person1['pid'] >= person2['pid']:
            # don't check yourself, or pairs that have already been compared
            continue
        if fancymatching(person1['fname'], person2['fname']):
            print(person1['pid'] + " matched " + person2['pid'])

I'm trying to improve on the idea of the above code. It works, but if personlist becomes very large (say, millions of entries) I feel there must be something faster than two nested for loops.

The code takes a list of dictionaries and runs a fancy fuzzy matching function on the values of each dictionary against every other dictionary, so it's not as simple as just checking whether the dictionaries are equal. I'd like a way to run a function on each dictionary; maybe two for loops is the right way to do this? Any suggestions would be helpful!

asked Feb 16 '17 20:02

sniperd

1 Answer

You can use itertools.combinations. It is essentially the same double loop, but it iterates faster because it is written in C. That only reduces the constant factor; the runtime is still O(n**2). You also no longer need the if person1['pid'] >= person2['pid']: continue check, because combinations already yields each pair only once.

from itertools import combinations

for person1, person2 in combinations(personlist, 2):
    print(person1['fname'], person2['fname'])

which prints (the tuple formatting is Python 2 output; Python 3 prints the two names separated by a space):

('john', 'john')
('john', 'bob')
('john', 'bob')
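
To reproduce the question's original output with this approach, the pair loop can call fancymatching directly. The following is a minimal sketch, assuming the personlist and fancymatching from the question (trimmed here to the relevant keys); find_matches is a made-up helper name used only for illustration.

```python
from itertools import combinations


def fancymatching(fname1, fname2):
    # Stand-in for the question's matcher: exact equality only.
    return fname1 == fname2


personlist = [
    {'pid': '1', 'fname': 'john', 'lname': 'smyth'},
    {'pid': '2', 'fname': 'john', 'lname': 'smith'},
    {'pid': '3', 'fname': 'bob', 'lname': 'nope'},
]


def find_matches(people):
    # Hypothetical helper: yields the pids of each matching pair exactly once.
    for person1, person2 in combinations(people, 2):
        if fancymatching(person1['fname'], person2['fname']):
            yield person1['pid'], person2['pid']


for pid1, pid2 in find_matches(personlist):
    print(pid1 + " matched " + pid2)
```

Note that combinations(people, 2) still produces n * (n - 1) / 2 pairs, so a list of one million entries would need roughly 5 * 10**11 calls to fancymatching.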

However, if your fancymatching allows it, you could group your values instead, which takes O(n) time. In your case, for example, you only match identical 'fname' values.

>>> matches = {}
>>> for person in personlist:
...     matches.setdefault(person['fname'], []).append(person)
>>> matches
{'bob': [{'fname': 'bob', 'lname': 'nope', 'mname': 'b', 'pid': '3'}],
 'john': [{'fname': 'john', 'lname': 'smyth', 'mname': 'a', 'pid': '1'}, 
          {'fname': 'john', 'lname': 'smith', 'mnane': 'a', 'pid': '2'}]}

But that is only possible if your fancymatching allows such a grouping. That is true for your case, but it might not be for a more complicated matching function.
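
Once the records are grouped, only records inside the same group need to be paired. The sketch below assumes exact first-name matching as in the question. The names group_by_fname and matched_pairs are illustrative, and collections.defaultdict is used here in place of setdefault.

```python
from collections import defaultdict
from itertools import combinations

personlist = [
    {'pid': '1', 'fname': 'john', 'lname': 'smyth'},
    {'pid': '2', 'fname': 'john', 'lname': 'smith'},
    {'pid': '3', 'fname': 'bob', 'lname': 'nope'},
]


def group_by_fname(people):
    # Hypothetical helper: one pass over the list, bucketing by first name.
    groups = defaultdict(list)
    for person in people:
        groups[person['fname']].append(person)
    return groups


def matched_pairs(people):
    # Pair up records only within a bucket; different buckets never match.
    pairs = []
    for group in group_by_fname(people).values():
        for person1, person2 in combinations(group, 2):
            pairs.append((person1['pid'], person2['pid']))
    return pairs


print(matched_pairs(personlist))
```

The grouping pass is linear, but pairing inside a group is still quadratic in that group's size, so this helps most when the groups stay small.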

answered Oct 13 '22 13:10

MSeifert
