Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python 2.6: How can I compare two lists of same object types on one particular field, efficiently?

I have a class called "UserDatabaseRecord". It has a bunch of fields like "username", "expiration_date", and so on.

I have TWO lists of UserDatabaseRecord objects: List A and List B

I want to verify that for all UserDatabaseRecords in List A, the username field does not match any UserDatabaseRecords username field in List B.

I'm able to accomplish this very inefficiently:

for record_a in List_A:
   for record_b in List_B:
      if record_a.username == record_b.username:
         print "Duplicate username: {0}".format(record_a.username)

It works, I guess. I'd just like to make it more efficient and/or "Pythonic".

This question is related, but ultimately I couldn't figure out how to apply it to a List of objects when comparing on just one field: one-liner to check if at least one item in list exists in another list?

like image 827
CptSupermrkt Avatar asked Dec 11 '22 08:12

CptSupermrkt


1 Answers

The problem with this is that, for each element in list A, you're checking all of the elements in list B. So, if the lengths of your lists are N and M, that's N*M comparisons.

If you make a set of the usernames from list B, then you can just use the in operator on it—which is not only simpler, it's instantaneous, instead of having to check all of the values one by one. So, you only need N lookups instead of N*M.

So:

b_names = {record.username for record in List_B}
for record_a in List_A:
    if record_a.username in b_names:
        print "Duplicate username: {0}".format(record_a.username)

Or, even simpler, use set intersection:

a_names = {record.username for record in List_A}
b_names = {record.username for record in List_B}
for name in a_names & b_names:
    print "Duplicate username: {0}".format(name)

And really, you don't need both of them to be sets, you can make one a set and the other just an iterator, using a generator expression:

a_names = {record.username for record in List_A}
b_names = (record.username for record in List_B)
for name in a_names.intersection(b_names):
    print "Duplicate username: {0}".format(name)

One of these may be a little faster than the others, but they'll all be in the same ballpark—and, more importantly, they're all linear instead of quadratic. So, I'd suggest using whichever one makes the most sense to you.

If you just need to know whether there are duplicates rather than get a list of them, or just need to get one of the duplicates arbitrarily rather than all of them, you can speed it up by "short-circuiting" early—e.g., adding a break after the print in the first one, or using isdisjoint instead of intersection in the last one.

like image 193
abarnert Avatar answered Apr 08 '23 03:04

abarnert