
Sort a list of dictionaries while consolidating duplicates in Python?

So I have a list of dictionaries like so:

data = [ { 
           'Organization' : '123 Solar',
           'Phone' : '444-444-4444',
           'Email' : '',
           'website' : 'www.123solar.com'
         }, {
           'Organization' : '123 Solar',
           'Phone' : '',
           'Email' : '[email protected]',
           'Website' : 'www.123solar.com'
         }, {
           etc...
         } ]

Of course, this is not the exact data, but hopefully the example illustrates my problem: I have many records with the same "Organization" name, yet no single record has the complete information for that organization.

Is there an efficient method for searching over the list, sorting it by the dictionaries' "Organization" entry, and finally merging the data from duplicates to create one unique entry per organization? (Keep in mind these dictionaries are quite large.)

asked Aug 27 '13 by Jacob Bridges


2 Answers

You can make use of itertools.groupby:

from itertools import groupby
from operator import itemgetter
from pprint import pprint

data = [ {
           'Organization' : '123 Solar',
           'Phone' : '444-444-4444',
           'Email' : '',
           'website' : 'www.123solar.com'
         }, {
           'Organization' : '123 Solar',
           'Phone' : '',
           'Email' : '[email protected]',
           'Website' : 'www.123solar.com'
         },
         {
           'Organization' : '234 test',
           'Phone' : '111',
           'Email' : '[email protected]',
           'Website' : 'b.123solar.com'
         },
         {
           'Organization' : '234 test',
           'Phone' : '222',
           'Email' : '[email protected]',
           'Website' : 'bd.123solar.com'
         }]


data = sorted(data, key=itemgetter('Organization'))
result = {}
for key, group in groupby(data, key=itemgetter('Organization')):
    result[key] = list(group)

pprint(result)

prints:

{'123 Solar': [{'Email': '',
                'Organization': '123 Solar',
                'Phone': '444-444-4444',
                'website': 'www.123solar.com'},
               {'Email': '[email protected]',
                'Organization': '123 Solar',
                'Phone': '',
                'Website': 'www.123solar.com'}],
 '234 test': [{'Email': '[email protected]',
               'Organization': '234 test',
               'Phone': '111',
               'Website': 'b.123solar.com'},
              {'Email': '[email protected]',
               'Organization': '234 test',
               'Phone': '222',
               'Website': 'bd.123solar.com'}]}

UPD:

Here's what you can do to group the items into a single dict:

for key, group in groupby(data, key=itemgetter('Organization')):
    result[key] = {'Phone': [],
                   'Email': [],
                   'Website': []}
    for item in group:
        result[key]['Phone'].append(item['Phone'])
        result[key]['Email'].append(item['Email'])
        # The sample data mixes 'Website' and 'website' keys,
        # so fall back to the lowercase spelling if needed:
        result[key]['Website'].append(item.get('Website', item.get('website', '')))

then `result` will contain:

{'123 Solar': {'Email': ['', '[email protected]'],
               'Phone': ['444-444-4444', ''],
               'Website': ['www.123solar.com', 'www.123solar.com']},
 '234 test': {'Email': ['[email protected]', '[email protected]'],
              'Phone': ['111', '222'],
              'Website': ['b.123solar.com', 'bd.123solar.com']}}
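To collapse those per-field lists into a single consolidated value (for example, the first non-empty entry), one more pass could look like this. This is a sketch assuming the grouped `result` shape shown above:

```python
# Collapse each field's list of values to the first non-empty one.
result = {'123 Solar': {'Email': ['', '[email protected]'],
                        'Phone': ['444-444-4444', ''],
                        'Website': ['www.123solar.com', 'www.123solar.com']}}

merged = {}
for org, fields in result.items():
    # next() returns the first truthy value, or '' if every value is empty.
    merged[org] = {field: next((v for v in values if v), '')
                   for field, values in fields.items()}

print(merged['123 Solar']['Phone'])  # 444-444-4444
```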
answered Nov 13 '22 by alecxe


Is there an efficient method for searching over the list, sorting the list based on the dictionary's first entry, and finally merging the data from duplicates to create a unique entry?

Yes, but there's an even more efficient method without searching and sorting. Just build up a dictionary as you go along:

datadict = {}
for thingy in data:
    organization = thingy['Organization']
    datadict[organization] = merge(thingy, datadict.get(organization, {}))

Now you're making a single linear pass over the data, doing a constant-time lookup for each record. So it's better than any sort-based solution by a factor of O(log N). It's also one pass instead of multiple passes, and it will probably have lower constant overhead besides.


It's not clear exactly how you want to merge the entries, so no one can write that code for you without knowing the rules you want to apply. But here's a simple example:

def merge(d1, d2):
    for key, value in d2.items():
        if not d1.get(key):
            d1[key] = value
    return d1

In other words, for each item in d2, if d1 already has a truthy value (like a non-empty string), leave it alone; otherwise, add it.
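Putting the loop and `merge` together on a trimmed version of the question's sample records (a sketch; note that under this merge the mixed-case 'website'/'Website' keys from the question would be treated as two separate fields):

```python
def merge(d1, d2):
    # Keep d1's truthy values; fill gaps from d2.
    for key, value in d2.items():
        if not d1.get(key):
            d1[key] = value
    return d1

data = [
    {'Organization': '123 Solar', 'Phone': '444-444-4444', 'Email': ''},
    {'Organization': '123 Solar', 'Phone': '', 'Email': '[email protected]'},
]

datadict = {}
for thingy in data:
    organization = thingy['Organization']
    datadict[organization] = merge(thingy, datadict.get(organization, {}))

print(datadict['123 Solar'])
# {'Organization': '123 Solar', 'Phone': '444-444-4444', 'Email': '[email protected]'}
```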

answered Nov 13 '22 by abarnert