
Join two defaultdicts in Python

I parsed a huge database of bibliographic records (about 20 million records). Each record has a unique ID field, a set of authors, and a set of terms/keywords that describe the main content of the record. For example, a typical bibliographic record looks like:

ID: 001
Author: author1
Author: author2
Term: term1
Term: term2

First, I create two defaultdicts to store authors and terms:

from collections import defaultdict

d1 = defaultdict(lambda : defaultdict(list))
d2 = defaultdict(lambda : defaultdict(list))

Next, I populate authors:

d1['id001'] = ['author1', 'author2'] 
d1['id002'] = ['author3'] 
d1['id003'] = ['author1', 'author4'] 

and keywords:

d2['id001'] = ['term1', 'term2']  
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

The problem is how to join these two dictionaries to obtain a data object that links authors to terms directly:

author1|term1,term2,term4
author2|term1,term2
author3|term2,term3
author4|term4

I have two questions:

  • Is the proposed approach appropriate, or should I store/represent the data in some other way?
  • Could you please roughly suggest how to join both dictionaries?
Andrej asked Aug 31 '25 21:08



2 Answers

This is one way. Note that, as demonstrated below, you do not need nested dictionaries or a defaultdict for your initial step; plain dicts mapping each ID to a list are enough.

from collections import defaultdict

d1 = {}
d2 = {}

d1['id001'] = ['author1', 'author2'] 
d1['id002'] = ['author3'] 
d1['id003'] = ['author1', 'author4'] 

d2['id001'] = ['term1', 'term2']  
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

res = defaultdict(list)

for ids in set(d1) & set(d2):
    for v in d1[ids]:
        res[v].extend(d2[ids])

res = {k: sorted(v) for k, v in res.items()}

# {'author1': ['term1', 'term2', 'term4'],
#  'author2': ['term1', 'term2'],
#  'author3': ['term2', 'term3'],
#  'author4': ['term4']}
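
One caveat with extending lists: if an author appears in several records that share a term, that term would be repeated in the result. A minimal sketch of a variant that accumulates terms in sets instead, and then prints the `author|term,term` layout from the question, might look like this. The name `terms_by_author` is only illustrative, and the sample data is the same as above.

```python
from collections import defaultdict

d1 = {
    'id001': ['author1', 'author2'],
    'id002': ['author3'],
    'id003': ['author1', 'author4'],
}
d2 = {
    'id001': ['term1', 'term2'],
    'id002': ['term2', 'term3'],
    'id003': ['term4'],
}

# Collect terms in a set so that a term shared by several of an
# author's records is stored only once.
terms_by_author = defaultdict(set)
for record_id in d1.keys() & d2.keys():  # ids present in both dicts
    for author in d1[record_id]:
        terms_by_author[author].update(d2[record_id])

# Render in the "author|term,term" layout shown in the question.
lines = [
    f"{author}|{','.join(sorted(terms))}"
    for author, terms in sorted(terms_by_author.items())
]
print("\n".join(lines))
```

Sorting happens only at output time here, so the sets stay cheap to update while the data is being joined.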
jpp answered Sep 03 '25 09:09



The key to this kind of problem is to build temporary dictionaries that are "properly oriented" from the existing ones. Once that is done, the logic is much clearer (and the complexity is good thanks to constant-time dict lookups).

Here's my solution:

First create a dict author => ids from d1.

Then create the result (a dict author => terms). Loop over the author => ids dict just created and populate the result with the flattened values of d2.

import collections

d1 = dict()
d2 = dict()

d1['id001'] = ['author1', 'author2']
d1['id002'] = ['author3']
d1['id003'] = ['author1', 'author4']

d2['id001'] = ['term1', 'term2']
d2['id002'] = ['term2', 'term3']
d2['id003'] = ['term4']

# Step 1: invert d1 into author => list of record ids
authors_id = collections.defaultdict(list)
for record_id, authors in d1.items():
    for author in authors:
        authors_id[author].append(record_id)

print(dict(authors_id)) # convert to dict for clearer printing

# Step 2: for each author, flatten the terms of all of that author's records
authors_term = collections.defaultdict(list)
for author, record_ids in authors_id.items():
    for record_id in record_ids:
        authors_term[author].extend(d2[record_id])

print(dict(authors_term)) # convert to dict for clearer printing

result:

{'author4': ['id003'], 'author3': ['id002'], 'author1': ['id001', 'id003'], 'author2': ['id001']}
{'author3': ['term2', 'term3'], 'author4': ['term4'], 'author1': ['term1', 'term2', 'term4'], 'author2': ['term1', 'term2']}
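
With around 20 million records it may also be worth skipping the intermediate dictionaries and building the author => terms mapping directly while parsing. The sketch below assumes the parser can yield each record as a `(record_id, authors, terms)` tuple. That tuple shape, the `records` list, and the `link_authors_to_terms` helper are all hypothetical and not part of the question.

```python
from collections import defaultdict

# Hypothetical parsed records: (record_id, authors, terms).
records = [
    ('id001', ['author1', 'author2'], ['term1', 'term2']),
    ('id002', ['author3'], ['term2', 'term3']),
    ('id003', ['author1', 'author4'], ['term4']),
]

def link_authors_to_terms(records):
    """Return {author: set of terms}, built in one pass over the records."""
    terms_by_author = defaultdict(set)
    for _record_id, authors, terms in records:
        for author in authors:
            # A set keeps each term once per author.
            terms_by_author[author].update(terms)
    return dict(terms_by_author)

authors_term = link_authors_to_terms(records)
print({a: sorted(t) for a, t in sorted(authors_term.items())})
```

Because each record is consumed once and then discarded, this keeps only the final mapping in memory, at the cost of not being able to look up authors or terms by record ID afterwards.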
Jean-François Fabre answered Sep 03 '25 11:09
