Compare values of a dictionary and return a count of matching values

I have a dictionary that maps product names to the unique emails of the customers who purchased each item. It looks like this:

customer_emails = {
'Backpack':['[email protected]','[email protected]','[email protected]','[email protected]'], 
'Baseball Bat':['[email protected]','[email protected]','[email protected]'],
'Gloves':['[email protected]','[email protected]','[email protected]']}

I am trying to iterate over the values of each key and count how many of its emails also appear under each of the other keys. I converted this dictionary to a DataFrame and got the answer I wanted for a single pair of columns using something like this:

customers[customers['Baseball Bat'].notna()]['Baseball Bat'].isin(customers['Gloves']).sum()

What I'm trying to accomplish is to create a DataFrame that essentially looks like this so that I can easily use it for correlation charts.

             Backpack  Baseball Bat    Gloves
Backpack            4             2         3
Baseball Bat        2             3         1 
Gloves              3             1         3

I'm thinking the way to do it is to iterate over the customer_emails dictionary but I'm not sure how you would pick out a single key to compare its values to all others and so on, then store it.

Trevor Theodore asked May 14 '18 17:05


2 Answers

Start with pd.DataFrame.from_dict:

import pandas as pd
df = pd.DataFrame.from_dict(customer_emails, orient='index').T

df
              Backpack         Baseball Bat               Gloves
0  [email protected]  [email protected]  [email protected]
1  [email protected]  [email protected]  [email protected]
2  [email protected]  [email protected]    [email protected]
3    [email protected]                 None                 None
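
The orient='index' plus transpose step matters because the lists have different lengths; passing such a dictionary straight to pd.DataFrame raises a ValueError. A minimal sketch of the difference, using made-up example.com addresses as stand-ins for the redacted ones in the question (the names direct_ok and the addresses are assumptions for illustration only):

```python
import pandas as pd

# Placeholder addresses (assumed for illustration; the originals are redacted).
customer_emails = {
    'Backpack': ['a@example.com', 'b@example.com', 'c@example.com', 'd@example.com'],
    'Baseball Bat': ['a@example.com', 'b@example.com', 'e@example.com'],
    'Gloves': ['a@example.com', 'c@example.com', 'd@example.com'],
}

# Lists of unequal length cannot be used directly as columns.
try:
    pd.DataFrame(customer_emails)
    direct_ok = True
except ValueError:
    direct_ok = False

# Building one row per product pads the shorter lists with missing values;
# transposing then gives one column per product.
df = pd.DataFrame.from_dict(customer_emails, orient='index').T
```

The padded cells are missing values, which is why the later steps have to ignore them.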

Now, use stack + get_dummies + groupby + sum + dot:

# v has one row per product and one 0/1 column per email
v = df.stack().str.get_dummies().groupby(level=1).sum()
v.dot(v.T)

              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3

Alternatively, replace stack with melt for some added performance:

v = (df.melt()
       .set_index('variable')['value']
       .str.get_dummies()
       .groupby(level=0).sum()
)
v.dot(v.T)

variable      Backpack  Baseball Bat  Gloves
variable                                    
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
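
Why the dot product gives the overlap counts: v is a product-by-email indicator matrix, so entry (i, j) of v.dot(v.T) adds up a 1 for every email that appears under both product i and product j. Here is a rough sketch of the same idea that builds the indicator matrix with pd.crosstab instead of stack + get_dummies. The example.com addresses and the names pairs, long_form, indicator and overlap are placeholders introduced for illustration, and it assumes each email appears at most once per product, as stated in the question.

```python
import pandas as pd

# Placeholder addresses (assumed for illustration; the originals are redacted).
customer_emails = {
    'Backpack': ['a@example.com', 'b@example.com', 'c@example.com', 'd@example.com'],
    'Baseball Bat': ['a@example.com', 'b@example.com', 'e@example.com'],
    'Gloves': ['a@example.com', 'c@example.com', 'd@example.com'],
}

# Long form: one (product, email) row per purchase.
pairs = [(product, email)
         for product, emails in customer_emails.items()
         for email in emails]
long_form = pd.DataFrame(pairs, columns=['product', 'email'])

# Indicator matrix: rows are products, columns are emails.
# Entries are 0 or 1 because emails are unique within a product.
indicator = pd.crosstab(long_form['product'], long_form['email'])

# Entry (i, j) is the number of emails shared by products i and j;
# the diagonal holds each product's own customer count.
overlap = indicator.dot(indicator.T)
```

With these placeholder addresses, overlap reproduces the 3x3 table shown above.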
cs95 answered Oct 15 '22 11:10


You can first compute, for every pair of products, how many emails they share, then pass the resulting nested dictionary to pd.DataFrame:

import pandas as pd
emails = {'Baseball Bat': ['[email protected]', '[email protected]', '[email protected]'], 'Backpack': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'], 'Gloves': ['[email protected]', '[email protected]', '[email protected]']}
results = {a:{c:sum(h in j for h in b) for c, j in emails.items()} for a, b in emails.items()}
df = pd.DataFrame(results)

Output:

               Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
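
A possible variation on the same idea, in case the lists get long: `h in j` scans a list each time, whereas converting each list to a set once lets the shared count be read off a set intersection. This is only a sketch with made-up example.com addresses; email_sets and overlap are names introduced here, and it assumes emails are unique within each product (sets would silently drop duplicates).

```python
# Placeholder addresses (assumed for illustration; the originals are redacted).
customer_emails = {
    'Backpack': ['a@example.com', 'b@example.com', 'c@example.com', 'd@example.com'],
    'Baseball Bat': ['a@example.com', 'b@example.com', 'e@example.com'],
    'Gloves': ['a@example.com', 'c@example.com', 'd@example.com'],
}

# Convert each list to a set once so membership tests are cheap.
email_sets = {product: set(emails) for product, emails in customer_emails.items()}

# For every ordered pair of products, count the emails they have in common.
overlap = {
    a: {b: len(set_a & set_b) for b, set_b in email_sets.items()}
    for a, set_a in email_sets.items()
}
```

The nested dictionary can be passed to pd.DataFrame exactly as in the snippet above.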
Ajax1234 answered Oct 15 '22 10:10