I have a dictionary that maps product names to the unique emails of customers who purchased each item. It looks like this:
customer_emails = {
'Backpack':['[email protected]','[email protected]','[email protected]','[email protected]'],
'Baseball Bat':['[email protected]','[email protected]','[email protected]'],
'Gloves':['[email protected]','[email protected]','[email protected]']}
I am trying to iterate over the values of each key and determine how many emails also appear under each of the other keys. I converted this dictionary to a DataFrame and got the answer I wanted for a single-column comparison using something like this:
customers[customers['Baseball Bat'].notna() == True]['Baseball Bat'].isin(customers['Gloves']).sum()
What I'm trying to accomplish is to create a DataFrame that essentially looks like this so that I can easily use it for correlation charts.
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
I'm thinking the way to do it is to iterate over the customer_emails dictionary, but I'm not sure how to pick out a single key, compare its values to all the others, and then store the result.
Start with pd.DataFrame.from_dict:
import pandas as pd

# Each product becomes a column; shorter lists are padded with None
df = pd.DataFrame.from_dict(customer_emails, orient='index').T
df
            Backpack       Baseball Bat             Gloves
0  [email protected]  [email protected]  [email protected]
1  [email protected]  [email protected]  [email protected]
2  [email protected]  [email protected]  [email protected]
3  [email protected]               None               None
Now, use stack + get_dummies + sum + dot:
# sum(level=1) was removed in pandas 2.0; groupby(level=1).sum() is the equivalent
v = df.stack().str.get_dummies().groupby(level=1).sum()
v.dot(v.T)
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
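The idea behind get_dummies + dot is that each product gets a 0/1 row marking which emails bought it, and the dot product of two such rows counts the emails they share. The sketch below illustrates this with pd.crosstab in place of stack + get_dummies. The example.com addresses are hypothetical placeholders chosen to match the counts in the question, not the original data.

```python
import pandas as pd

# Hypothetical placeholder addresses (assumption: not the data from the post).
customer_emails = {
    "Backpack": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "Baseball Bat": ["a@example.com", "d@example.com", "e@example.com"],
    "Gloves": ["a@example.com", "b@example.com", "c@example.com"],
}

# Long format: one row per (product, email) pair.
pairs = pd.DataFrame(
    [(product, email)
     for product, emails in customer_emails.items()
     for email in emails],
    columns=["product", "email"],
)

# Indicator matrix: rows are products, columns are emails, 1 where purchased.
indicator = pd.crosstab(pairs["product"], pairs["email"])

# Each cell of the product is the number of emails two products have in common;
# the diagonal is the number of emails per product.
overlap = indicator.dot(indicator.T)
print(overlap)
```

With these placeholder addresses, the diagonal holds 4, 3 and 3, and the off-diagonal cells hold the pairwise overlaps.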
Alternatively, switch stack with melt for some added performance.
v = (df.melt()
.set_index('variable')['value']
.str.get_dummies()
.groupby(level=0).sum()  # replaces sum(level=0), removed in pandas 2.0
)
v.dot(v.T)
variable      Backpack  Baseball Bat  Gloves
variable
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
You can first count, for each pair of products, how many emails they share, then pass the resulting nested dictionary to pd.DataFrame:
import pandas as pd
emails = {'Baseball Bat': ['[email protected]', '[email protected]', '[email protected]'], 'Backpack': ['[email protected]', '[email protected]', '[email protected]', '[email protected]'], 'Gloves': ['[email protected]', '[email protected]', '[email protected]']}
# For each product a, count how many of its emails also appear under each product c
results = {a:{c:sum(h in j for h in b) for c, j in emails.items()} for a, b in emails.items()}
df = pd.DataFrame(results)
Output:
              Backpack  Baseball Bat  Gloves
Backpack             4             2       3
Baseball Bat         2             3       1
Gloves               3             1       3
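Since the emails under each product are unique, the same counts can also be obtained with set intersections, which avoids scanning a list for every membership test. This is a minimal sketch of that variation, not the answer's original code. The example.com addresses are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical placeholder addresses (assumption: not the data from the post).
emails = {
    "Backpack": ["a@example.com", "b@example.com", "c@example.com", "d@example.com"],
    "Baseball Bat": ["a@example.com", "d@example.com", "e@example.com"],
    "Gloves": ["a@example.com", "b@example.com", "c@example.com"],
}

# Convert each list to a set once so intersections are cheap.
email_sets = {product: set(addresses) for product, addresses in emails.items()}

# Outer keys become columns, inner keys become the index; the size of each
# intersection is the number of shared emails for that product pair.
results = {
    a: {b: len(set_a & set_b) for b, set_b in email_sets.items()}
    for a, set_a in email_sets.items()
}
overlap_df = pd.DataFrame(results)
print(overlap_df)
```

The result is symmetric, so it does not matter which product ends up on the rows and which on the columns.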