I have a data set with two columns. The first column contains unique user IDs and the second column contains attributes connected to these IDs.
For example:
------------------------
User ID Attribute
------------------------
1234 blond
1235 brunette
1236 blond
1234 tall
1235 tall
1236 short
------------------------
What I want to know is the correlation between attributes. In the example above, I want to know how many times a blond is also tall. My desired output is:
------------------------------
Attr 1 Attr 2 Overlap
------------------------------
blond tall 1
blond short 1
brunette tall 1
brunette short 0
------------------------------
I tried using pandas to pivot the data and get the output, but as my data set has hundreds of attributes, my current attempt is not feasible.
import pandas
df = pandas.read_csv('myfile.csv')
df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)
My current output:
--------------------------------
Blond Brunette Short Tall
--------------------------------
0 1 0 1
1 0 0 1
1 0 1 0
--------------------------------
Is there a way to get the output I want? Thanks in advance.
You could use itertools.product to generate every possible pair of attributes, and then count the IDs that carry both attributes of each pair:
import pandas as pd
from itertools import product
# 1) creating the pandas DataFrame
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]
df = pd.DataFrame(df)
df.columns = ["id", "attribute"]
# 2) creating all the possible attribute pairs
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 != attribut2:
        # 3) selecting the ids of the rows that carry each attribute
        df1 = df[df.attribute == attribut1]["id"]
        df2 = df[df.attribute == attribut2]["id"]
        # 4) counting the ids that match both attributes
        intersection = len(set(df1).intersection(set(df2)))
        if intersection:
            # 5) displaying the number of matches
            print(attribut1, attribut2, intersection)
giving (the order of the lines may vary, because sets are unordered):
tall brunette 1
tall blond 1
brunette tall 1
blond tall 1
blond short 1
short blond 1
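The loop above filters the DataFrame again for every pair and reports each pair in both orders. As a rough sketch of one way around that (not part of the original answer), the ID set of each attribute can be built once and the pairs generated with itertools.combinations, which yields each unordered pair a single time. The names ids_by_attribute and overlaps are mine, chosen for illustration.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

# Build the set of ids for each attribute once, instead of once per pair
# (ids_by_attribute is an illustrative name, not from the original answer)
ids_by_attribute = {
    attr: set(ids) for attr, ids in df.groupby("attribute")["id"]
}

# combinations() yields each unordered pair exactly once,
# so "blond tall" is not repeated as "tall blond"
overlaps = {}
for attr1, attr2 in combinations(sorted(ids_by_attribute), 2):
    shared = ids_by_attribute[attr1] & ids_by_attribute[attr2]
    if shared:
        overlaps[(attr1, attr2)] = len(shared)

for (attr1, attr2), count in overlaps.items():
    print(attr1, attr2, count)
```

With hundreds of attributes this still visits every pair, but each visit is only a set intersection rather than two boolean filters over the whole table.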
EDIT
It is then easy to refine this to get your desired output:
import pandas as pd
from itertools import product
# 1) creating the pandas DataFrame
df = [["1234", "blond"],
      ["1235", "brunette"],
      ["1236", "blond"],
      ["1234", "tall"],
      ["1235", "tall"],
      ["1236", "short"]]
df = pd.DataFrame(df)
df.columns = ["id", "attribute"]
# attributes allowed in the first column of the output
wanted_attribute_1 = ["blond", "brunette"]
# 2) creating all the possible attribute pairs
attributs = set(df.attribute)
for attribut1, attribut2 in product(attributs, attributs):
    if attribut1 in wanted_attribute_1 and attribut2 not in wanted_attribute_1:
        if attribut1 != attribut2:
            # 3) selecting the ids of the rows that carry each attribute
            df1 = df[df.attribute == attribut1]["id"]
            df2 = df[df.attribute == attribut2]["id"]
            # 4) counting the ids that match both attributes
            intersection = len(set(df1).intersection(set(df2)))
            # 5) displaying the number of matches (zero counts included)
            print(attribut1, attribut2, intersection)
giving (again, the order of the lines may vary):
brunette tall 1
brunette short 0
blond tall 1
blond short 1
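If you want the result as a table rather than printed lines, the same idea can be wrapped in a small function. This is only a sketch under my own assumptions: the function name overlap_table and its parameters group_1 and group_2 are hypothetical, and it assumes you can list the attributes of each side yourself (hair colour on one side, height on the other in this example).

```python
from itertools import product

import pandas as pd


def overlap_table(df, group_1, group_2):
    """Count the ids shared by each (group_1 attribute, group_2 attribute) pair.

    Hypothetical helper: expects columns named "id" and "attribute".
    """
    # One set of ids per attribute, built a single time
    ids_by_attribute = {
        attr: set(ids) for attr, ids in df.groupby("attribute")["id"]
    }
    rows = []
    for attr1, attr2 in product(group_1, group_2):
        # Attributes missing from the data simply get an empty set
        shared = ids_by_attribute.get(attr1, set()) & ids_by_attribute.get(attr2, set())
        rows.append((attr1, attr2, len(shared)))
    return pd.DataFrame(rows, columns=["Attr 1", "Attr 2", "Overlap"])


df = pd.DataFrame(
    [["1234", "blond"], ["1235", "brunette"], ["1236", "blond"],
     ["1234", "tall"], ["1235", "tall"], ["1236", "short"]],
    columns=["id", "attribute"],
)

table = overlap_table(df, ["blond", "brunette"], ["tall", "short"])
print(table.to_string(index=False))
```

Because the pairs come from the two lists you pass in, the rows come out in a fixed order and pairs with no overlap are kept with a count of 0, as in the desired output of the question.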
From your pivoted table, you can compute the cross product of the table's transpose with the table itself, which counts for every pair of attributes the users that have both, and then convert the upper triangle of the result to long format:
import pandas as pd
import numpy as np
mat = df.pivot_table(index='User ID', columns='Attribute', aggfunc=len, fill_value=0)
tprod = mat.T.dot(mat) # calculate the tcrossprod here
# extract the upper triangular part
result = tprod.where((np.triu(np.ones(tprod.shape, bool), 1)), np.nan).stack().rename('value')
result.index.names = ['Attr1', 'Attr2']
result.reset_index().sort_values('value', ascending = False)
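For anyone who wants to try this on the sample data from the question, here is a self-contained sketch of the same matrix idea. Two things are my own substitutions rather than the answer's code: the 0/1 table is built with pd.crosstab instead of pivot_table, and the upper triangle is read off with np.triu_indices instead of where plus stack. The clip(upper=1) call is there on the assumption that a user/attribute row might appear more than once in real data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "User ID": [1234, 1235, 1236, 1234, 1235, 1236],
    "Attribute": ["blond", "brunette", "blond", "tall", "tall", "short"],
})

# One row per user, one 0/1 column per attribute
# (clip guards against duplicated user/attribute rows; an assumption)
mat = pd.crosstab(df["User ID"], df["Attribute"]).clip(upper=1)

# Entry (a, b) is the number of users that have both attribute a and b
tprod = mat.T.dot(mat)

# Positions strictly above the diagonal, so each pair appears once
rows, cols = np.triu_indices(len(tprod), k=1)
result = pd.DataFrame({
    "Attr1": tprod.index.to_numpy()[rows],
    "Attr2": tprod.columns.to_numpy()[cols],
    "value": tprod.to_numpy()[rows, cols],
})
result = result.sort_values("value", ascending=False, kind="mergesort")
result = result.reset_index(drop=True)
print(result)
```

Note that this produces every pair of attributes, including pairs such as blond/brunette that you may not care about, so you would still filter the rows down to the combinations you want.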
