I've looked around and, surprisingly, haven't found a framework or existing code that makes it easy to calculate Pointwise Mutual Information (Wiki PMI), even though libraries like Scikit-learn offer a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas!
My problem:
I have a DataFrame with one [x, y] example per row and wish to calculate a PMI value for each row as per the formula (or a simpler one):
PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )
So far my approach is:
def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )
Would this give a valid and/or efficient computation?
Sample I/O:
x  y  PMI
0  0  0.176
0  0  0.176
0  1  0
I would add three bits.
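import numpy as np

def pmi(dff, x, y):
    # Work on a copy so the caller's DataFrame is not modified.
    df = dff.copy()
    # Marginal and joint counts, broadcast back to every row.
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    # PMI = log(N * f_xy / (f_x * f_y)), where N is the number of rows.
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df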
1. As the code shows, the function works on a copy of the input (df = dff.copy()) and returns the result.
2. df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') should be used so that only the count is returned.
3. np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) should be used so that probabilities, rather than raw counts, go into the formula.
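The factor len(df.index) comes from estimating each probability as a relative frequency over N rows: p(x,y) = f_xy / N, p(x) = f_x / N and p(y) = f_y / N, so p(x,y) / (p(x) * p(y)) = N * f_xy / (f_x * f_y). A minimal sketch that checks this on a toy DataFrame follows. The function name pmi_from_counts, the column names and the data are all made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd


def pmi_from_counts(frame, x, y):
    """Return a copy of `frame` with a natural-log PMI value per row."""
    out = frame.copy()
    n = len(out.index)  # total number of observations
    f_x = out.groupby(x)[x].transform('count')        # count of each x value
    f_y = out.groupby(y)[y].transform('count')        # count of each y value
    f_xy = out.groupby([x, y])[x].transform('count')  # count of each (x, y) pair
    out['pmi'] = np.log(n * f_xy / (f_x * f_y))
    return out


# Toy data, invented for this illustration only.
toy = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [0, 0, 1, 0]})
result = pmi_from_counts(toy, 'x', 'y')

# Cross-check the first row (x=0, y=0) against explicit probabilities.
p_xy = ((toy['x'] == 0) & (toy['y'] == 0)).mean()
p_x = (toy['x'] == 0).mean()
p_y = (toy['y'] == 0).mean()
expected = np.log(p_xy / (p_x * p_y))

print(result)
print(np.isclose(result['pmi'].iloc[0], expected))
```

Note that np.log is the natural logarithm; np.log2 or np.log10 can be substituted if PMI in bits or in base 10 is wanted.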