Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?

Question

I've looked around and, surprisingly, haven't found a framework or existing code that makes it easy to calculate Pointwise Mutual Information (Wiki PMI), even though libraries like Scikit-learn offer a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas.

My problem:

I have a DataFrame with one [x, y] example per row and wish to calculate a PMI value for each row, as per the formula (or a simpler one):

PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )

So far my approach is:

def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )

Would this give a valid and/or efficient computation?

Sample I/O:

x  y  PMI
0  0  0.176
0  0  0.176
0  1  0

Zero · Accepted Answer

I would make three changes.

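import numpy as np

def pmi(dff, x, y):
    df = dff.copy()  # work on a copy so the caller's DataFrame is not modified
    # Per-row counts of each x value, each y value, and each (x, y) pair
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    # p(x,y) / (p(x) * p(y)) equals N * f_xy / (f_x * f_y), with N = number of rows
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df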

Use df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') so that only a single column of counts is returned.
Use np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) so that the ratio is computed from probabilities rather than raw counts.
Work on a copy of the DataFrame rather than modifying the input DataFrame.
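
The second point follows from estimating each probability as a relative frequency over N rows: p(x) = f_x / N, p(y) = f_y / N and p(x,y) = f_xy / N, so p(x,y) / (p(x) * p(y)) = N * f_xy / (f_x * f_y). As a possible variation on the same idea, and not part of the original answer, the sketch below computes PMI once per distinct (x, y) pair instead of once per row, which may be cheaper when pairs repeat often. The function name pmi_per_pair and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def pmi_per_pair(df, x, y):
    """Return one row per distinct (x, y) pair with its PMI (natural log)."""
    n = len(df)
    # Joint counts: one row per distinct (x, y) pair.
    pairs = df.groupby([x, y]).size().reset_index(name="f_xy")
    # Marginal counts, looked up for each pair.
    pairs["f_x"] = pairs[x].map(df[x].value_counts())
    pairs["f_y"] = pairs[y].map(df[y].value_counts())
    # p(x,y) / (p(x) * p(y)) == n * f_xy / (f_x * f_y)
    pairs["pmi"] = np.log(n * pairs["f_xy"] / (pairs["f_x"] * pairs["f_y"]))
    return pairs

# Hypothetical toy data, not taken from the question.
toy = pd.DataFrame({"x": [0, 0, 1, 1], "y": [0, 0, 0, 1]})
result = pmi_per_pair(toy, "x", "y")
print(result)
```

If per-row values are still needed, the per-pair result can be joined back with something like toy.merge(result[["x", "y", "pmi"]], on=["x", "y"]). Note that np.log is the natural logarithm; np.log2 or np.log10 could be substituted if a different base is wanted.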

Tags: python, pandas, dataframe, entropy

jfive
