
Can Pandas DataFrame efficiently calculate PMI (Pointwise Mutual Information)?

I've looked around and, surprisingly, haven't found an easy-to-use framework or existing code for calculating Pointwise Mutual Information (Wiki PMI), even though libraries like Scikit-learn offer a metric for overall Mutual Information (by histogram). This is in the context of Python and Pandas.

My problem:

I have a DataFrame with one [x, y] example in each row and wish to calculate a series of PMI values as per the formula (or a simpler one):

PMI(x, y) = log( p(x,y) / (p(x) * p(y)) )

So far my approach is:

def pmi_func(df, x, y):
    df['freq_x'] = df.groupby(x).transform('count')
    df['freq_y'] = df.groupby(y).transform('count')
    df['freq_x_y'] = df.groupby([x, y]).transform('count')
    df['pmi'] = np.log( df['freq_x_y'] / (df['freq_x'] * df['freq_y']) )

Would this give a valid and/or efficient computation?

Sample I/O:

x  y  PMI
0  0  0.176
0  0  0.176
0  1  0
asked Dec 24 '22 08:12 by jfive


1 Answer

I would make three changes.

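import numpy as np

def pmi(dff, x, y):
    # Work on a copy so the caller's DataFrame is not modified.
    df = dff.copy()
    # Per-row counts of each x value, each y value, and each (x, y) pair.
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    # log(p(x,y) / (p(x) * p(y))) written with counts; len(df.index) is the row count N.
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df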
  1. Use df.groupby(x)[x].transform('count') and df.groupby(y)[y].transform('count') so that only the count is returned.
  2. Use np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y'])) so that probabilities, not raw counts, go into the ratio.
  3. Work on a copy of the DataFrame rather than modifying the input DataFrame.
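The len(df.index) factor in point 2 follows from writing each probability as a count over the number of rows N: p(x,y) = f_xy / N, p(x) = f_x / N and p(y) = f_y / N, so p(x,y) / (p(x) * p(y)) = N * f_xy / (f_x * f_y). Here is a minimal sketch that checks this by hand on a small made-up DataFrame. The data and the column names 'x' and 'y' are assumptions for illustration, not taken from the question, and the function is restated only so the block runs on its own.

```python
import numpy as np
import pandas as pd


def pmi(dff, x, y):
    # Same steps as the function above, repeated so this block is self-contained.
    df = dff.copy()
    df['f_x'] = df.groupby(x)[x].transform('count')
    df['f_y'] = df.groupby(y)[y].transform('count')
    df['f_xy'] = df.groupby([x, y])[x].transform('count')
    df['pmi'] = np.log(len(df.index) * df['f_xy'] / (df['f_x'] * df['f_y']))
    return df


# Hypothetical toy data (not from the question): N = 4 rows.
data = pd.DataFrame({'x': [0, 0, 1, 1], 'y': [0, 0, 1, 0]})
result = pmi(data, 'x', 'y')

# Row 0 is (x=0, y=0): f_x = 2, f_y = 3, f_xy = 2,
# so PMI = log(4 * 2 / (2 * 3)) = log(4/3).
print(result[['x', 'y', 'pmi']])

# The input DataFrame keeps only its original columns.
print(list(data.columns))
```

Note that np.log is the natural logarithm. If PMI in another base is wanted, np.log2 or np.log10 can be swapped in without changing the rest of the function.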
answered May 16 '23 06:05 by Zero