 

Best way to get joint probability from a 2D numpy array

I was wondering if there is a better way to get the joint probability of the rows of a 2D numpy array, perhaps using some of numpy's built-in functions.

For simplicity, say we have an example array:

[['apple','pie'],
 ['apple','juice'],
 ['orange','pie'],
 ['strawberry','cream'],
 ['strawberry','candy']]

I would like to get the joint probabilities, such as:

['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie']  --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2

Here 'juice' as the second word has a probability of 0.2, since 'apple' has a probability of 2/5 and 'juice' has a probability of 1/2 given 'apple'.

On the other hand, 'pie' as the second word has a probability of 0.4, combining the contributions from 'apple' and 'orange'.
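To make the definition concrete, here is a minimal plain-Python sketch (using collections.Counter rather than numpy) that reproduces these numbers:

from collections import Counter

pairs = [('apple', 'pie'), ('apple', 'juice'), ('orange', 'pie'),
         ('strawberry', 'cream'), ('strawberry', 'candy')]
n = len(pairs)

first_counts = Counter(first for first, _ in pairs)   # e.g. 'apple' -> 2
pair_counts = Counter(pairs)                          # e.g. ('apple', 'pie') -> 1

for first, second in pairs:
    p_first = first_counts[first] / n                               # e.g. 2/5 = 0.4
    p_second = pair_counts[(first, second)] / first_counts[first]   # e.g. 1/2 = 0.5
    print(first, second, p_first * p_second)                        # e.g. 0.2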

The way I approached the problem was to add 3 new columns to the array: the probability of the 1st column, the probability of the 2nd column, and the final probability. I then group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.

Below is my code:

import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])

ans = []
unique, counts = np.unique(a.T[0], return_counts=True)                      ## TRANSPOSE a, AND GET unique
myCounter = zip(unique,counts)
num_rows = sum(counts)
a = np.c_[a,np.zeros(num_rows),np.zeros(num_rows),np.zeros(num_rows)]       ## ADD 3 COLUMNS to a

groups = []
## GATHER GROUPS BASE ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:,0] == _unique                                               ## WHERE COLUMN 0 MATCH _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count/num_rows
    groups.append(curr_a)

## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)

    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
        curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])        ## COMPUTE FINAL PROBABILITY
        ans.append(curr_g[j])

for an in ans:
    print(an)

Outputs:

['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']

I was wondering if there is a shorter/faster way of doing this using numpy or other means. Adding columns is not necessary; that was just my way of doing it. Other approaches are acceptable.

1 Answer

Based on the definition of the probability distribution you have given, you can use pandas to do the same, i.e.

import numpy as np
import pandas as pd
a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])

df = pd.DataFrame(a)
# Find the frequency of first word and divide by the total number of rows
df[2]=df[0].map(df[0].value_counts())/df.shape[0]
# Divide 1 by the count of the first word (probability of the second word within its group)
df[3]=1/(df[0].map(df[0].value_counts()))
# Multiply the probabilities 
df[4]= df[2]*df[3]

Output:

            0      1    2    3    4
0       apple    pie  0.4  0.5  0.2
1       apple  juice  0.4  0.5  0.2
2      orange    pie  0.2  1.0  0.2
3  strawberry  cream  0.4  0.5  0.2
4  strawberry  candy  0.4  0.5  0.2

If you want that in the form of a list, you can use df.values.tolist():
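For example (values shown approximately):

rows = df.values.tolist()
# rows would look something like:
# [['apple', 'pie', 0.4, 0.5, 0.2],
#  ['apple', 'juice', 0.4, 0.5, 0.2],
#  ...]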

If you don't want the intermediate columns, then:

df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))

Output:

           0      1    2
0       apple    pie  0.2
1       apple  juice  0.2
2      orange    pie  0.2
3  strawberry  cream  0.2
4  strawberry  candy  0.2

For the combined probability of each second word, print(df.groupby(1)[2].sum()):

candy    0.2
cream    0.2
juice    0.2
pie      0.4
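
If you would rather stay in pure numpy, here is a minimal sketch of the same computation using np.unique with return_inverse and return_counts (just one possible alternative, assuming the same 5x2 string array a as above):

import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
n = len(a)

# P(first word): count of each first word divided by the total number of rows
first, inv_first, cnt_first = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = cnt_first[inv_first] / n

# P(second | first): count of each (first, second) pair divided by the count of its first word
pairs, inv_pair, cnt_pair = np.unique(a, axis=0, return_inverse=True, return_counts=True)
p_second = cnt_pair[inv_pair] / cnt_first[inv_first]

# Joint probability per row
p_joint = p_first * p_second

for row, p in zip(a, p_joint):
    print(row, p)

This gives the same per-row joint probabilities (0.2 for every row in this example) without adding any columns to the array.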