I was wondering if there is a better way to get the probability of each row of a 2D numpy array, maybe using some of numpy's built-in functions.
For simplicity, say we have an example array:
[['apple','pie'],
['apple','juice'],
['orange','pie'],
['strawberry','cream'],
['strawberry','candy']]
I would like to get probabilities such as:
['apple' 'juice'] --> 0.4 * 0.5 = 0.2
['apple' 'pie'] --> 0.4 * 0.5 = 0.2
['orange' 'pie'] --> 0.2 * 1.0 = 0.2
['strawberry' 'candy'] --> 0.4 * 0.5 = 0.2
['strawberry' 'cream'] --> 0.4 * 0.5 = 0.2
Here 'juice' as the second word has a probability of 0.2, since 'apple' has probability 2/5 and 'juice' has probability 1/2 given 'apple'.
On the other hand, 'pie' as a second word has a total probability of 0.4: the sum of its probabilities under 'apple' and 'orange'.
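To make that arithmetic concrete, here is a minimal pure-Python sketch of the definition (just collections.Counter to illustrate the chain rule, not the numpy solution I am after):

from collections import Counter

pairs = [('apple', 'pie'), ('apple', 'juice'), ('orange', 'pie'),
         ('strawberry', 'cream'), ('strawberry', 'candy')]
first_counts = Counter(first for first, _ in pairs)  # counts of each first word
pair_counts = Counter(pairs)                         # counts of each (first, second) pair
for (first, second), n in pair_counts.items():
    p_first = first_counts[first] / len(pairs)       # e.g. 2/5 for 'apple'
    p_second = n / first_counts[first]               # e.g. 1/2 for 'juice' given 'apple'
    print(first, second, p_first * p_second)         # chained probability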
The way I approached the problem was to add 3 new columns to the array: the probability of the 1st column, the probability of the 2nd column, and the final probability. I group the array by the 1st column, then by the 2nd column, and update the probabilities accordingly.
Below is my code:
import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
ans = []
unique, counts = np.unique(a.T[0], return_counts=True)  ## TRANSPOSE a, AND GET unique
myCounter = zip(unique, counts)
num_rows = sum(counts)
a = np.c_[a, np.zeros(num_rows), np.zeros(num_rows), np.zeros(num_rows)]  ## ADD 3 COLUMNS TO a
groups = []
## GATHER GROUPS BASED ON COLUMN 0
for _unique, _count in myCounter:
    index = a[:, 0] == _unique  ## WHERE COLUMN 0 MATCHES _unique
    curr_a = a[index]
    for j in range(len(curr_a)):
        curr_a[j][2] = _count / num_rows
    groups.append(curr_a)
## GATHER UNIQUENESS FROM COLUMN 1, PER GROUP
for g in groups:
    unique, counts = np.unique(g.T[1], return_counts=True)
    myCounter = zip(unique, counts)
    num_rows = sum(counts)
    for _unique, _count in myCounter:
        index = g[:, 1] == _unique
        curr_g = g[index]
        for j in range(len(curr_g)):
            curr_g[j][3] = _count / num_rows
            curr_g[j][4] = float(curr_g[j][2]) * float(curr_g[j][3])  ## COMPUTE FINAL PROBABILITY
            ans.append(curr_g[j])
for an in ans:
    print(an)
Outputs:
['apple' 'juice' '0.4' '0.5' '0.2']
['apple' 'pie' '0.4' '0.5' '0.2']
['orange' 'pie' '0.2' '1.0' '0.2']
['strawberry' 'candy' '0.4' '0.5' '0.2']
['strawberry' 'cream' '0.4' '0.5' '0.2']
Is there a shorter/faster way of doing this using numpy or other means? Adding columns is not necessary; that was just my way of doing it. Other approaches are acceptable.
Based on the definition of the probability distribution you have given, you can use pandas to do the same, i.e.
import numpy as np
import pandas as pd

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
df = pd.DataFrame(a)
# Find the frequency of the first word and divide by the total number of rows
df[2] = df[0].map(df[0].value_counts()) / df.shape[0]
# Divide 1 by the number of repetitions of the first word
df[3] = 1 / (df[0].map(df[0].value_counts()))
# Multiply the probabilities
df[4] = df[2] * df[3]
Output:
            0      1    2    3    4
0       apple    pie  0.4  0.5  0.2
1       apple  juice  0.4  0.5  0.2
2      orange    pie  0.2  1.0  0.2
3  strawberry  cream  0.4  0.5  0.2
4  strawberry  candy  0.4  0.5  0.2
If you want that in the form of a list, you can use df.values.tolist().
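For example, a rough sketch of what that returns for the DataFrame above (the words stay strings, the probabilities are floats):

rows = df.values.tolist()
# rows is roughly:
# [['apple', 'pie', 0.4, 0.5, 0.2],
#  ['apple', 'juice', 0.4, 0.5, 0.2],
#  ['orange', 'pie', 0.2, 1.0, 0.2],
#  ['strawberry', 'cream', 0.4, 0.5, 0.2],
#  ['strawberry', 'candy', 0.4, 0.5, 0.2]]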
If you don't want the intermediate columns, then
df = pd.DataFrame(a)
df[2]=((df[0].map(df[0].value_counts())/df.shape[0]) * (1/(df[0].map(df[0].value_counts()))))
Output:
            0      1    2
0       apple    pie  0.2
1       apple  juice  0.2
2      orange    pie  0.2
3  strawberry  cream  0.2
4  strawberry  candy  0.2
For the combined probability of each second word, print(df.groupby(1)[2].sum()):
candy    0.2
cream    0.2
juice    0.2
pie      0.4
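If you would rather stay in numpy only, here is one possible sketch of the same computation (my own illustration, not necessarily the fastest; the axis=0 form of np.unique needs NumPy >= 1.13):

import numpy as np

a = np.array([['apple','pie'],['apple','juice'],['orange','pie'],['strawberry','cream'],['strawberry','candy']])
# P(first word): per-row count of the first word divided by the total number of rows
u0, inv0, c0 = np.unique(a[:, 0], return_inverse=True, return_counts=True)
p_first = c0[inv0] / len(a)
# P(second | first): per-row count of the (first, second) pair divided by the first-word count
up, invp, cp = np.unique(a, axis=0, return_inverse=True, return_counts=True)
p_second_given_first = cp[invp] / c0[inv0]
print(p_first * p_second_given_first)  # [0.2 0.2 0.2 0.2 0.2]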