I'm trying to find the probability of a given word within a dataframe, but I'm getting an "AttributeError: 'Series' object has no attribute 'columns'" error with my current setup. Hoping you can help me find where the error is.
I'm starting with a dataframe that looks like the below, and transforming it to find the total count for each individual word with the function below.
query            count
foo bar             10
super                8
foo                  4
super foo bar        2
Function below:
def _words(df):
    return df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
Resulting in the below df (note 'foo' is 16 since it appears 16 times in the whole df):
bar 12
foo 16
super 10
The issue comes in when trying to find the probability of a given keyword within the df, which currently has no column name. Below is what I'm currently working with, but it throws the "AttributeError: 'Series' object has no attribute 'columns'" error.
def _probability(df, query):
    return df[query] / df.groupby['count'].sum()
My hope is that calling _probability(df, 'foo') will return 0.421052632 (16/(12+16+10)). Thanks in advance!
You could throw a pipe on the end of it:
df['query'].str.get_dummies(sep=' ').T.dot(df['count']).pipe(lambda x: x / x.sum())
bar 0.315789
foo 0.421053
super 0.263158
dtype: float64
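For anyone unsure what the pipe is doing here, a minimal sketch follows. It rebuilds the sample frame from the question (the literal values are copied from the post; the variable names `totals`, `probs` and `plain` are mine) and shows that `.pipe` just hands the word totals to the lambda, so it gives the same result as dividing by the sum in a separate step.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

# One 0/1 indicator column per word, then weight each row by its count.
totals = df["query"].str.get_dummies(sep=" ").T.dot(df["count"])

# pipe passes the Series to the lambda, which divides by the grand total.
probs = totals.pipe(lambda x: x / x.sum())

# The same thing written as two explicit steps.
plain = totals / totals.sum()

print(probs)
```

Since the result is indexed by word, `probs['foo']` should give the single value the question asks for.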
Starting over: this is more complicated, but faster.
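import numpy as np
import pandas as pd
q = df['query'].values
# words per row = spaces + 1; repeat each row's count once per word
c = df['count'].values.repeat(np.char.count(q.astype(str), ' ') + 1)
# integer code for every word occurrence, plus the unique words
f, u = pd.factorize(np.array(' '.join(q.tolist()).split()))
# sum the repeated counts per word code
b = np.bincount(f, c)
pd.Series(b / b.sum(), u)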
foo 0.421053
bar 0.315789
super 0.263158
dtype: float64
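If you want to convince yourself that the NumPy route matches the get_dummies route, here is a rough sketch that runs both on the sample data and compares them. The names `fast` and `slow` are just labels I picked, and it uses the public `np.char.count`; I have not benchmarked either version here, so treat the speed claim as the answerer's.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})

q = df["query"].to_numpy()

# Words per row = number of spaces + 1.
words_per_row = np.char.count(q.astype(str), " ") + 1

# Repeat each row's count once for every word in that row.
c = df["count"].to_numpy().repeat(words_per_row)

# Flatten all queries into one word list and encode each word as an integer.
f, u = pd.factorize(np.array(" ".join(q.tolist()).split()))

# Weighted bincount sums the repeated counts per word code.
b = np.bincount(f, weights=c)
fast = pd.Series(b / b.sum(), index=u)

# Reference result from the get_dummies approach.
slow = (
    df["query"].str.get_dummies(sep=" ").T.dot(df["count"])
    .pipe(lambda x: x / x.sum())
)

# Sort both by word so the comparison ignores ordering.
same = np.allclose(fast.sort_index().to_numpy(), slow.sort_index().to_numpy())
print(same)
```

Note the two results come back in different orders (first appearance versus alphabetical), which is why the comparison sorts by index first.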
Why not pass the new object (the per-word totals) to the function?
df1 = df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
def _probability(df, query):
    return df[query] / df.sum()
_probability(df1, 'foo')
You get
0.42105263157894735
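A small sketch of why this works, assuming the sample data from the question: the word totals come back as a Series indexed by word, not a DataFrame, so there is no `columns` attribute and no 'count' column to group by. Indexing by label and dividing by `.sum()` is all that is needed. The parameter name `totals` and the helper variables are my own naming.

```python
import pandas as pd

# Sample data reconstructed from the question.
df = pd.DataFrame({
    "query": ["foo bar", "super", "foo", "super foo bar"],
    "count": [10, 8, 4, 2],
})


def _words(df):
    # Total count per individual word, returned as a Series indexed by word.
    return df["query"].str.get_dummies(sep=" ").T.dot(df["count"])


def _probability(totals, query):
    # Label lookup on the Series divided by the grand total (a scalar).
    return totals[query] / totals.sum()


totals = _words(df)

# A Series has an index but no columns.
has_columns = hasattr(totals, "columns")

p_foo = _probability(totals, "foo")
print(type(totals).__name__, has_columns, p_foo)
```

Calling `_probability` on the original `df` instead would look for a column named 'foo', which does not exist there.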