I have a pandas dataframe that contains queries and counts for a given time period, and I'm hoping to turn it into a count of unique words. For example, if the dataframe contained the following:
query          count
foo bar        10
super          8
foo            4
super foo bar  2
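For reproducibility, something like this should build that frame (the values are taken from the table above; the construction itself is my own):

import pandas as pd

df = pd.DataFrame({
    'query': ['foo bar', 'super', 'foo', 'super foo bar'],
    'count': [10, 8, 4, 2],
})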
I'm looking to receive the dataframe below, e.g. the word 'foo' appears a total of 16 times across the table (10 + 4 + 2).
word   count
foo    16
bar    12
super  10
I'm working with the function below, but it hardly seems like the optimal way to do this, and it ignores the count for each row.
from collections import Counter
import re

def _words(df):
    return Counter(re.findall(r'\w+', ' '.join(df['query'])))
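On the example frame it returns something like the following, i.e. it counts how many queries each word appears in but never weights by the count column:

_words(df)
Counter({'foo': 3, 'bar': 2, 'super': 2})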
Any help will be greatly appreciated.
Thanks in advance!
Option 1
df['query'].str.get_dummies(sep=' ').T.dot(df['count'])
bar 12
foo 16
super 10
dtype: int64
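To see why this works: get_dummies builds a word-indicator matrix (one column per word, one row per query), and the dot product with the count column sums each row's count into every word that row contains. On the example frame the intermediate looks roughly like this:

df['query'].str.get_dummies(sep=' ')

   bar  foo  super
0    1    1      0
1    0    0      1
2    0    1      0
3    1    1      1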
Option 2
df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0).sum()
bar 12
foo 16
super 10
dtype: int64
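Same indicator matrix, but here it is scaled row-wise by the count column first and then summed column-wise; the intermediate looks roughly like this:

df['query'].str.get_dummies(sep=' ').mul(df['count'], axis=0)

   bar  foo  super
0   10   10      0
1    0    0      8
2    0    4      0
3    2    2      2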
Option 3
numpy.bincount + pd.factorize
This also highlights cytoolz.mapcat, which maps a function over each element and concatenates the results into a single iterator. That's cool!
import pandas as pd, numpy as np, cytoolz

q = df['query'].values
c = df['count'].values

# flatten every query into one long list of words, then factorize it
f, u = pd.factorize(list(cytoolz.mapcat(str.split, q.tolist())))
# number of words in each query (spaces + 1)
l = np.core.defchararray.count(q.astype(str), ' ') + 1
# repeat each row's count once per word, then sum the weights per word code
pd.Series(np.bincount(f, c.repeat(l)).astype(int), u)
foo 16
bar 12
super 10
dtype: int64
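For the example frame the intermediate pieces come out roughly as follows, which is what lines the bincount weights up with each word occurrence:

list(cytoolz.mapcat(str.split, q.tolist()))  # ['foo', 'bar', 'super', 'foo', 'super', 'foo', 'bar']
f            # array([0, 1, 2, 0, 2, 0, 1])  word code per occurrence
u            # array(['foo', 'bar', 'super'], dtype=object)  unique words
l            # array([2, 1, 1, 3])  words per query
c.repeat(l)  # array([10, 10,  8,  4,  2,  2,  2])  one weight per occurrence

bincount then sums the weights per code: 10 + 4 + 2 = 16 for 'foo', 10 + 2 = 12 for 'bar', 8 + 2 = 10 for 'super'.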
Option 4
Absurd use of stuff... just use option 1.
pd.DataFrame(dict(
query=' '.join(df['query']).split(),
count=df['count'].repeat(df['query'].str.count(' ') + 1)
)).groupby('query')['count'].sum()
query
bar 12
foo 16
super 10
Name: count, dtype: int64
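The intermediate here is one row per word, each carrying its parent query's count; dropping the final groupby gives roughly the following (column order may vary with pandas version), and the groupby-sum then collapses it per word:

   query  count
0    foo     10
0    bar     10
1  super      8
2    foo      4
3  super      2
3    foo      2
3    bar      2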