I have a dataframe df1 that contains rows of tokenized strings:
df1 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play',
'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly',
'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise',
'party', 'best', 'friends']]})
I also have a dataframe df2 that contains single-word strings along with a score for each word:
df2 = pd.DataFrame(data = {'word' : ['adopted', 'auditioned',
'favorite', 'gym', 'play', 'sandwiches'], 'score' : [1, 2, 3, 4, 5,
6]})
What is the best way to use df2 as a lookup table that I can also use in calculations? For each row in df1, I need to check whether any of its words exist in df2. If so, count the number of words found and store the result in a series called word_count (if a word appears more than once in a row of df1, count each occurrence). Also, sum the scores of all the words found and store the result in a series called total_score. The final output should look like df3:
df3 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play', 'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise', 'party', 'best', 'friends']], 'word_count' : [3, 1, 2, 1, 0], 'total_score' : [12, 1, 9, 4, None]})
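Use:
d = df2.set_index('word')['score']

def f(x):
    # keep the score of every token found in df2 (repeated tokens count each time)
    y = [d.get(a) for a in x if a in d]
    return pd.Series([len(y), sum(y)], index=['word_count','total_score'])

# start from the input frame so the snippet runs on df1 and df2 alone
df3 = df1.copy()
df3[['word_count','total_score']] = df3['tokens'].apply(f)
print (df3)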
                                          tokens  word_count  total_score
0           [auditioned, lead, role, play, play]           3           12
1                     [kittens, adopted, family]           1            1
2  [peanut, butter, jelly, sandwiches, favorite]           2            9
3                    [committee, decorated, gym]           1            4
4               [surprise, party, best, friends]           0            0
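Note that the last row gets a total_score of 0 here, whereas the df3 in the question has None for it. If a missing value is preferred for rows with no matches, one possible alternative (a sketch, not part of the answer above) is to explode the token lists, map them against df2, and aggregate per original row. It assumes pandas 0.25 or later for Series.explode; the names lookup, exploded and scores are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends'],
]})
df2 = pd.DataFrame({
    'word': ['adopted', 'auditioned', 'favorite', 'gym', 'play', 'sandwiches'],
    'score': [1, 2, 3, 4, 5, 6],
})

# word -> score lookup, indexed by word
lookup = df2.set_index('word')['score']

# one row per token; the original row label is repeated in the index
exploded = df1['tokens'].explode()

# score for each token, NaN where the token is not in df2
scores = exploded.map(lookup)

df3 = df1.copy()
# count() ignores NaN, so only matched tokens are counted
df3['word_count'] = scores.groupby(level=0).count()
# min_count=1 makes the sum NaN when a row has no matched tokens
df3['total_score'] = scores.groupby(level=0).sum(min_count=1)
print(df3)
```

With this approach total_score becomes a float column, because NaN cannot be stored in a plain integer column.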