
How to check a dataframe consisting of a list of strings against a lookup dataframe and perform calculations?

I have a dataframe df1 that contains rows of tokenized strings:

df1 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play', 
'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly', 
'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise', 
'party', 'best', 'friends']]})

I also have a dataframe df2 that contains single-word strings as well as a score pertaining to each word:

df2 = pd.DataFrame(data = {'word' : ['adopted', 'auditioned',
'favorite', 'gym', 'play', 'sandwiches'], 'score' : [1, 2, 3, 4, 5,
6]})

What is the best way to use df2 as a sort of lookup "table" that I can also use to help perform calculations?

For each row in df1, I need to check whether any of its words exist in df2. If so, count the matching words and store the result in a column called word_count (if a word appears more than once in a row of df1, count each occurrence). Also, sum the scores of all matching words into a column called total_score. The final output should look like df3:

df3 = pd.DataFrame(data = {'tokens' : [['auditioned', 'lead', 'role', 'play', 'play'], ['kittens', 'adopted', 'family'], ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'], ['committee', 'decorated', 'gym'], ['surprise', 'party', 'best', 'friends']], 'word_count' : [3, 1, 2, 1, 0], 'total_score' : [12, 1, 9, 4, None]})
Insu Q asked Jun 27 '18 15:06




1 Answer

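Use:

import pandas as pd

# Series that maps each word (index) to its score
d = df2.set_index('word')['score']

def f(x):
    # scores of the tokens found in the lookup; repeated tokens are counted each time
    y = [d.get(a) for a in x if a in d]
    return pd.Series([len(y), sum(y)], index=['word_count','total_score'])

# apply to the token lists in df1 and attach both results as new columns
df1[['word_count','total_score']] = df1['tokens'].apply(f)
print (df1)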
                                          tokens  word_count  total_score
0           [auditioned, lead, role, play, play]           3           12
1                     [kittens, adopted, family]           1            1
2  [peanut, butter, jelly, sandwiches, favorite]           2            9
3                    [committee, decorated, gym]           1            4
4               [surprise, party, best, friends]           0            0
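
Note that the output above gives a total_score of 0 for the last row, while the df3 in the question shows None there. As an alternative, here is a minimal sketch that avoids the row-wise apply and leaves total_score missing (NaN) when no word matches. It assumes pandas 0.25 or later for Series.explode; the names lookup, scores, grouped and result are illustrative and not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'tokens': [
    ['auditioned', 'lead', 'role', 'play', 'play'],
    ['kittens', 'adopted', 'family'],
    ['peanut', 'butter', 'jelly', 'sandwiches', 'favorite'],
    ['committee', 'decorated', 'gym'],
    ['surprise', 'party', 'best', 'friends'],
]})
df2 = pd.DataFrame({
    'word': ['adopted', 'auditioned', 'favorite', 'gym', 'play', 'sandwiches'],
    'score': [1, 2, 3, 4, 5, 6],
})

# Series that maps each word (index) to its score
lookup = df2.set_index('word')['score']

# One row per token, keeping the original row label; unknown words become NaN
scores = df1['tokens'].explode().map(lookup)

# Group the per-token scores back by original row label
grouped = scores.groupby(level=0)

result = df1.assign(
    word_count=grouped.count(),            # counts only non-NaN (matched) tokens
    total_score=grouped.sum(min_count=1),  # NaN instead of 0 when nothing matched
)
print(result)
```

With this approach total_score becomes a float column, because NaN is used to mark rows without any match. If 0 is preferred for those rows, dropping min_count=1 should give the same totals as the apply-based version.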
jezrael answered Oct 15 '22 12:10