Retrieving sentence score based on values of words in a dictionary

Tags:

Edited df and dict

I have a data frame containing sentences:

df <- data_frame(text = c("I love pandas", "I hate monkeys", "pandas pandas pandas", "monkeys monkeys"))

And a dictionary containing words and their corresponding scores:

dict <- data_frame(word = c("love", "hate", "pandas", "monkeys"),
                   score = c(1,-1,1,-1))

I want to append a column "score" to df that would sum the score for each sentence:

Expected results

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     3
4      monkeys monkeys    -2

Update

Here are the results so far:

Akrun's methods

Suggestion 1

df %>% mutate(score = sapply(strsplit(text, ' '), function(x) with(dict, sum(score[word %in% x]))))

Note that for this method to work, I had to use data_frame() to create df and dict instead of data.frame() otherwise I get: Error in strsplit(text, " ") : non-character argument

Source: local data frame [4 x 2]

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

This does not accounts for multiple matches in a single string. Close to expected result, but not quite there yet.

Suggestion 2

I tweaked a bit one of akrun's suggestion in the comments to apply it to the edited post

cbind(df, unnest(stri_split_fixed(df$text, ' '), group) %>% 
        group_by(group) %>% 
        summarise(score = sum(dict$score[dict$word %in% x])) %>% 
        ungroup() %>% select(-group) %>% data.frame())

This does not account for multiple matches in a string:

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

Richard Scriven's methods

Suggestion 1

group_by(df, text) %>%
mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))

After updating all packages, this now works (although it does not account for multiple matches)

Source: local data frame [4 x 2]
Groups: text

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

Suggestion 2

total <- with(dict, {
  vapply(df$text, function(X) {
    sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
  }, 1)
})

cbind(df, total)

This give the same results:

                  text total
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     1
4      monkeys monkeys    -1

Suggestion 3

s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)

This actually works:

                  text total
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     3
4      monkeys monkeys    -2

Thelatemail's method

res <- sapply(dict$word, function(x) {
  sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) )
})

cbind(df, score = rowSums(res * dict$score))

Note that I added the cbind() part. This actually match the expected result.

                  text score
1        I love pandas     2
2       I hate monkeys    -2
3 pandas pandas pandas     3
4      monkeys monkeys    -2

Final answer

Inspired by akrun's suggestion, here is what I ended up writing as the most dplyr-esque solution:

library(dplyr)
library(tidyr)
library(stringi)

bind_cols(df, unnest(stri_split_fixed(df$text, ' '), group) %>% 
            group_by(x) %>% mutate(score = sum(dict$score[dict$word %in% x])) %>% 
            group_by(group) %>% 
            summarise(score = sum(score)) %>% 
            select(-group))

Although I will implement Richard Scriven's suggestion #3 since it's the most efficient.

Benchmark

Here are the suggestions applied to much larger datasets (df of 93 sentences and dict of 14K words) using microbenchmark():

mbm = microbenchmark(
  akrun = df %>% mutate(score = sapply(stri_detect_fixed(text, ' '), function(x) with(dict, sum(score[word %in% x])))),
  akrun2 = cbind(df, unnest(stri_split_fixed(df$text, ' '), group) %>% group_by(group) %>% summarise(score = sum(dict$score[dict$word %in% x])) %>% ungroup() %>% select(-group) %>% data.frame()),
  rscriven1 = group_by(df, text) %>% mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)])),
  rscriven2 = cbind(df, score = with(dict, { vapply(df$text, function(X) { sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])}, 1)})),
  rscriven3 = cbind(df, score = vapply(strsplit(df$text, " "), function(x) sum(with(dict, score[match(x, word, 0L)])), 1)),
  thelatemail = cbind(df, score = rowSums(sapply(dict$word, function(x) { sapply(gregexpr(x,df$text),function(y) length(y[y!=-1]) ) }) * dict$score)),
  sbeaupre = bind_cols(df, unnest(stri_split_fixed(df$text, ' '), group) %>% group_by(x) %>% mutate(score = sum(dict$score[dict$word %in% x])) %>% group_by(group) %>% summarise(score = sum(score)) %>% select(-group)),
  times = 10
)

And the results:

enter image description here

985

asked Jan 21 '15 05:01

Steven Beaupré

1 Answers

Update : Here's the easiest dplyr method I've found so far. And I'll add a stringi function to speed things up. Provided there are no identical sentences in df$text, we can group by that column and then apply mutate()

Note: Package versions are dplyr 0.4.1 and stringi 0.4.1

library(dplyr)
library(stringi)

group_by(df, text) %>%
    mutate(score = sum(dict$score[stri_detect_fixed(text, dict$word)]))
# Source: local data frame [2 x 2]
# Groups: text
#
#             text score
# 1  I love pandas     2
# 2 I hate monkeys    -2

I removed the do() method I posted last night, but you can find it in the edit history. To me it seems unnecessary since the above method works as well and is the more dplyr way to do it.

Additionally, if you're open to a non-dplyr answer, here are two using base functions.

total <- with(dict, {
    vapply(df$text, function(X) {
        sum(score[vapply(word, grepl, logical(1L), x = X, fixed = TRUE)])
    }, 1)
})
cbind(df, total)
#             text total
# 1  I love pandas     2
# 2 I hate monkeys    -2

Or an alternative using strsplit() produces the same result

s <- strsplit(df$text, " ")
total <- vapply(s, function(x) sum(with(dict, score[match(x, word, 0L)])), 1)
cbind(df, total)

186

answered Sep 24 '22 13:09

Rich Scriven

Related questions
                            
                                How to plot this cluster in R?
                            
                                rgl plot3d and snapshot3d: how do I save with resolution higher than my monitor?
                            
                                Changing lubridate function to start on Monday rather than Sunday
                            
                                replace <NA> with NA
                            
                                possible bug in `rbinom()` for large numbers of trials
                            
                                R replacing zeros in dataframe with next non zero value
                            
                                keep only hour: minute:second from a "POSIXlt" "POSIXt" object
                            
                                Convert FIX message format ("Tag=Value") into CSV
                            
                                RHbase/thrift install issue
                            
                                Avoid conflicts between vector and variable name in dplyr
                            
                                Import multiple json files from a directory and attaching the data
                            
                                how to check if page finished loading in RSelenium
                            
                                Real-time auto incrementing ggplot in R
                            
                                How do you conduct a power analysis for logistic regression in R?
                            
                                Visualizing the Kolmogorov-Smirnov statistic in ggplot2
                            
                                Temporarily change locale settings
                            
                                How to extract attributes values from svyciprop object?
                            
                                ggvis line chart with interactive x axis range
                            
                                Adding a legend/text table to a ggmap with multiple layers
                            
                                How to use validity functions correctly with inherited S4 classes in R

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Retrieving sentence score based on values of words in a dictionary

Tags:

r

lapply

sapply

dplyr

stringi

Steven Beaupré

People also ask

1 Answers

Rich Scriven

Recent Activity

Donate For Us