Sentimental Analysis of review comments using qdap is slow

Tags: r, shiny, sentiment-analysis, qdap

I am using the qdap package to determine the sentiment of each review comment for a particular application. I read the review comments from a CSV file and pass them to qdap's polarity function. Everything works and I get the polarity for all the review comments, but it takes 7-8 seconds to calculate the polarity of all the sentences (the CSV file contains 779 sentences in total). My code is below.

  temp_csv <- filePath()
  attach(temp_csv)
  text_data <- temp_csv[,c('Content')]
  print(Sys.time())
  polterms <- list(neg=c('wtf'))
  POLKEY <- sentiment_frame(positives=c(positive.words),negatives=c(polterms[[1]],negative.words))     
  polarity <- polarity(sentences, polarity.frame = POLKEY) 
  print(Sys.time())

Time taken is as follows:

[1] "2016-04-12 16:43:01 IST"

[1] "2016-04-12 16:43:09 IST"

Can somebody let me know if I am doing something wrong? How can I improve the performance?

asked Apr 12 '16 12:04

VenuSathya20

1 Answer

I am the author of qdap. The polarity function was designed for much smaller data sets. As my role shifted I began to work with larger data sets. I needed something both fast and accurate (two goals that pull against each other) and have since developed a breakaway package, sentimentr. Its algorithm is optimized to be faster and more accurate than qdap's polarity.
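As a rough sketch of what switching might look like for the review comments in the question: the helper below splits each comment into sentences and averages the sentence scores per comment. It assumes sentimentr is installed, and `score_reviews` is a hypothetical name used only for illustration. Custom terms such as 'wtf' are not handled here; sentimentr's `sentiment()` takes a `polarity_dt` argument for a custom lexicon, so check its documentation if you need that.

```r
# Sketch only: score comments with sentimentr rather than qdap::polarity().
# Assumes the sentimentr package is installed; score_reviews is a
# hypothetical helper name, not part of any package.
score_reviews <- function(comments) {
    # Split each comment into sentences once, up front
    sentences <- sentimentr::get_sentences(as.character(comments))
    # One row per comment, with the average sentence sentiment
    sentimentr::sentiment_by(sentences)
}

# Possible usage with the data frame from the question:
# review_sent <- score_reviews(temp_csv[, "Content"])
# review_sent$ave_sentiment
```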

As it stands now you have 5 dictionary based (or trained algorithm based) approaches to sentiment detection. Each has its drawbacks (-) and pluses (+) and is useful in certain circumstances.

1. qdap: +on CRAN; -slow
2. syuzhet: +on CRAN; +fast; +great plotting; -less accurate on non-literature use
3. sentimentr: +fast; +higher accuracy; -GitHub only
4. stansent (stanford port): +most accurate; -slower
5. tm.plugin.sentiment: -archived on CRAN; -I couldn't get it working easily
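For readers unfamiliar with the term, a dictionary based approach boils down to looking words up in lists of positive and negative terms. The toy scorer below is a minimal base R sketch of that idea. It is not the algorithm used by qdap, sentimentr, or any other package listed above; the word lists and the name `toy_polarity` are made up for illustration, and dividing by the square root of the word count is just one possible scaling choice.

```r
# Toy dictionary-based polarity scorer (illustrative only; NOT the algorithm
# of qdap or sentimentr). The word lists are made-up placeholders.
positive_terms <- c("good", "great", "love", "fast")
negative_terms <- c("bad", "slow", "crash", "wtf")

toy_polarity <- function(text) {
    # Lower-case each comment and split it into word tokens
    tokens <- strsplit(tolower(text), "[^a-z']+")
    vapply(tokens, function(words) {
        words <- words[nzchar(words)]
        if (length(words) == 0) return(0)
        # Net count of positive minus negative dictionary hits
        hits <- sum(words %in% positive_terms) - sum(words %in% negative_terms)
        # Scale by sqrt(word count) so long comments do not dominate
        hits / sqrt(length(words))
    }, numeric(1))
}

scores <- toy_polarity(c("Great app, I love it", "wtf so slow", "It is an app"))
scores
```

Real packages add things this sketch ignores, such as negation and intensifier handling, which is part of where the speed and accuracy trade-offs come from.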

I show time tests on sample data for the first 4 choices from above in the code below.

Install packages and make timing functions

I use pacman because it allows the reader to just run the code, though you can replace it with install.packages and library calls.

if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdap, syuzhet, dplyr)
pacman::p_load_current_gh(c("trinker/stansent", "trinker/sentimentr"))

pres_debates2012 #nrow = 2912

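# Record the start time in the chosen environment (global by default)
tic <- function(pos = 1, envir = as.environment(pos)) {
    assign(".tic", Sys.time(), pos = pos, envir = envir)
    Sys.time()
}

# Return the time elapsed since the last call to tic()
toc <- function(pos = 1, envir = as.environment(pos)) {
    difftime(Sys.time(), get(".tic", pos = pos, envir = envir))
}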

id <- 1:2912
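If you would rather not define tic and toc, base R's system.time() can time a single expression in a similar way. The sketch below uses a throwaway numeric workload as a stand-in for a sentiment call; the workload itself is an assumption for illustration and has nothing to do with the packages being compared.

```r
# Time one expression with base R; the workload here is only a stand-in
timing <- system.time({
    x <- vapply(1:2000, function(i) sum(sqrt(seq_len(i))), numeric(1))
})

# Wall-clock seconds taken by the expression
timing[["elapsed"]]
```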

Timings

## qdap
tic()
qdap_sent <- pres_debates2012 %>%
    with(qdap::polarity(dialogue, id))
toc() # Time difference of 18.14443 secs


## sentimentr
tic()
sentimentr_sent <- pres_debates2012 %>%
    with(sentiment(dialogue, id))
toc() # Time difference of 1.705685 secs


## syuzhet
tic()
syuzhet_sent <- pres_debates2012 %>%
    with(get_sentiment(dialogue, method="bing"))
toc() # Time difference of 1.183647 secs


## stanford
tic()
stanford_sent <- pres_debates2012 %>%
    with(sentiment_stanford(dialogue))
toc() # Time difference of 6.724482 mins
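To put the timings reported above on a common scale, the short calculation below converts them all to seconds and compares each package with qdap. The numbers are copied from the comments in the timing code; only the variable names are new. On this run sentimentr comes out roughly 10.6 times faster than qdap, syuzhet roughly 15.3 times faster, and stansent roughly 22 times slower.

```r
# Timings reported above, all converted to seconds
secs <- c(qdap = 18.14443, sentimentr = 1.705685,
          syuzhet = 1.183647, stansent = 6.724482 * 60)

# How many times faster (> 1) or slower (< 1) each package is than qdap
speedup <- secs[["qdap"]] / secs
round(speedup, 2)

# Rows processed per second on the 2912-row pres_debates2012 data
rows_per_sec <- 2912 / secs
round(rows_per_sec, 1)
```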

For more on timings and accuracy see my sentimentr README.md and please star the repo if it's useful. The viz below captures one of the tests from the README:

[Image: https://i.stack.imgur.com/UZwie.png]

answered Sep 28 '22 08:09

Tyler Rinker
