I am trying to build a tf-idf weighted corpus (where I expect tf to be a proportion by document rather than a raw count). I would expect the same values to be returned by all the classic text mining libraries, but I am getting different values. Is there an error in my code (e.g. do I need to transpose an object?), or do the default parameters of the tf-idf calculation differ across the packages?
library(tm)
library(tidyverse)
library(tidytext)  # provides unnest_tokens(), bind_tf_idf(), cast_dtm(), cast_dfm()
library(quanteda)

df <- data.frame(doc  = c("doc1", "doc2"),
                 text = c("the quick brown fox jumps over the lazy dog",
                          "The quick brown foxy ox jumps over the lazy god"),
                 stringsAsFactors = FALSE)
df.count1 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n) %>%
  select(doc, word, tf_idf) %>%
  spread(word, tf_idf, fill = 0)

df.count2 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(document = doc, term = word, value = n, weighting = weightTfIdf) %>%
  as.matrix() %>% as.data.frame()

df.count3 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dfm(document = doc, term = word, value = n) %>%
  dfm_tfidf() %>% as.data.frame()
> df.count1
# A tibble: 2 x 12
doc brown dog fox foxy god jumps lazy over ox quick the
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 doc1 0 0.0770 0.0770 0 0 0 0 0 0 0 0
2 doc2 0 0 0 0.0693 0.0693 0 0 0 0.0693 0 0
> df.count2
brown dog fox jumps lazy over quick the foxy god ox
doc1 0 0.1111111 0.1111111 0 0 0 0 0 0.0 0.0 0.0
doc2 0 0.0000000 0.0000000 0 0 0 0 0 0.1 0.1 0.1
> df.count3
brown dog fox jumps lazy over quick the foxy god ox
doc1 0 0.30103 0.30103 0 0 0 0 0 0.00000 0.00000 0.00000
doc2 0 0.00000 0.00000 0 0 0 0 0 0.30103 0.30103 0.30103
You have stumbled upon the differences in how these packages calculate tf-idf weights.
Standard definitions:
TF (term frequency): TF(t) = (number of times term t appears in a document) / (total number of terms in the document).
IDF (inverse document frequency): IDF(t) = log(total number of documents / number of documents containing term t).
The tf-idf weight is the product of these quantities: TF(t) * IDF(t).
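To make the definitions concrete, here is a minimal by-hand sketch for the two toy documents above, assuming proportional tf and a natural-log idf (exactly the assumptions that turn out to differ between the packages):

# A minimal by-hand tf-idf, assuming proportional tf and natural-log idf
docs <- list(doc1 = strsplit("the quick brown fox jumps over the lazy dog", " ")[[1]],
             doc2 = strsplit(tolower("The quick brown foxy ox jumps over the lazy god"), " ")[[1]])
vocab <- unique(unlist(docs))
# tf: counts per term divided by the document length
tf <- sapply(docs, function(d) table(factor(d, levels = vocab)) / length(d))
# idf: log(number of documents / number of documents containing the term)
idf <- log(length(docs) / rowSums(tf > 0))
tf_idf <- t(tf * idf)  # rows = documents, columns = terms
tf_idf["doc1", "dog"]  # 1/9 * log(2) = 0.07701635, matching df.count1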
Looks simple, but it isn't. Let's calculate the tf-idf for the word dog in doc1.
First, TF for dog: the term appears once among the 9 terms in the document.
1/9 = 0.1111111
Now IDF for dog: the log of (2 documents / 1 document containing the term). And here there are multiple possibilities, namely log (the natural log), log2, or log10!
log(2) = 0.6931472
log2(2) = 1
log10(2) = 0.30103
# tf-idf with the natural log:
1/9 * log(2)    # 0.07701635
# tf-idf with log2:
1/9 * log2(2)   # 0.1111111
# tf-idf with log10:
1/9 * log10(2)  # 0.03344778
Now it gets interesting. tidytext gives you the weighting based on the natural log. tm returns the tf-idf based on log2. I expected the value 0.03344778 from quanteda, because its base is log10.
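To summarise, here is a quick sketch of the value for dog in doc1 under each package's default convention (the raw-count tf for quanteda anticipates the explanation below):

# dog occurs once among the 9 terms of doc1, and in 1 of the 2 documents
c(tidytext = 1/9 * log(2),    # proportional tf, natural log -> 0.07701635
  tm       = 1/9 * log2(2),   # proportional tf, log2        -> 0.1111111
  quanteda = 1   * log10(2))  # raw-count tf,    log10       -> 0.30103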
But looking into quanteda, it turns out the result is computed correctly; quanteda simply uses the raw count by default instead of a proportional count. To get everything as expected, try the following code:
df.count3 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dfm(document = doc, term = word, value = n)

dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse")
Document-feature matrix of: 2 documents, 11 features (22.7% sparse).
2 x 11 sparse Matrix of class "dfm"
      features
docs   brown        dog        fox jumps lazy over quick the     foxy      god       ox
  doc1     0 0.03344778 0.03344778     0    0    0     0   0 0.000000 0.000000 0.000000
  doc2     0 0.00000000 0.00000000     0    0    0     0   0 0.030103 0.030103 0.030103
That looks better, and it is based on log10. If you use quanteda with adjusted parameters, you can reproduce the tidytext or tm outcome by changing the base parameter:
# same as tidytext (natural log)
dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse", base = exp(1))

# same as tm (log2)
dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse", base = 2)
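As a final sanity check (a sketch, assuming the objects created above are still in the workspace), you can verify numerically that the base-e quanteda weights coincide with the tidytext values once the columns are aligned:

m_tidy <- as.matrix(column_to_rownames(as.data.frame(df.count1), "doc"))
m_q    <- as.matrix(dfm_tfidf(df.count3, scheme_tf = "prop",
                              scheme_df = "inverse", base = exp(1)))
# align the column order before comparing; should return TRUE
all.equal(m_tidy[, colnames(m_q)], m_q, check.attributes = FALSE)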