I am trying to build a tf-idf weighted corpus (where I expect tf to be a proportion by document rather than a raw count). I would expect the same values to be returned by all the classic text mining libraries, but I am getting different values. Is there an error in my code (e.g. do I need to transpose an object?), or do the default parameters of the tf-idf calculation differ across the packages?
library(tm)
library(tidyverse)
library(tidytext)  # provides unnest_tokens(), bind_tf_idf(), cast_dtm(), cast_dfm()
library(quanteda)

df <- data.frame(doc  = c("doc1", "doc2"),
                 text = c("the quick brown fox jumps over the lazy dog",
                          "The quick brown foxy ox jumps over the lazy god"),
                 stringsAsFactors = FALSE)
df.count1 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  bind_tf_idf(word, doc, n) %>%
  select(doc, word, tf_idf) %>%
  spread(word, tf_idf, fill = 0)

df.count2 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dtm(document = doc, term = word, value = n, weighting = weightTfIdf) %>%
  as.matrix() %>% as.data.frame()

df.count3 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dfm(document = doc, term = word, value = n) %>%
  dfm_tfidf() %>% as.data.frame()
> df.count1
# A tibble: 2 x 12
doc brown dog fox foxy god jumps lazy over ox quick the
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 doc1 0 0.0770 0.0770 0 0 0 0 0 0 0 0
2 doc2 0 0 0 0.0693 0.0693 0 0 0 0.0693 0 0
> df.count2
brown dog fox jumps lazy over quick the foxy god ox
doc1 0 0.1111111 0.1111111 0 0 0 0 0 0.0 0.0 0.0
doc2 0 0.0000000 0.0000000 0 0 0 0 0 0.1 0.1 0.1
> df.count3
brown dog fox jumps lazy over quick the foxy god ox
doc1 0 0.30103 0.30103 0 0 0 0 0 0.00000 0.00000 0.00000
doc2 0 0.00000 0.00000 0 0 0 0 0 0.30103 0.30103 0.30103
You have stumbled upon the differences in how these packages calculate tf-idf weights.
Standard definitions:
TF (term frequency): TF(t) = (number of times term t appears in a document) / (total number of terms in the document).
IDF (inverse document frequency): IDF(t) = log(total number of documents / number of documents containing term t).
The tf-idf weight is the product of these quantities: TF(t) * IDF(t).
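To make the definitions concrete, here is a minimal by-hand sketch for the two toy documents above, assuming proportional tf and a natural-log idf (exactly the assumptions that turn out to differ between the packages):

# A minimal by-hand tf-idf, assuming proportional tf and natural-log idf
docs <- list(doc1 = strsplit("the quick brown fox jumps over the lazy dog", " ")[[1]],
             doc2 = strsplit(tolower("The quick brown foxy ox jumps over the lazy god"), " ")[[1]])
vocab <- unique(unlist(docs))
# tf: counts per term divided by the document length
tf <- sapply(docs, function(d) table(factor(d, levels = vocab)) / length(d))
# idf: log(number of documents / number of documents containing the term)
idf <- log(length(docs) / rowSums(tf > 0))
tf_idf <- t(tf * idf)  # rows = documents, columns = terms
tf_idf["doc1", "dog"]  # 1/9 * log(2) = 0.07701635, matching df.count1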
Looks simple, but it isn't. Let's calculate the tf-idf for the word dog in doc1.
First, TF for dog: the term appears once among the 9 terms in the document.
1/9 = 0.1111111
Now IDF for dog: the log of (2 documents / 1 document containing the term). And here there are multiple possibilities, namely log (the natural log), log2, or log10!
log(2) = 0.6931472
log2(2) = 1
log10(2) = 0.30103
# tf-idf with the natural log:
1/9 * log(2)    # 0.07701635
# tf-idf with log2:
1/9 * log2(2)   # 0.1111111
# tf-idf with log10:
1/9 * log10(2)  # 0.03344778
Now it gets interesting. tidytext gives you the weighting based on the natural log. tm returns the tf-idf based on log2. I expected the value 0.03344778 from quanteda, because its base is log10.
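To summarise, here is a quick sketch of the value for dog in doc1 under each package's default convention (the raw-count tf for quanteda anticipates the explanation below):

# dog occurs once among the 9 terms of doc1, and in 1 of the 2 documents
c(tidytext = 1/9 * log(2),    # proportional tf, natural log -> 0.07701635
  tm       = 1/9 * log2(2),   # proportional tf, log2        -> 0.1111111
  quanteda = 1   * log10(2))  # raw-count tf,    log10       -> 0.30103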
But looking into quanteda, it turns out the result is computed correctly; quanteda simply uses the raw count by default instead of a proportional count. To get everything as expected, try the following code:
df.count3 <- df %>% unnest_tokens(word, text) %>%
  count(doc, word) %>%
  cast_dfm(document = doc, term = word, value = n)

dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse")
Document-feature matrix of: 2 documents, 11 features (22.7% sparse).
2 x 11 sparse Matrix of class "dfm"
      features
docs   brown        dog        fox jumps lazy over quick the     foxy      god       ox
  doc1     0 0.03344778 0.03344778     0    0    0     0   0 0.000000 0.000000 0.000000
  doc2     0 0.00000000 0.00000000     0    0    0     0   0 0.030103 0.030103 0.030103
That looks better, and it is based on log10. If you use quanteda with adjusted parameters, you can reproduce the tidytext or tm outcome by changing the base parameter:
# same as tidytext (natural log)
dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse", base = exp(1))

# same as tm (log2)
dfm_tfidf(df.count3, scheme_tf = "prop", scheme_df = "inverse", base = 2)
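As a final sanity check (a sketch, assuming the objects created above are still in the workspace), you can verify numerically that the base-e quanteda weights coincide with the tidytext values once the columns are aligned:

m_tidy <- as.matrix(column_to_rownames(as.data.frame(df.count1), "doc"))
m_q    <- as.matrix(dfm_tfidf(df.count3, scheme_tf = "prop",
                              scheme_df = "inverse", base = exp(1)))
# align the column order before comparing; should return TRUE
all.equal(m_tidy[, colnames(m_q)], m_q, check.attributes = FALSE)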