Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Trying to get tf-idf weighting working in R

I am trying to do some very basic text analysis with the tm package and get some tf-idf scores; I'm running OS X (though I've tried this on Debian Squeeze with the same result); I've got a directory (which is my working directory) with a couple text files in it (the first containing the first three episodes of Ulysses, the second containing the second three episodes, if you must know).

R Version: 2.15.1 SessionInfo() Reports this about tm: [1] tm_0.5-8.3

Relevant bit of code:

library('tm')
corpus <- Corpus(DirSource('.'))
dtm <- DocumentTermMatrix(corpus,control=list(weight=weightTfIdf))

str(dtm)
List of 6
 $ i       : int [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:12456] 2 10 12 17 20 24 29 30 32 34 ...
 $ v       : num [1:12456] 1 1 1 1 1 1 1 1 1 1 ...
 $ nrow    : int 2
 $ ncol    : int 10646
 $ dimnames:List of 2
  ..$ Docs : chr [1:2] "bloom.txt" "telemachiad.txt"
  ..$ Terms: chr [1:10646] "_--c'est" "_--et" "_--for" "_--goodbye," ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

You will note, that the weighting appears to still be the default term frequency (tf) rather than the weighted tf-idf scores that I'd like.

Apologies if I'm missing something obvious, but based on the documentation I've read, this should work. The fault, no doubt, lies not in the stars...

like image 809
cforster Avatar asked Feb 11 '13 20:02

cforster


1 Answers

If you look at the DocumentTermMatrix help page, an at the example, you will see that the control argument is specified this way :

data(crude)
dtm <- DocumentTermMatrix(crude,
           control = list(weighting = function(x) weightTfIdf(x, normalize = FALSE),
                          stopwords = TRUE))

So, the weighting is specified with the list element named weighting, not weight. And you can specify this weighting by passing a function name or a custom function, as in the example. But the following works too :

data(crude)
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf))
like image 61
juba Avatar answered Sep 26 '22 19:09

juba