
Frequency Per Term - R TM DocumentTermMatrix

I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It contains the terms and their frequencies, but I cannot figure out how to access them.

Ideally, I would like:

    Term  # 
    "the" 200 
    "is"  400 
    "a"   200 

Currently my code is:

    library(tm)
    common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you")
    x <- Corpus(VectorSource(results)) 
    x <- tm_map(x, stripWhitespace) 
    x <- tm_map(x, removeNumbers) 
    x <- tm_map(x, removePunctuation) 
    x <- tm_map(x, stripWhitespace)

    dtm <- DocumentTermMatrix(x)
    # Drop the common words in one vectorized step instead of looping
    dtm <- dtm[, !colnames(dtm) %in% common.words]

This is the output from str(dtm)

    List of 6
     $ i       : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ...
     $ j       : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ...
     $ v       : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ...
     $ nrow    : int 1477
     $ ncol    : int 3201
     $ dimnames:List of 2
      ..$ Docs : chr [1:1477] "1" "2" "3" "4" ...
      ..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ...
     - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
     - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thank you,

-A

user1994952 asked Dec 27 '22 11:12


2 Answers

This appears to be a sparse (triplet) representation of the data. The frequencies are in the "v" element, and each one is tied to its term through the matching "j" index, which gives the term's position in dimnames(dtm)$Terms. Why not provide dput(head(results, 30)) so your code (and your SO audience) will have something to work on? After playing around with the examples in the package, I suspect you actually want something along the lines of:

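    tdm <- TermDocumentMatrix(x)
    # as.matrix() returns the full dense matrix; in newer tm releases
    # inspect() prints only a sample, so it is not used here.
    # The requested terms must be present in rownames(tdm).
    z <- as.matrix(tdm[c("the", "is", "a"), ])
    rowSums(z)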
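
The triplet layout described above can also be summed directly, without building a dense matrix first. The following is a minimal sketch, not the asker's actual pipeline: the toy `docs` vector is hypothetical and stands in for `results`, and `wordLengths = c(1, Inf)` is an assumption added so that short terms such as "a" and "is" are kept (by default tm drops terms shorter than three characters). It uses `slam::col_sums()`, from the slam package that tm depends on.

```r
library(tm)

# Hypothetical toy data standing in for the asker's `results`
docs <- c("the cat sat on the mat",
          "the dog is on the log",
          "a cat is a cat")
x <- VCorpus(VectorSource(docs))

# Keep terms of any length; the default wordLengths = c(3, Inf) would drop "a" and "is"
dtm <- DocumentTermMatrix(x, control = list(wordLengths = c(1, Inf)))

# By hand: sum the stored counts (v), grouped by the term each column index (j) points to
freq_triplet <- tapply(dtm$v, dtm$dimnames$Terms[dtm$j], sum)

# Same totals using slam, which works on the sparse structure directly
freq <- slam::col_sums(dtm)

# Two-column Term / Frequency layout, most frequent first
freq_table <- data.frame(Term = names(freq), Frequency = as.vector(freq))
freq_table <- freq_table[order(-freq_table$Frequency), ]
print(freq_table, row.names = FALSE)
```

For a TermDocumentMatrix the terms are the rows, so `slam::row_sums(tdm)` would be the equivalent call.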
IRTFM answered Jan 08 '23 05:01


I had the same problem and found what I think is a simpler way:

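    library(dplyr)  # provides %>%, arrange() and desc()

    num <- 10 # Show this many top frequent terms

    # findFreqTerms() returns terms in matrix order, not by frequency,
    # so sort the totals explicitly before taking the top `num`
    tdm %>%
      as.matrix() %>%
      rowSums() %>%
      sort(decreasing = TRUE) %>%
      head(num)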

Printing in columns is trickier (I'm sure someone has a much better way than this):

    # Total count per term, then a two-column table sorted by frequency
    freq <- tdm %>%
      as.matrix() %>%
      rowSums()
    data.frame(Term = names(freq), Frequency = freq, row.names = NULL) %>%
      arrange(desc(Frequency)) %>%
      head(num)
Bob answered Jan 08 '23 03:01