
Stem completion in R replaces names, not data

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that I'm not counting variations on the same word as different terms.

The only problem is that the stemming algorithm leaves behind some stems that aren't really words. "Happiness" stems to "happi", "arrange" stems to "arrang", and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.

By reading through some previous threads here on Stack Overflow, I came across a function, stemCompletion(), from the tm package, that does this, at least approximately. It seems to work reasonably well.

But when I apply it to the terms vector within a document-term matrix, stemCompletion() always replaces the names of the character vector, not the characters themselves. Here's a reproducible example:

# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)
library(magrittr) # Provides the %>% pipe used below

# Get the first 200 lines of Mansfield Park (each element is a line of text)
words <- head(mansfieldpark, 200)

# Build a corpus from words
corpus <- quanteda::corpus(words)

# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")

# Create a document-term matrix and convert it to the topicmodels format
dtm <- corpus %>% 
    quanteda::dfm(remove_punct = TRUE,
                  remove = STOPWORDS) %>%
    quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
    quanteda::convert("topicmodels")

# Word stems are now stored in dtm$dimnames$Terms

# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)

# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)

# Apply tm::stemCompletion to Terms
unstemmed_terms <-
    tm::stemCompletion(dtm$dimnames$Terms, 
                       dictionary = words, # or corpus
                       type = "shortest")

# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)

tail(unstemmed_terms, 20)

I'm looking for a way to get the results returned by stemCompletion() into a character vector, and not into the names attribute of a character vector. Any insights into this issue are much appreciated.

J. Trimarco asked Apr 04 '18 22:04


1 Answer

The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.

tail(words)
# [1] "most liberal-minded sister and aunt in the world."                        
# [2] ""                                                                         
# [3] "When the subject was brought forward again, her views were more fully"    
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall" 
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"   
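
To see why whole lines give nothing but NAs: as far as I can tell from the tm source, stemCompletion() looks for dictionary entries that begin with each stem. The base R sketch below only imitates that prefix match with grep(); it is not tm's actual implementation, and the two sentences and all variable names are made up for illustration.

```r
# Illustrative data (not from the novel): a "dictionary" made of whole lines
lines_dict <- c("she spoke of happiness and comfort",
                "they lived happily")

# The same text split into individual words
words_dict <- unlist(strsplit(lines_dict, " ", fixed = TRUE))

stem <- "happi"

# Prefix match against whole lines: no line starts with the stem
from_lines <- grep(paste0("^", stem), lines_dict, value = TRUE)

# Prefix match against single words: both variants are found
from_words <- grep(paste0("^", stem), words_dict, value = TRUE)

print(from_lines)
print(from_words)
```

With lines as the dictionary, a stem can only match if a line happens to start with it, which is why nearly every completion comes back as NA.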

The lines can easily be tokenised with quanteda's tokens(), and the result converted to a character vector:

unstemmed_terms <-
    tm::stemCompletion(dtm$dimnames$Terms, 
                       dictionary = as.character(tokens(words, remove_punct = TRUE)), 
                       type = "shortest")

tail(unstemmed_terms, 20)
#      arrang          chariti           perhap         parsonag          convers            happi 
# "arranging"               NA        "perhaps"               NA   "conversation"        "happily" 
#      belief             most     liberal-mind             aunt            again             view 
#    "belief"           "most" "liberal-minded"           "aunt"          "again"          "views" 
#     explain             calm          inquiri            where             come            heard 
# "explained"           "calm"               NA               NA           "come"          "heard" 
#     surpris            total 
#  "surprise"        "totally" 
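
As for the names: stemCompletion() returns a named character vector with the stems as names and the completions as values, so the completions are already the data. If a plain vector is wanted, something like the following may work. It uses a small hand-built vector shaped like the output above (values copied from it) so that it runs without tm. Falling back to the stem where no completion was found is my own assumption about what is useful for plotting, not something the package does.

```r
# Hand-built stand-in for stemCompletion() output:
# names are the stems, values are the completions (NA = no match found)
completed <- c(arrang  = "arranging",
               chariti = NA,
               perhap  = "perhaps",
               happi   = "happily")

# Assumption: keep the bare stem wherever no completion was found
filled <- ifelse(is.na(completed), names(completed), completed)

# Drop the names attribute to get a plain character vector
plain_terms <- unname(filled)

print(plain_terms)
```

The resulting vector has the same length and order as the stemmed terms, so it could be assigned back to dtm$dimnames$Terms before visualising.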
Ken Benoit answered Oct 18 '22 00:10