Why do I get Error in na.fail.default(list(doc.class = c(3L, 1L...missing values in object

Tags:

r

I'm a complete beginner in R and this is my first post on Stack Overflow. Please be gentle :)

I'm trying to learn R by following tutorials and practical examples, but I got stuck on this one and don't know what I'm doing wrong.

I'm trying to follow the tutorial as posted here, but I get the following error message partway through, when I try to train the model:

Error in na.fail.default(list(doc.class = c(3L, 1L, 1L, 1L, 1L, 1L, 1L,  : 
  missing values in object

I hope someone can help me understand what is going on here. I inspected tdmTrain and it contains only NA values. I'm just not sure why, or how to fix it.

This is the code up to the step where I get the error message.

library(NLP)
library(tm) 
library(caret) 

r8train <- read.table("r8-train-all-terms.txt", header=FALSE, sep='\t')
r8test <- read.table("r8-test-all-terms.txt", header=FALSE, sep='\t')

# rename variables
names(r8train) <- c("Class", "docText")
names(r8test) <- c("Class", "docText")

# convert the document text variable to character type
r8train$docText <- as.character(r8train$docText)
r8test$docText <- as.character(r8test$docText)

# create variable to denote if observation is train or test
r8train$train_test <- c("train")
r8test$train_test <- c("test")

# merge the train/test data
merged <- rbind(r8train, r8test)

# remove objects that are no longer needed 
remove(r8train, r8test)

merged <- merged[which(merged$Class %in% c("crude","money-fx","trade")),]

# drop unused levels in the response variable
merged$Class <- droplevels(merged$Class) 

# counts of each class in the train/test sets
table(merged$Class,merged$train_test)

# a vector source interprets each element of the vector as a document
sourceData <- VectorSource(merged$docText)

# create the corpus
corpus <- Corpus(sourceData)

# preprocess/clean the training corpus
corpus <- tm_map(corpus, content_transformer(tolower)) # convert to lowercase
corpus <- tm_map(corpus, removeNumbers) # remove digits
corpus <- tm_map(corpus, removePunctuation) # remove punctuation
corpus <- tm_map(corpus, stripWhitespace) # strip extra whitespace
corpus <- tm_map(corpus, removeWords, stopwords('english')) # remove stopwords

# create term document matrix (tdm)
tdm <- DocumentTermMatrix(corpus)

as.matrix(tdm)[10:20,200:210] # inspect a portion of the tdm

# create tf-idf weighted version of term document matrix
weightedtdm <- weightTfIdf(tdm)
as.matrix(weightedtdm)[10:20,200:210] # inspect same portion of the weighted tdm

# find frequent terms: terms that appear in at least "250" documents here, about 25% of the docs
findFreqTerms(tdm, 250)

# convert tdm's into data frames 
tdm <- as.data.frame(inspect(tdm))
weightedtdm <- as.data.frame(inspect(weightedtdm))

# split back into train and test sets
tdmTrain <- tdm[which(merged$train_test == "train"),]
weightedTDMtrain <- weightedtdm[which(merged$train_test == "train"),]

tdmTest <-  tdm[which(merged$train_test == "test"),]
weightedTDMtest <- weightedtdm[which(merged$train_test == "test"),]

# remove objects that are no longer needed to conserve memory
remove(tdm,weightedtdm)

# append document labels as last column
tdmTrain$doc.class <- merged$Class[which(merged$train_test == "train")]
tdmTest$doc.class <- merged$Class[which(merged$train_test == "test")]
weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")]
weightedTDMtest$doc.class  <- merged$Class[which(merged$train_test == "test")]

# set resampling scheme
ctrl <- trainControl(method="repeatedcv",number = 10, repeats = 3) #,classProbs=TRUE)

# fit a kNN model using the weighted (tf-idf) term document matrix
# tuning parameter: K
set.seed(100)
knn.tfidf <- train(doc.class ~ ., data = weightedTDMtrain, method = "knn", trControl = ctrl) #, tuneLength = 20)

asked May 11 '18 12:05

AndiADL

1 Answer

The problem lies in this part of the code:

tdm <- as.data.frame(inspect(tdm))
weightedtdm <- as.data.frame(inspect(weightedtdm))

dim(weightedtdm) # returns rows and columns
[1] 10 10

Never use inspect() to create a data.frame out of a tdm. It gives you only 10 rows and 10 columns, not all of the data in the tdm.

You need to use:

tdm <- as.data.frame(as.matrix(tdm))
weightedtdm <- as.data.frame(as.matrix(weightedtdm))

dim(weightedtdm)
[1]  993 9243

Here you can see the enormous difference between the two approaches.
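One way to catch this kind of truncation early is to compare the row count of the converted data frame with the number of documents before splitting. The following is a minimal sketch on a made-up three-document corpus, assuming the tm package is installed; the docs vector and the variable names are illustrative and do not come from the tutorial:

```r
library(tm)

# Toy documents standing in for merged$docText (hypothetical data)
docs <- c("crude oil prices rise",
          "money market rates fall",
          "trade deficit widens")

corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# as.matrix() densifies the full document-term matrix: one row per document
dtm_df <- as.data.frame(as.matrix(dtm))

# Guard: stop early if the conversion lost documents
stopifnot(nrow(dtm_df) == length(docs))

dim(dtm_df)
```

In the question's code the equivalent check would compare the row count of the converted data frame with nrow(merged) right after the conversion step.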

Using the first weightedtdm will result in 700 NA values for all columns except doc.class when you run weightedTDMtrain$doc.class <- merged$Class[which(merged$train_test == "train")]. The split step asks for row positions that a 10-row data frame does not have, and R fills those rows with NA. This is the reason why train returns the error message.

The second approach works, and your train will start to run (slowly, because of the repeated cross-validation).
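The NA rows come from ordinary data frame indexing rather than from anything specific to tm. A small base R sketch with made-up values (the names small and picked are hypothetical) shows the effect of asking a 10-row data frame for more rows than it has:

```r
# A stand-in for the truncated 10-row data frame (made-up values)
small <- data.frame(term_a = 1:10, term_b = 11:20)

# Request rows 1..15 although only 10 exist
picked <- small[1:15, ]

nrow(picked)                  # 15: R returns one row per requested position
sum(!complete.cases(picked))  # 5: rows 11..15 are filled with NA

# Attaching labels afterwards does not repair the NA predictors
picked$doc.class <- factor(rep(c("crude", "trade", "money-fx"), 5))
anyNA(picked)                 # TRUE
```

A data frame like picked has valid labels next to missing predictors, which is the situation na.fail complains about.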
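To get a rough feel for why the run is slow, the number of model fits can be counted from the resampling settings. This is only a back-of-the-envelope sketch: it takes number and repeats from the question's trainControl call and assumes train is left at caret's default tuneLength of 3 (the question has tuneLength commented out):

```r
# Settings from the question's trainControl() call
folds   <- 10   # number = 10
repeats <- 3    # repeats = 3

# Assumption: train() uses its default tuneLength of 3,
# i.e. three candidate values of k for method = "knn"
candidates <- 3

resamples_per_candidate <- folds * repeats
total_fits <- resamples_per_candidate * candidates + 1  # +1 for the final refit

resamples_per_candidate
total_fits
```

Each of those fits works on a data frame with several thousand term columns, so for a first trial it may be worth lowering repeats or using method = "cv" in trainControl and restoring the full scheme once the pipeline runs end to end.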

answered Oct 15 '22 09:10

phiver
