
Error faced while using TM package's VCorpus in R

I get the error below when working with the tm package in R.

library("tm")
Loading required package: NLP
Warning messages:
1: package ‘tm’ was built under R version 3.4.2 
2: package ‘NLP’ was built under R version 3.4.1 

corpus <- VCorpus(DataframeSource(data))

Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

I have tried reinstalling the package and updating to a newer version of R, but the error persists. The same code runs on the same data file on another system with the same version of R.

Saharsh Gandhi asked Nov 21 '17 06:11


People also ask

What is a VCorpus in R?

VCorpus in tm is a "volatile" corpus: it is held entirely in memory and is destroyed when the R object containing it is destroyed.

What is the TM package in R?

tm provides a set of predefined sources, e.g., DirSource, VectorSource, or DataframeSource, which handle a directory, a vector interpreting each component as document, or data frame like structures (like CSV files), respectively.

Which function is used for converting vector into objects using TM library?

df2tm_corpus - Convert a qdap dataframe to a tm package Corpus .

What is Tm_map?

The tm_map() function applies a transformation to every document in a corpus. It is commonly used to strip unnecessary whitespace, convert text to lower case, and remove common stopwords such as "the" and "we". Stopwords carry almost no information because they are so common in a language, so removing them is useful before further analysis.


1 Answer

I ran into the same problem after updating the tm package to version 0.7-2. The documentation for DataframeSource() says:

The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text".

Details

A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata.
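
The error message in the question is this requirement being checked. As a rough sketch in base R (the data frame and its column names below are made up for illustration, not taken from the asker's file), the same condition can be evaluated by hand to see which names are missing:

```r
# Hypothetical stand-in for the data frame from the question;
# the column names "id" and "English.title" are assumptions.
data <- data.frame(id = c("a", "b"),
                   English.title = c("First title", "Second title"),
                   stringsAsFactors = FALSE)

required <- c("doc_id", "text")

# The same condition that appears in the error message
ok <- all(!is.na(match(required, names(data))))

# Names that DataframeSource() expects but the data frame lacks
missing_cols <- setdiff(required, names(data))

print(ok)
print(missing_cols)
```

If `ok` is FALSE, `missing_cols` lists the column names that still need to be created or renamed.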

I solved it with the following code:

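# Read the CSV, keeping strings as character vectors rather than factors
df_cmp <- read.csv("test_file.csv", stringsAsFactors = FALSE)

# Build a data frame whose first two columns are named doc_id and text
df_title <- data.frame(doc_id = row.names(df_cmp),
                       text = df_cmp$English.title,
                       stringsAsFactors = FALSE)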

In other words, try changing your column names to doc_id and text, then pass that data frame to DataframeSource().
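
If you would rather rename columns in place than build a new data frame, something along these lines should work. This is only a sketch: the data frame `df` and its columns `id`, `English.title`, and `note` are hypothetical, and it assumes tm 0.7-2 or later is installed.

```r
library(tm)

# Hypothetical data frame standing in for the CSV contents
df <- data.frame(id = c("a", "b"),
                 English.title = c("First title", "Second title"),
                 note = c("x", "y"),
                 stringsAsFactors = FALSE)

# Rename the identifier and text columns to the names tm expects
names(df)[names(df) == "id"] <- "doc_id"
names(df)[names(df) == "English.title"] <- "text"

# Put doc_id and text first; any remaining columns are treated
# as document-level metadata
df <- df[, c("doc_id", "text", setdiff(names(df), c("doc_id", "text")))]

corpus <- VCorpus(DataframeSource(df))
```

Reordering the columns follows the documentation's wording that doc_id must be the first column and text the second.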

Eva answered Oct 21 '22 12:10