I am getting the error below while working with the tm package in R.
library("tm")
Loading required package: NLP
Warning messages:
1: package ‘tm’ was built under R version 3.4.2
2: package ‘NLP’ was built under R version 3.4.1
corpus <- VCorpus(DataframeSource(data))
Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE
I have tried reinstalling the package and updating to a newer version of R, but the error persists. The same code runs without error on the same data file on another system with the same version of R.
A VCorpus in tm is a "volatile" corpus: the corpus is held in memory and is destroyed when the R object containing it is destroyed.
tm provides a set of predefined sources, e.g., DirSource, VectorSource, or DataframeSource, which handle a directory, a vector interpreting each component as document, or data frame like structures (like CSV files), respectively.
df2tm_corpus - Convert a qdap dataframe to a tm package Corpus.
The tm_map() function applies a transformation to every document in a corpus. It is commonly used to strip unnecessary white space, convert the text to lower case, and remove common stopwords such as "the" and "we". Stopwords carry almost no information because they are so frequent in a language, so removing them is a useful step before further analysis.
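The cleaning steps described above could be sketched roughly as follows. This is only an illustration: the sample sentences and the variable names (`raw_text`, `clean`) are made up, and the ordering of the steps is one reasonable choice rather than a requirement. The tm calls are guarded so the snippet does nothing if tm is not installed.

```r
# Hypothetical sample documents, invented for this illustration.
raw_text <- c("The   quick brown fox", "We saw   the lazy dog")

has_tm <- requireNamespace("tm", quietly = TRUE)

if (has_tm) {
  corpus <- tm::VCorpus(tm::VectorSource(raw_text))

  # Lower-case first so that "The" and "We" match the lower-case stopword list.
  clean <- tm::tm_map(corpus, tm::content_transformer(tolower))

  # Drop common English stopwords such as "the" and "we".
  clean <- tm::tm_map(clean, tm::removeWords, tm::stopwords("english"))

  # Removing words leaves gaps behind, so collapse repeated white space last.
  clean <- tm::tm_map(clean, tm::stripWhitespace)

  print(NLP::content(clean[[1]]))
}
```

Stripping white space last is a deliberate choice here, because removeWords leaves extra spaces where the stopwords used to be.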
I ran into the same problem when I updated the tm package to version 0.7-2.
I looked up the documentation for DataframeSource(), which says:
The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text".
Details
A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata.
I solved it with the following code:
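# Read the CSV, keeping strings as character vectors rather than factors
df_cmp <- read.csv("test_file.csv", stringsAsFactors = FALSE)
# DataframeSource() needs "doc_id" as the first column and "text" as the second
df_title <- data.frame(doc_id = row.names(df_cmp),
                       text = df_cmp$English.title,
                       stringsAsFactors = FALSE)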
You can try changing the column names to doc_id and text.
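As a rough sketch of the renaming approach, assuming a data frame whose identifier and text columns are called `id` and `title` (hypothetical names and values, not taken from the question's data), the columns can be renamed and reordered in place before building the corpus. The final tm call is guarded so the snippet still runs if tm is not installed.

```r
# Hypothetical input data; the column names and values are invented.
data <- data.frame(
  id    = c("a1", "a2"),
  title = c("First document text", "Second document text"),
  year  = c(2016, 2017),
  stringsAsFactors = FALSE
)

# Rename the identifier and text columns to the names DataframeSource() expects.
names(data)[names(data) == "id"]    <- "doc_id"
names(data)[names(data) == "title"] <- "text"

# Put doc_id first and text second; remaining columns become document metadata.
data <- data[, c("doc_id", "text", setdiff(names(data), c("doc_id", "text")))]

# This is the condition from the error message in the question.
stopifnot(all(!is.na(match(c("doc_id", "text"), names(data)))))

if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::DataframeSource(data))
  print(length(corpus))
}
```

Renaming keeps the extra columns, so they remain available as document-level metadata rather than being dropped.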