Error when using the tm package's VCorpus in R

Q: What is a VCorpus in R?

A VCorpus in tm is a "volatile" corpus: it is held entirely in memory and is destroyed when the R object containing it is destroyed.

Q: What is the tm package in R?

tm is a text-mining framework for R. Among other things, it provides a set of predefined sources, e.g., DirSource, VectorSource, and DataframeSource, which handle a directory, a vector (interpreting each component as a document), and data-frame-like structures (such as the contents of CSV files), respectively.

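Q: Which function converts a vector into a corpus with the tm library?

VectorSource() wraps a character vector so that each element is treated as one document; the result is passed to VCorpus(). Separately, df2tm_corpus converts a qdap data frame to a tm package Corpus.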
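For a plain character vector, one common route in tm is to pass VectorSource() to VCorpus(). The following is a minimal sketch, assuming tm is installed; the vector `texts` and its contents are made up for illustration.

```r
library(tm)

# Hypothetical input: each element becomes one document
texts <- c("First document.", "Second document.")

# VectorSource() treats every vector element as a document;
# VCorpus() builds an in-memory (volatile) corpus from that source
corpus <- VCorpus(VectorSource(texts))

# Number of documents and the content of the first one
n_docs <- length(corpus)
first_doc <- content(corpus[[1]])
```

Because the corpus is volatile, it exists only as long as the `corpus` object does.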

Q: What is tm_map?

tm_map() applies a transformation function to every document in a corpus. It is typically used to strip unnecessary white space, convert text to lower case, and remove common stopwords such as "the" and "we". Stopwords carry almost no information because they are so common in a language, so removing them is useful before further analysis.
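The cleaning steps described here can be sketched as a short tm_map() pipeline. This assumes tm is installed; the sample sentences are invented, and the order of the steps is one reasonable choice rather than a fixed rule.

```r
library(tm)

# Hypothetical sample documents with mixed case and extra spaces
docs <- c("The  quick brown fox", "We LIKE   text mining")
corpus <- VCorpus(VectorSource(docs))

# Lower-case first so that stopword matching is case-insensitive;
# content_transformer() wraps a plain function for use on documents
corpus <- tm_map(corpus, content_transformer(tolower))

# Drop common English stopwords such as "the" and "we"
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Collapse runs of white space left behind by the removals
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a character vector
cleaned <- vapply(corpus,
                  function(d) paste(content(d), collapse = " "),
                  character(1))
```

Running stripWhitespace last tidies up the gaps that removeWords leaves where stopwords used to be.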

Tags: r, text-mining, tm, text-analysis

I get the error below when working with the tm package in R.

library("tm")
Loading required package: NLP
Warning messages:
1: package ‘tm’ was built under R version 3.4.2 
2: package ‘NLP’ was built under R version 3.4.1

corpus <- VCorpus(DataframeSource(data))

Error: all(!is.na(match(c("doc_id", "text"), names(x)))) is not TRUE

I have tried reinstalling the package and updating to a newer version of R, but the error persists. The same code runs without error on the same data file on another system with the same version of R.


asked Nov 21 '17 06:11

Saharsh Gandhi

1 Answer

I ran into the same problem after updating the tm package to version 0.7-2. The Details section of the DataframeSource() documentation states:

A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata.
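The expression in the error message can be evaluated directly in base R to see which required column names are missing. This is a small sketch; the data frame `data` and its columns `id` and `title` are hypothetical stand-ins for the asker's file.

```r
# Hypothetical data frame whose column names DataframeSource() would reject
data <- data.frame(id = c("a", "b"),
                   title = c("First title", "Second title"),
                   stringsAsFactors = FALSE)

required <- c("doc_id", "text")

# Same test as in the error message: every required name must be present
ok <- all(!is.na(match(required, names(data))))

# List the required names that are absent
missing_cols <- setdiff(required, names(data))
```

Here `ok` is FALSE because neither `doc_id` nor `text` is among the column names, which is the condition that triggers the error.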

I solved it with the following code:

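# Read the data, keeping strings as character vectors rather than factors
df_cmp <- read.csv("test_file.csv", stringsAsFactors = FALSE)

# Build a data frame whose first two columns are named doc_id and text
df_title <- data.frame(doc_id = row.names(df_cmp),
                       text = df_cmp$English.title,
                       stringsAsFactors = FALSE)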

In short, make sure the first two columns of the data frame you pass to DataframeSource() are named doc_id and text.
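If the data frame already exists, the columns can be renamed and reordered in place instead of building a new one. The sketch below assumes tm is installed; the data frame and its original column names (`id`, `title`, `year`) are hypothetical.

```r
library(tm)

# Hypothetical existing data frame with the "wrong" column names
data <- data.frame(year = c(2016, 2017),
                   id = c(1, 2),
                   title = c("First title", "Second title"),
                   stringsAsFactors = FALSE)

# Rename the identifier and content columns to what DataframeSource() expects
names(data)[names(data) == "id"] <- "doc_id"
names(data)[names(data) == "title"] <- "text"

# Put doc_id and text first; remaining columns follow as document metadata
other_cols <- setdiff(names(data), c("doc_id", "text"))
data <- data[, c("doc_id", "text", other_cols)]

# The documentation asks for string identifiers
data$doc_id <- as.character(data$doc_id)

corpus <- VCorpus(DataframeSource(data))
```

Any extra columns, such as `year` here, are kept as document-level metadata, as described in the documentation quoted above.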


answered Oct 21 '22 12:10

Eva
