I am trying to work with the tm package in R, and have a CSV file of customer feedback with each line being a different instance of feedback. I want to import all the content of this feedback into a corpus but I want each line to be a different document within the corpus, so that I can compare the feedback in a DocTerms Matrix. There are over 10,000 rows in my data set.
Originally I did the following:
fdbk_corpus <-Corpus(VectorSource(fdbk), readerControl = list(language="eng"), sep="\t")
This creates a corpus with 1 document and >10,000 rows, and I want >10,000 docs with 1 row each.
I imagine I could just have 10,000+ separate CSV or TXT documents within a folder and create a corpus from that... but I'm thinking there is a much simpler answer than that, reading each line as a separate document.
Reading a CSV file The CSV file to be read should be either present in the current working directory or the directory should be set accordingly using the setwd(…) command in R. The CSV file can also be read from a URL using read. csv() function.
In order to read multiple CSV files or all files from a folder in R, use data. table package. data. table is a third-party library hence, in order to use data.
Just use dat <- read. csv("file. csv") and then select the column with dat$column , and you'll get a vector. The csv is, by definition, a text file with columns separated with commas and the same number of columns for all lines.
Method 1: Using read. table() function. In this method of only importing the selected columns of the CSV file data, the user needs to call the read. table() function, which is an in-built function of R programming language, and then passes the selected column in its arguments to import particular columns from the data.
Here's a complete workflow to get what you want:
# change this file location to suit your machine
file_loc <- "C:\\Documents and Settings\\Administrator\\Desktop\\Book1.csv"
# change TRUE to FALSE if you have no column headings in the CSV
x <- read.csv(file_loc, header = TRUE)
require(tm)
corp <- Corpus(DataframeSource(x))
dtm <- DocumentTermMatrix(corp)
In the dtm
object each row will be a doc, or a line of your original CSV file. Each column will be a word.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With