I'm using the tm package.
Say I have a data frame with 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both letters and digits: "txF87uyK". The second column is the actual text: "Today's weather is good. John went jogging. blah, blah,..."
Now I want to create a document-term matrix from this data frame.
My problem is that I want to keep the ID information, so that once I have the document-term matrix I can join it with another matrix that holds other information about each document (date, topic, sentiment), one row per document, identified by document ID.
How can I do that?
Question 1: How do I convert this data frame into a corpus and keep the ID information?
Question 2: After getting a dtm, how can I join it with another data set by ID?
The tm package was updated in December 2017, and readTabular is gone. From the release notes:
"Changes in tm version 0.7-2
SIGNIFICANT USER-VISIBLE CHANGES
DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropensci/tif)."
This makes it a bit easier to get your ID (and whatever other metadata you need) for each document into the corpus, as described in https://cran.r-project.org/web/packages/tm/news.html
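As a rough sketch of what that looks like in practice, assuming tm 0.7-2 or later: the toy data, the `other` metadata table, and the `date` and `topic` columns below are made up for illustration and are not from the question. If your columns are named differently (e.g. `ID`), rename them to `doc_id` and `text` first.

```r
library(tm)

# Hypothetical stand-in for the 500-row data frame; the first two
# columns are named doc_id and text, as DataframeSource expects.
df <- data.frame(
  doc_id = c("txF87uyK", "a9Qz31Lp", "Zk4mW0bT"),
  text   = c("Today's weather is good. John went jogging.",
             "The weather turned bad, so John stayed home.",
             "Jogging in good weather is fun."),
  date   = c("2017-12-01", "2017-12-02", "2017-12-03"),  # extra column becomes metadata
  stringsAsFactors = FALSE
)

# Question 1: build the corpus; doc_id becomes the document ID
corpus <- VCorpus(DataframeSource(df))
meta(corpus)  # document-level metadata taken from the extra column(s)

dtm <- DocumentTermMatrix(corpus, control = list(removePunctuation = TRUE))

# Question 2: turn the dtm into a data frame keyed by doc_id, then merge
dtm_df <- data.frame(doc_id = Docs(dtm), as.matrix(dtm),
                     check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical second data set, also keyed by doc_id
other <- data.frame(
  doc_id = c("Zk4mW0bT", "txF87uyK", "a9Qz31Lp"),
  topic  = c("sport", "weather", "weather"),
  stringsAsFactors = FALSE
)

merged <- merge(dtm_df, other, by = "doc_id")
```

Note that `as.matrix(dtm)` produces a dense matrix, which is fine for 500 documents but may use a lot of memory on a large corpus.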
qdap 1.2.0 can do both tasks with little coding, though not as a one-liner ;-), and not necessarily faster than Ben's approach (as key_merge is a convenience wrapper for merge). This uses all of Ben's data from above (which makes my answer look smaller when it's not that much smaller).
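## The code
## df (columns ID and txt) and df2 (keyed by ID) are the data frames from Ben's answer
library(qdap)
## Question 1: build a corpus, using the ID column as the document names
mycorpus <- with(df, as.Corpus(txt, ID))
## Build a word frequency matrix without English stopwords,
## filter it by word length (3 to 10), then convert it to a dtm
mydtm <- as.dtm(Filter(as.wfm(mycorpus,
    col1 = "docs", col2 = "text",
    stopwords = tm::stopwords("english")), 3, 10))
## Question 2: move the row names into an ID column and merge with df2 by ID
key_merge(matrix2df(mydtm, "ID"), df2, "ID")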