tm: read in data frame, keep text id's, construct DTM and join to other dataset

Question

I'm using the tm package.

Say I have a data frame with 2 columns and 500 rows. The first column is an ID that is randomly generated and contains both letters and digits, e.g. "txF87uyK". The second column is the actual text: "Today's weather is good. John went jogging. blah, blah,..."

Now I want to create a document-term matrix from this data frame.

My problem is that I want to keep the ID information, so that once I have the document-term matrix I can join it with another matrix in which each row holds other information about a document (date, topic, sentiment) and is identified by the document ID.

How can I do that?

Question 1: How do I convert this data frame into a corpus while keeping the ID information?

Question 2: After getting a dtm, how can I join it with another data set by ID?

juhariis · Accepted Answer

The tm package was updated in December 2017 (version 0.7-2), and readTabular is gone:

"Changes in tm version 0.7-2
SIGNIFICANT USER-VISIBLE CHANGES
DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropensci/tif)."

This makes it a bit easier to get your ID (and whatever other metadata you need) for each document into the corpus, as described in https://cran.r-project.org/web/packages/tm/news.html
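As a rough sketch of how this could look with tm 0.7-2 or later, answering both parts of the question: the example data frames below (`df`, `df2`, their column names and their contents) are made up for illustration and are not from the original post. The ID column is renamed to `doc_id` and the text column to `text`, the corpus and document-term matrix are built, and the result is joined to the second data set with base R's `merge`.

```r
library(tm)

# Hypothetical example data, shaped like the data frame in the question
df <- data.frame(
  ID  = c("txF87uyK", "aB3kLm9Q", "Zq81xPw2"),
  txt = c("Today's weather is good.",
          "John went jogging.",
          "Rain is expected tomorrow."),
  stringsAsFactors = FALSE
)

# Hypothetical second data set, keyed by the same document IDs
df2 <- data.frame(
  ID        = c("txF87uyK", "aB3kLm9Q", "Zq81xPw2"),
  sentiment = c("positive", "neutral", "negative"),
  stringsAsFactors = FALSE
)

# DataframeSource expects the columns "doc_id" and "text", in that order;
# any further columns would be kept as document-level metadata
src_df <- data.frame(doc_id = df$ID, text = df$txt, stringsAsFactors = FALSE)
corpus <- VCorpus(DataframeSource(src_df))

# The document IDs end up as the row names of the document-term matrix
dtm <- DocumentTermMatrix(corpus)

# Turn the DTM into a data frame with an explicit ID column, then join by ID
dtm_df <- data.frame(ID = Docs(dtm), as.matrix(dtm),
                     check.names = FALSE, stringsAsFactors = FALSE)
merged <- merge(dtm_df, df2, by = "ID")
```

Note that `as.matrix(dtm)` produces a dense matrix, which is fine for 500 short documents but may need more care for a large corpus.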

Tyler Rinker · Answer

qdap 1.2.0 can do both tasks with little coding, though not as a one-liner ;-), and not necessarily faster than Ben's approach (key_merge is just a convenience wrapper for merge). The code uses all of Ben's data, i.e. the data frames df and df2 (which makes my answer look smaller when it's not that much smaller).

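## The code
library(qdap)

# Build a corpus from the text column, using ID as the document identifier
mycorpus <- with(df, as.Corpus(txt, ID))

# Word frequency matrix without English stopwords, filtered to words of
# 3 to 10 characters, then converted to a document-term matrix
mydtm <- as.dtm(Filter(as.wfm(mycorpus, 
     col1 = "docs", col2 = "text", 
     stopwords = tm::stopwords("english")), 3, 10))

# Move the document IDs from the row names into an "ID" column, then join with df2
key_merge(matrix2df(mydtm, "ID"), df2, "ID")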

Tags: r, text-mining, tm

GorillaInR
