
tm: read in data frame, keep text id's, construct DTM and join to other dataset

Tags: r, text-mining, tm

I'm using package tm.

Say I have a data frame with 2 columns and 500 rows. The first column is an ID, which is randomly generated and contains both letters and digits: "txF87uyK". The second column is the actual text: "Today's weather is good. John went jogging. blah, blah,..."

Now I want to create a document-term matrix from this data frame.

My problem is that I want to keep the ID information, so that once I have the document-term matrix I can join it with another matrix in which each row holds other information about a document (date, topic, sentiment) and is identified by the document ID.

How can I do that?

Question 1: How do I convert this data frame into a corpus and keep the ID information?

Question 2: After getting a dtm, how can I join it with another data set by ID?

GorillaInR asked Nov 08 '13 02:11


2 Answers

The tm package was updated in December 2017, and readTabular is gone. From the package news:

"Changes in tm version 0.7-2
SIGNIFICANT USER-VISIBLE CHANGES
DataframeSource now only processes data frames with the two mandatory columns "doc_id" and "text". Additional columns are used as document level metadata. This implements compatibility with Text Interchange Formats corpora (https://github.com/ropensci/tif)."

This change makes it a bit easier to get your ID (and whatever other metadata you need) for each document into the corpus. The full change notes are at https://cran.r-project.org/web/packages/tm/news.html
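As a minimal sketch of how that could look in practice, assuming tm 0.7-2 or later: the data frames docs and info below are made-up stand-ins for the question's data, and the column names date and sentiment are only illustrative. The ID column has to be renamed to doc_id and the text column to text before building the corpus.

```r
library(tm)

# Hypothetical example data. DataframeSource expects the first two
# columns to be named "doc_id" and "text" (tm >= 0.7-2).
docs <- data.frame(
  doc_id = c("txF87uyK", "pQ92kLm3"),
  text   = c("Today's weather is good. John went jogging.",
             "The weather turned bad, so John stayed home."),
  stringsAsFactors = FALSE
)

# Hypothetical second data set, keyed by the same document IDs.
info <- data.frame(
  doc_id    = c("txF87uyK", "pQ92kLm3"),
  date      = as.Date(c("2013-11-07", "2013-11-08")),
  sentiment = c("positive", "negative"),
  stringsAsFactors = FALSE
)

# Question 1: build the corpus; the doc_id values become the document IDs.
corpus <- VCorpus(DataframeSource(docs))
dtm    <- DocumentTermMatrix(corpus)

# Question 2: turn the DTM into a data frame, expose the row names
# (the document IDs) as a column, and join on that column.
dtm_df <- as.data.frame(as.matrix(dtm), stringsAsFactors = FALSE)
dtm_df$doc_id <- rownames(dtm_df)

joined <- merge(info, dtm_df, by = "doc_id")
```

Converting the DTM with as.matrix is fine for 500 documents but can use a lot of memory on a large corpus. If a term happens to share a name with a column in the other data set, merge will add suffixes to keep the two apart.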

juhariis answered Sep 22 '22 01:09


qdap 1.2.0 can do both tasks with little coding, though not as a one-liner ;-), and not necessarily faster than Ben's approach (key_merge is a convenience wrapper for merge). This uses all of Ben's data, i.e. the df and df2 objects from his answer (which makes my answer look smaller, when it's really not that much smaller).

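## The code
library(qdap)

# Question 1: build a corpus from the text column, keeping the IDs
mycorpus <- with(df, as.Corpus(txt, ID))

# Convert to a word frequency matrix without English stop words,
# keep words of 3 to 10 characters, then convert to a DTM
mydtm <- as.dtm(Filter(as.wfm(mycorpus, 
     col1 = "docs", col2 = "text", 
     stopwords = tm::stopwords("english")), 3, 10))

# Question 2: move the row names into an "ID" column and join with df2
key_merge(matrix2df(mydtm, "ID"), df2, "ID")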
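Since key_merge is described above as a convenience wrapper for merge, the final join can also be sketched in base R alone. The matrix m and the data frame meta_df below are hypothetical stand-ins for as.matrix(mydtm) and df2, with made-up counts and topics:

```r
# Hypothetical stand-in for as.matrix(mydtm): one row per document,
# with the document IDs as row names and one column per term.
m <- matrix(c(1, 0, 2, 1), nrow = 2,
            dimnames = list(c("txF87uyK", "pQ92kLm3"),
                            c("weather", "john")))

# Hypothetical stand-in for df2: other per-document information.
meta_df <- data.frame(
  ID    = c("pQ92kLm3", "txF87uyK"),
  topic = c("sports", "weather-report"),
  stringsAsFactors = FALSE
)

# Move the row names into an explicit ID column, then join on it.
dtm_df <- data.frame(ID = rownames(m), m, row.names = NULL,
                     stringsAsFactors = FALSE)
joined <- merge(dtm_df, meta_df, by = "ID")
```

By default merge keeps only IDs present in both inputs; passing all = TRUE would keep unmatched documents as well, with NA in the missing columns.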
Tyler Rinker answered Sep 24 '22 01:09