
Keep document ID with R corpus

Tags: text, r, text-mining, corpus, tm

I have searched Stack Overflow and the web and can only find partial solutions, or ones that no longer work due to changes in tm or qdap. Problem below:

I have a data frame with two columns: ID and Text (a simple document ID/name and then some text).

I have two issues:

Part 1: How can I create a TDM or DTM and keep the document name/ID? inspect(tdm) only shows "character(0)" for it.
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not in the TDM/DTM.

For Part 2, I used a solution I got here: How to implement proximity rules in tm dictionary for counting words?

That solution works on the TDM, though. Is there a better solution for Part 2 where you use something like "tm_map(my.corpus, keepOnlyWords, customlist)"?

Any help will be greatly appreciated. Thanks much!


asked Jul 01 '14 02:07

RUser

1 Answer

In newer versions of tm this is a lot easier with the DataframeSource() function.

"A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata."

So in this case:

library(tm)

# doc_id is documented as a unique string identifier, so convert the integers
dd <- data.frame(
    doc_id = as.character(10:13),
    text = c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors = FALSE
)

# Named `corpus` to avoid confusion with tm's Corpus() function
corpus <- VCorpus(DataframeSource(dd))
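
As a rough sketch of how both parts of the question might fit together: the example below starts from a data frame with columns named ID and Text (those names are an assumption based on the question), renames them to what DataframeSource() expects, keeps only a chosen list of words inside the corpus, and then checks that the document IDs survive into the DTM. The helper keep_only_words() and the keep_terms list are hypothetical and not part of tm; they are wrapped with tm's content_transformer() so tm_map() can apply them.

```r
library(tm)

# Hypothetical input: the column names ID and Text are assumed from the question
df <- data.frame(
  ID = 10:13,
  Text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# DataframeSource() needs columns named doc_id and text
dd <- data.frame(
  doc_id = as.character(df$ID),
  text = df$Text,
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(dd))

# Hypothetical helper (not a tm function): split each string into words and
# keep only those found in `keep`, ignoring case
keep_only_words <- function(text, keep) {
  words <- strsplit(text, "[^[:alnum:]']+")
  vapply(words,
         function(w) paste(w[tolower(w) %in% tolower(keep)], collapse = " "),
         character(1))
}

# Illustrative list of terms to keep
keep_terms <- c("volume", "panic", "influences", "whale")

# Apply the filter in the corpus itself, before any DTM is built
corpus_kept <- tm_map(corpus, content_transformer(keep_only_words),
                      keep = keep_terms)

dtm <- DocumentTermMatrix(corpus_kept)

Docs(dtm)   # document IDs taken from doc_id
Terms(dtm)  # only terms from keep_terms remain
```

The filtering here is done with plain string splitting, so it matches whole words only and ignores case. A document that contains none of the kept terms would end up with empty content, which may or may not be what you want.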

answered Oct 08 '22 10:10

Koot6133
