Keep document ID with R corpus

I have searched Stack Overflow and the web, but I can only find partial solutions or ones that no longer work because of changes in tm or qdap. My problem is below.

I have a data frame with two columns, ID and Text (a simple document ID/name followed by some text).

I have two issues:

Part 1: How can I create a tdm or dtm and maintain the document name/id? It only shows "character(0)" on inspect(tdm).
Part 2: I want to keep only a specific list of terms, i.e. the opposite of removing custom stopwords. I want this to happen in the corpus, not in the tdm/dtm.

For Part 2, I used a solution I got here: How to implement proximity rules in tm dictionary for counting words?

However, that solution operates on the tdm. Is there a better solution for Part 2 that works on the corpus itself, something like "tm_map(my.corpus, keepOnlyWords, customlist)"?

Any help will be greatly appreciated. Thanks much!

asked Jul 01 '14 02:07 by RUser


1 Answer

In newer versions of tm this is a lot easier with the DataframeSource() function.

"A data frame source interprets each row of the data frame x as a document. The first column must be named "doc_id" and contain a unique string identifier for each document. The second column must be named "text" and contain a "UTF-8" encoded string representing the document's content. Optional additional columns are used as document level metadata."

So in this case:

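library(tm)

# Each row is one document; doc_id supplies the document ID
dd <- data.frame(
    doc_id = as.character(10:13),  # the documentation asks for string identifiers
    text = c("No wonder, then, that ever gathering volume from the mere transit ",
      "So that in many cases such a panic did he finally strike, that few ",
      "But there were still other and more vital practical influences at work",
      "Not even at the present day has the original prestige of the Sperm Whale"),
    stringsAsFactors = FALSE
)

# Named "corpus" rather than "Corpus" to avoid confusion with tm's Corpus() function
corpus <- VCorpus(DataframeSource(dd))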
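
To check that the IDs survive into the dtm, and to cover Part 2 inside the corpus, something along these lines might work. This is only a sketch: the sample data, the helper keepOnlyWords, and the word list customlist are made up for illustration (keepOnlyWords is not a tm function), and the helper assumes plain lowercase ASCII words are enough for your terms.

```r
library(tm)

# Hypothetical sample data; DataframeSource() requires the columns doc_id and text
dd <- data.frame(
  doc_id = c("doc10", "doc11", "doc12"),
  text = c("The whale swam past the ship",
           "A panic struck the crew of the ship",
           "No whale was seen that day"),
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(dd))

# Hypothetical helper (not part of tm): lowercase the text, split it on
# anything that is not a-z, and keep only the words listed in `keep`
keepOnlyWords <- function(x, keep) {
  words <- unlist(strsplit(tolower(x), "[^a-z]+"))
  paste(words[words %in% keep], collapse = " ")
}

# Assumed list of terms to keep
customlist <- c("whale", "ship", "panic")

# content_transformer() wraps the helper so it edits each document's content
# while leaving the document metadata (including the ID) in place
corpus <- tm_map(corpus, content_transformer(keepOnlyWords), customlist)

dtm <- DocumentTermMatrix(corpus)
Docs(dtm)   # document IDs, taken from the doc_id column
Terms(dtm)  # only terms from customlist should remain
```

If filtering at the dtm stage turns out to be acceptable after all, DocumentTermMatrix(corpus, control = list(dictionary = customlist)) restricts the terms without a custom helper.
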
answered Oct 08 '22 10:10 by Koot6133