I've looked at the other similar questions that have been posted here (like this), but the problem persists.
I have a dataframe of textual data, which I need to stem. So I'm converting it into a corpus, stemming it, then completing the words from the stems, and then trying to get a dataframe of text as output.
myCorpus <- Corpus(VectorSource(textDf$text))
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
myCorpus <- tm_map(myCorpus, stemCompletion, dictionary=dictCorpus)
Now I'm trying to get a dataframe back from this corpus, so I've tried the following commands.
dataframe<-data.frame(text=unlist(sapply(myCorpus, '[', "content")),
stringsAsFactors=F)
and
dataframe<-data.frame(text=unlist(sapply(myCorpus, `[`)),
stringsAsFactors=F)
and also
dataframe <-
data.frame(id=sapply(corpus, meta, "id"),
text=unlist(lapply(sapply(corpus, '[', "content"),paste,collapse="\n")),
stringsAsFactors=FALSE)
from this link
All of them produce the following error:
Error in UseMethod("meta", x) :
no applicable method for 'meta' applied to an object of class "character"
Any help would be greatly appreciated.
This ought to do it:
data.frame(text = sapply(myCorpus, as.character), stringsAsFactors = FALSE)
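If you also want to keep the document ids next to the text, something along these lines may work too (a sketch on my part, not from the original answer; it assumes the corpus still exposes its document ids via names()):
# assumption: names(myCorpus) still returns the document ids
dataframe <- data.frame(doc_id = names(myCorpus),
                        text = sapply(myCorpus, as.character),
                        stringsAsFactors = FALSE)
str(dataframe)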
Edited with a working solution, using the built-in crude corpus as an example.
The problem here is that you cannot apply stemCompletion as a transformation: the set of available transformations,
getTransformations()
## [1] "removeNumbers"     "removePunctuation" "removeWords"       "stemDocument"      "stripWhitespace"
does not include stemCompletion, which takes a vector of stemmed tokens as input.
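To make that concrete (a minimal sketch of my own, with made-up stems): stemCompletion is meant to be called directly on a character vector of stems, together with a dictionary such as the dictCorpus built below.
# hypothetical stems, completed against the dictionary corpus defined below
stemCompletion(c("compani", "oper", "pric"), dictionary = dictCorpus)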
So this should do it: first extract the transformed texts and tokenise them, then complete the stems, and then paste the tokens back together. Here I have illustrated the solution using the built-in crude corpus.
library(tm)
library(SnowballC)  # stemDocument relies on the Snowball stemmer

data(crude)
myCorpus <- crude
myCorpus <- tm_map(myCorpus, removeWords, stopwords('english'))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
dictCorpus <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)
# tokenize the corpus
myCorpusTokenized <- lapply(myCorpus, scan_tokenizer)
# stem complete each token vector
myTokensStemCompleted <- lapply(myCorpusTokenized, stemCompletion, dictCorpus)
# concatenate tokens by document, create data frame
myDf <- data.frame(text = sapply(myTokensStemCompleted, paste, collapse = " "), stringsAsFactors = FALSE)
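As a quick sanity check (my addition, not part of the original answer), you can confirm that there is one row per document and peek at the first completed text:
# one row per document in crude, with the completed tokens pasted back together
dim(myDf)
substr(myDf$text[1], 1, 80)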
I've redone some of your earlier code with magrittr, just because.
library(dplyr)
library(magrittr)  # for use_series() and extract()
library(tm)
dictCorpus =
c("I love my cat", "Cullen bae is bae", "4ever alone :(") %>%
VectorSource %>%
Corpus %>%
tm_map(removeWords, stopwords('english')) %>%
tm_map(content_transformer(tolower)) %>%
tm_map(removePunctuation)
myCorpus =
dictCorpus %>%
tm_map(stemDocument) %>%
tm_map(stemCompletion, dictionary=dictCorpus)
data =
  data_frame(object =
               myCorpus %>%
               `class<-`("list") %>%
               use_series(content)) %>%
  rowwise %>%
  mutate(content =
           object %>%
           names %>%
           extract(1))
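If you then want the completed texts without the intermediate list column, something like this should work (my addition, assuming the pipeline above really does produce a content column of completed text):
# drop the list column, keep only the completed text
textOut =
  data %>%
  ungroup %>%
  select(content)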