I am trying to build a wordcloud of the jeopardy dataset found here: https://www.reddit.com/r/datasets/comments/1uyd0t/200000_jeopardy_questions_in_a_json_file/
My code is as follows:
library(tm)
library(SnowballC)
library(wordcloud)
jeopQ <- read.csv('JEOPARDY_CSV.csv', stringsAsFactors = FALSE)
jeopCorpus <- Corpus(VectorSource(jeopQ$Question))
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, removeWords, c('the', 'this', stopwords('english')))
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
The words 'the' and 'this' are still appearing in the wordcloud. Why is this happening and how can I fix it?
The problem lies in the fact that you didn't perform a lower case action. A lot of questions start with "The". The stopwords are all in lower case, e.g. "the" and "this". Since "The" != "the", "The" it is not removed from the corpus
If you use the code below it should work correctly:
jeopCorpus <- tm_map(jeopCorpus, content_transformer(tolower))
jeopCorpus <- tm_map(jeopCorpus, removeWords, stopwords('english'))
jeopCorpus <- tm_map(jeopCorpus, removePunctuation)
jeopCorpus <- tm_map(jeopCorpus, PlainTextDocument)
jeopCorpus <- tm_map(jeopCorpus, stemDocument)
wordcloud(jeopCorpus, max.words = 100, random.order = FALSE)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With