Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Creating "word" cloud of phrases, not individual words in R

Tags:

r

word-cloud

I am trying to make a word cloud from a list of phrases, many of which are repeated, instead of from individual words. My data looks something like this, with one column of my data frame being a list of phrases.

df$names <- c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul H C", "Paul H C")

I would like to make a word cloud where all of these names are treated as individual phrases whose frequency is displayed, not the words which make them up. The code I have been using looks like:

df.corpus <- Corpus(DataframeSource(data.frame(df$names)))
df.corpus <- tm_map(client.corpus, function(x) removeWords(x, stopwords("english")))
#turning that corpus into a tDM
tdm <- TermDocumentMatrix(df.corpus)
m <- as.matrix(tdm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
pal <- brewer.pal(9, "BuGn")
pal <- pal[-(1:2)]
#making a worcloud
png("wordcloud.png", width=1280,height=800)
wordcloud(d$word,d$freq, scale=c(8,.3),min.freq=2,max.words=100, random.order=T, rot.per=.15, colors="black", vfont=c("sans serif","plain"))
dev.off()

This creates a word cloud, but it is of each component word, not of the phrases. So, I see the relative frequency of "A". "H", "John" etc instead of the relative frequency of "Joseph A", "Mary A", etc, which is what I want.

I'm sure this isn't that complicated to fix, but I can't figure it out! I would appreciate any help.

like image 260
verybadatthis Avatar asked Nov 14 '14 20:11

verybadatthis


People also ask

Can I make a word cloud with phrases?

In the word cloud, select the word you wish to combine with other words (eg, “convenient”). Type in a word or phrase you wish to combine the word with (eg, type in “ease”), and press Enter. Repeat this process for all other words or phrases you wish to combine (eg, "easy"), until you have exhausted the synonyms.

How do I exclude words from wordcloud in R?

The Ignore list will appear to the right of the word cloud. You'll notice that by default Displayr automatically hides a list of common, uninteresting words (eg "in", "all", etc). To add words to the Ignore list, click on the word in the word cloud and drag it to the Ignore list.

How do I keep two words together in word cloud?

Use a tilde symbol ~ to keep two words together in the cloud. Otherwise, the two words will be scattered apart in the cloud.

Can you make a word cloud in a specific shape?

You can pick a shape, select colors and fonts, and control how WordClouds generates the cloud. Artists will especially appreciate this feature. You can draw your own image with a transparent background, then upload it to WordClouds.


2 Answers

Your difficulty is that each element of df$names is being treated as "document" by the functions of tm. For example, the document John A contains the words John and A. It sounds like you want to keep the names as is, and just count up their occurrence - you can just use table for that.

library(wordcloud)
df<-data.frame(theNames=c("John", "John", "Joseph A", "Mary A", "Mary A", "Paul H C", "Paul H C"))
tb<-table(df$theNames)
wordcloud(names(tb),as.numeric(tb), scale=c(8,.3),min.freq=1,max.words=100, random.order=T, rot.per=.15, colors="black", vfont=c("sans serif","plain"))

enter image description here

like image 188
keegan Avatar answered Sep 29 '22 00:09

keegan


Install RWeka and its dependencies, then try this:

library(RWeka)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# ... other tokenizers
tok <- BigramTokenizer
tdmgram <- TermDocumentMatrix(df.corpus, control = list(tokenize = tok))
#... create wordcloud

The tokenizer-line above chops your text into phrases of length 2.
More specifically, it creates phrases of minlength 2 and maxlength 2.
Using Weka's general NGramTokenizer Algorithm, You can create different tokenizers (e.g minlength 1, maxlength 2), and you'll probably want to experiment with different lengths. You can also call them tok1, tok2 instead of the verbose "BigramTokenizer" I've used above.

like image 30
knb Avatar answered Sep 29 '22 00:09

knb