I want to cluster similar words using R and the tidytext package. I have created my tokens and would now like to convert them to a matrix in order to cluster them. I would like to try out a number of tokenization techniques to see which produces the most compact clusters.
My code is as follows (taken from the docs of the widyr package). I just can't make the next step. Can anyone help?
library(janeaustenr)
library(dplyr)
library(tidytext)
library(widyr)

# Comparing Jane Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word, text)

# closest books to each other
closest <- austen_words %>%
  count(book, word) %>%   # word frequencies per book, used as the value column n
  pairwise_similarity(book, word, n) %>%
  arrange(desc(similarity))
I now want to create a clustering algorithm around closest. This code will get me there, but I don't know how to go from the previous section to the matrix m:

# m should be a document-term matrix built from the tokens above
d <- dist(m)
kfit <- kmeans(d, 4, nstart = 100)
You can create an appropriate matrix for this by casting from a tidy format with tidytext. There are several cast_*() functions, such as cast_sparse().
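For instance, applied to the data in the question, a minimal sketch might look like this (with the question's libraries loaded and reusing austen_words from above; using book as the row identifier is just for illustration):

# count words per book, then cast the counts to a sparse document-term matrix
m <- austen_words %>%
  count(book, word) %>%
  cast_sparse(book, word, n)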
Let's use four example books, and cluster the chapters within the books:
library(tidyverse)
library(tidytext)
library(gutenbergr)
my_mirror <- "http://mirrors.xmission.com/gutenberg/"
books <- gutenberg_download(c(36, 158, 164, 345),
                            meta_fields = "title",
                            mirror = my_mirror)

books %>%
  count(title)
#> # A tibble: 4 x 2
#> title n
#> * <chr> <int>
#> 1 Dracula 15568
#> 2 Emma 16235
#> 3 The War of the Worlds 6474
#> 4 Twenty Thousand Leagues under the Sea 12135
# break apart the chapters
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ",
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)
glimpse(by_chapter)
#> Rows: 50,315
#> Columns: 3
#> $ gutenberg_id <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, …
#> $ text <chr> "CHAPTER ONE", "", "THE EVE OF THE WAR", "", "", "No one…
#> $ document <chr> "The War of the Worlds_1", "The War of the Worlds_1", "T…
words_sparse <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"
class(words_sparse)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(words_sparse)
#> [1] 182 18124
The words_sparse object is a sparse matrix created via cast_sparse(). You can learn more about converting back and forth between tidy and non-tidy formats for text in the "Converting to and from non-tidy formats" chapter of Text Mining with R.
Now that you have your matrix of word counts (i.e. a document-term matrix, which you could consider weighting by tf-idf instead of raw counts; see the sketch after the results below) you can use kmeans(). How many chapters from each book were clustered together?
kfit <- kmeans(words_sparse, centers = 4)

enframe(kfit$cluster, value = "cluster") %>%
  separate(name, into = c("title", "chapter"), sep = "_") %>%
  count(title, cluster) %>%
  arrange(cluster)
#> # A tibble: 8 x 3
#> title cluster n
#> <chr> <int> <int>
#> 1 Dracula 1 26
#> 2 The War of the Worlds 1 1
#> 3 Dracula 2 28
#> 4 Emma 2 9
#> 5 The War of the Worlds 2 26
#> 6 Twenty Thousand Leagues under the Sea 2 9
#> 7 Twenty Thousand Leagues under the Sea 3 37
#> 8 Emma 4 46
Created on 2021-02-04 by the reprex package (v1.0.0)
One cluster is all Emma, one is all Twenty Thousand Leagues under the Sea, one is almost entirely Dracula, and one has chapters from all four books.
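Since the question mentions trying different approaches to find the most compact clusters, here is a minimal sketch of one comparison, reusing by_chapter and kfit from above: weight by tf-idf before casting, then look at the total within-cluster sum of squares. (This assumes tot.withinss is an acceptable compactness measure; note that totals are not directly comparable across differently scaled matrices, so rescale or use a shared criterion if you need a strict comparison.)

# same pipeline as above, but weighted by tf-idf instead of raw counts
words_tfidf <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart"), by = "word") %>%
  count(document, word, sort = TRUE) %>%
  bind_tf_idf(word, document, n) %>%
  cast_sparse(document, word, tf_idf)

kfit_tfidf <- kmeans(words_tfidf, centers = 4)

# lower total within-cluster sum of squares means tighter clusters
kfit$tot.withinss
kfit_tfidf$tot.withinss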