 

TidyText Clustering

I want to cluster similar words using R and the tidytext package. I have created my tokens and would now like to convert them to a matrix in order to cluster them. I would like to try out a number of tokenization techniques to see which produces the most compact clusters.

My code is as follows (adapted from the widyr package docs). I just can't work out the next step. Can anyone help?

library(janeaustenr)
library(dplyr)
library(tidytext)
library(widyr)

# Comparing Jane Austen novels
austen_words <- austen_books() %>%
  unnest_tokens(word, text) %>%
  count(book, word, sort = TRUE)

# closest books to each other
closest <- austen_words %>%
  pairwise_similarity(book, word, n) %>%
  arrange(desc(similarity))

I want to create a clustering algorithm around closest. I know code like the following will get me there, but I don't know how to get from the previous step to the matrix m.

# m should be the matrix built from the tokens above
d <- dist(m)
kfit <- kmeans(d, 4, nstart = 100)
Asked by John Smith on Feb 03 '21

People also ask

How does tidytext work?

For tidy text mining, the token that is stored in each row is most often a single word, but can also be an n-gram, sentence, or paragraph. In the tidytext package, we provide functionality to tokenize by commonly used units of text like these and convert to a one-term-per-row format.
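As a minimal sketch of those tokenization units (using the same janeaustenr data as the question):

library(dplyr)
library(tidytext)
library(janeaustenr)

# one word per row (the default token)
austen_books() %>%
  unnest_tokens(word, text)

# bigrams (two-word n-grams) per row instead
austen_books() %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

# one sentence per row
austen_books() %>%
  unnest_tokens(sentence, text, token = "sentences")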

What is Tidytext?

From "The Life-Changing Magic of Tidying Text": in this package, we provide functions and supporting data sets to allow conversion of text to and from tidy formats, and to switch seamlessly between tidy tools and existing text mining packages.
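A sketch of that switching between formats (this assumes the topicmodels package is installed, since its AssociatedPress data ships as a tm DocumentTermMatrix):

library(dplyr)
library(tidytext)
data("AssociatedPress", package = "topicmodels")

# non-tidy DocumentTermMatrix -> one-term-per-row tibble
ap_tidy <- tidy(AssociatedPress)

# ...and back again to a DocumentTermMatrix
ap_tidy %>%
  cast_dtm(document, term, count)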

Why do we use anti join on stop words?

To understand why this works, we'll first view the stop_words object to see that it contains a variable called word , listing stop words from a number of different lexicons. So we can anti_join to our data frame on this column to filter out any rows that match these words.
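A minimal sketch of that pattern, again using the janeaustenr data from the question:

library(dplyr)
library(tidytext)
library(janeaustenr)

austen_books() %>%
  unnest_tokens(word, text) %>%
  # drop any row whose word matches a stop word in any lexicon
  anti_join(stop_words, by = "word")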


1 Answer

You can create an appropriate matrix for this by casting from a tidy data frame. tidytext provides several cast_() functions, such as cast_sparse().

Let's use four example books, and cluster the chapters within the books:

library(tidyverse)
library(tidytext)
library(gutenbergr)
my_mirror <- "http://mirrors.xmission.com/gutenberg/"

books <- gutenberg_download(c(36, 158, 164, 345),
                            meta_fields = "title",
                            mirror = my_mirror)

books %>%
  count(title)
#> # A tibble: 4 x 2
#>   title                                     n
#> * <chr>                                 <int>
#> 1 Dracula                               15568
#> 2 Emma                                  16235
#> 3 The War of the Worlds                  6474
#> 4 Twenty Thousand Leagues under the Sea 12135

# break apart the chapters
by_chapter <- books %>%
  group_by(title) %>%
  mutate(chapter = cumsum(str_detect(text, regex("^chapter ", 
                                                 ignore_case = TRUE)))) %>%
  ungroup() %>%
  filter(chapter > 0) %>%
  unite(document, title, chapter)

glimpse(by_chapter)
#> Rows: 50,315
#> Columns: 3
#> $ gutenberg_id <int> 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, …
#> $ text         <chr> "CHAPTER ONE", "", "THE EVE OF THE WAR", "", "", "No one…
#> $ document     <chr> "The War of the Worlds_1", "The War of the Worlds_1", "T…

words_sparse <- by_chapter %>%
  unnest_tokens(word, text) %>% 
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  cast_sparse(document, word, n)
#> Joining, by = "word"

class(words_sparse)
#> [1] "dgCMatrix"
#> attr(,"package")
#> [1] "Matrix"
dim(words_sparse)
#> [1]   182 18124

The words_sparse object is a sparse matrix created via cast_sparse(). You can learn more about converting back and forth between tidy and non-tidy text formats in the "Converting to and from non-tidy formats" chapter of Text Mining with R.
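For completeness, the same counts can be cast to the matrix classes that other text mining packages expect; a sketch (cast_dtm() requires tm to be installed, cast_dfm() requires quanteda):

chapter_counts <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE)

chapter_counts %>%
  cast_dtm(document, word, n)   # tm's DocumentTermMatrix

chapter_counts %>%
  cast_dfm(document, word, n)   # quanteda's dfm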

Now that you have your matrix of word counts (i.e. a document-term matrix, which you could consider weighting by tf-idf instead of raw counts), you can use kmeans(). How many chapters from each book were clustered together?

kfit <- kmeans(words_sparse, centers = 4)

enframe(kfit$cluster, value = "cluster") %>%
  separate(name, into = c("title", "chapter"), sep = "_") %>%
  count(title, cluster) %>%
  arrange(cluster)
#> # A tibble: 8 x 3
#>   title                                 cluster     n
#>   <chr>                                   <int> <int>
#> 1 Dracula                                     1    26
#> 2 The War of the Worlds                       1     1
#> 3 Dracula                                     2    28
#> 4 Emma                                        2     9
#> 5 The War of the Worlds                       2    26
#> 6 Twenty Thousand Leagues under the Sea       2     9
#> 7 Twenty Thousand Leagues under the Sea       3    37
#> 8 Emma                                        4    46

Created on 2021-02-04 by the reprex package (v1.0.0)

One cluster is all Emma, one cluster is all Twenty Thousand Leagues under the Sea, one is almost entirely Dracula, and one has chapters from all four books.
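If you wanted to try the tf-idf weighting mentioned above, a sketch of that variation (reusing by_chapter from the reprex) would be:

words_tfidf <- by_chapter %>%
  unnest_tokens(word, text) %>%
  anti_join(get_stopwords(source = "smart")) %>%
  count(document, word, sort = TRUE) %>%
  bind_tf_idf(word, document, n) %>%   # adds tf, idf, and tf_idf columns
  cast_sparse(document, word, tf_idf)  # cast the tf-idf weights, not n

kfit_tfidf <- kmeans(words_tfidf, centers = 4)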

Answered by Julia Silge on Oct 20 '22