Opposite of unnest_tokens

Tags: r, tidyr, tidyverse, tidytext

This is most likely a stupid question, but I've googled and googled and can't find a solution. I think it's because I don't know the right way to word my question to search.

I have a data frame that I have converted to tidy text format in R to get rid of stop words. I would now like to 'untidy' that data frame back to its original format.

What's the opposite / inverse command of unnest_tokens?

Edit: here is what the data I'm working with look like. I'm trying to replicate analyses from Silge and Robinson's Tidy Text book but using Italian opera librettos.

character = c("FIGARO", "SUSANNA", "CONTE", "CHERUBINO") 
line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre", "Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.", "Susanna, mi sembri agitata e confusa.", "Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!") 
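# stringsAsFactors = FALSE keeps `line` as character (the default before R 4.0 was factors)
sample_df = data.frame(character, line, stringsAsFactors = FALSE)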
sample_df

character line
FIGARO    Cinque... dieci.... venti... trenta... trentasei...quarantatre
SUSANNA   Ora sì ch'io son contenta; sembra fatto inver per me. Guarda un po', mio caro Figaro, guarda adesso il mio cappello.
CONTE     Susanna, mi sembri agitata e confusa.
CHERUBINO Il Conte ieri perché trovommi sol con Barbarina, il congedo mi diede; e se la Contessina, la mia bella comare, grazia non m'intercede, io vado via, io non ti vedo più, Susanna mia!

I turn it into tidy text so I can get rid of stop words:

library(dplyr)
library(tidytext)

tribble <- sample_df %>%
           unnest_tokens(word, line)
# Get rid of stop words
# I had to make my own list of stop words for 18th century Italian opera
# (mystopwords is my own character vector of stop words, not shown here)
itstopwords <- tibble(word = mystopwords)
tribble2 <- tribble %>%
            anti_join(itstopwords, by = "word")

Now I have something like this:

character word
FIGARO    cinque
FIGARO    dieci
FIGARO    venti
FIGARO    trenta
...

I would like to get it back into the format of character name and the associated line to look at other things. Basically I would like the text in the same format it was before, but with stop words removed.

asked Oct 13 '17 16:10

Kate

1 Answer

Not a stupid question! The answer depends a bit on exactly what you are trying to do, but here is my typical approach when I want to get text back to its original form after some processing in its tidied form, using group_by() and summarize() from dplyr.

First, let's go from raw text to a tidied format.

library(tidyverse)
library(tidytext)

tidy_austen <- janeaustenr::austen_books() %>%
    group_by(book) %>%
    mutate(linenumber = row_number()) %>%
    ungroup() %>%
    unnest_tokens(word, text)

tidy_austen
#> # A tibble: 725,055 x 3
#>    book                linenumber word       
#>    <fct>                    <int> <chr>      
#>  1 Sense & Sensibility          1 sense      
#>  2 Sense & Sensibility          1 and        
#>  3 Sense & Sensibility          1 sensibility
#>  4 Sense & Sensibility          3 by         
#>  5 Sense & Sensibility          3 jane       
#>  6 Sense & Sensibility          3 austen     
#>  7 Sense & Sensibility          5 1811       
#>  8 Sense & Sensibility         10 chapter    
#>  9 Sense & Sensibility         10 1          
#> 10 Sense & Sensibility         13 the        
#> # … with 725,045 more rows

The text is tidy now! But we can untidy it, back to something sort of like its original form (tokenizing lowercased the words and stripped the punctuation, and those are not recovered). I typically approach this using group_by() and summarize() from dplyr, and str_c() from stringr. What does the text look like at the end, in this particular case?

tidy_austen %>% 
    group_by(book, linenumber) %>% 
    summarize(text = str_c(word, collapse = " ")) %>%
    ungroup()
#> # A tibble: 62,272 x 3
#>    book            linenumber text                                         
#>    <fct>                <int> <chr>                                        
#>  1 Sense & Sensib…          1 sense and sensibility                        
#>  2 Sense & Sensib…          3 by jane austen                               
#>  3 Sense & Sensib…          5 1811                                         
#>  4 Sense & Sensib…         10 chapter 1                                    
#>  5 Sense & Sensib…         13 the family of dashwood had long been settled…
#>  6 Sense & Sensib…         14 was large and their residence was at norland…
#>  7 Sense & Sensib…         15 their property where for many generations th…
#>  8 Sense & Sensib…         16 respectable a manner as to engage the genera…
#>  9 Sense & Sensib…         17 surrounding acquaintance the late owner of t…
#> 10 Sense & Sensib…         18 man who lived to a very advanced age and who…
#> # … with 62,262 more rows

Created on 2019-07-11 by the reprex package (v0.3.0)
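
Applied to data shaped like the question's, the same idea might look like the sketch below. It is only an illustration under a few assumptions: the two-word stop word list is hypothetical (standing in for the asker's own list), and `line_id` is a helper column introduced here so that a character with several lines does not have them all merged into one string.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Two rows taken from the question's sample data
sample_df <- tibble(
    character = c("FIGARO", "CONTE"),
    line = c("Cinque... dieci.... venti... trenta... trentasei...quarantatre",
             "Susanna, mi sembri agitata e confusa.")
)

# Hypothetical stop word list, for illustration only
itstopwords <- tibble(word = c("mi", "e"))

untidied <- sample_df %>%
    # line_id (an assumed helper column) records which line each word came from
    mutate(line_id = row_number()) %>%
    unnest_tokens(word, line) %>%
    anti_join(itstopwords, by = "word") %>%
    # Paste the remaining words back together, one row per original line
    group_by(line_id, character) %>%
    summarize(line = str_c(word, collapse = " ")) %>%
    ungroup()

untidied
```

A line made up entirely of stop words will have no rows left after the anti_join(), so it drops out of the result. If the original capitalization matters, unnest_tokens() accepts to_lower = FALSE, though the stop word list then has to match on case as well.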

answered Oct 01 '22 06:10

Julia Silge
