How do I combine text names within an ordered transcription of dialogue?

Say I have this data:

df <- data.frame(x = c("Tom: I like cheese.", 
                       "Tom: Cheese is good.", 
                       "Tom: Muenster is my favorite.", 
                       "Bob: No, I like Cheddar.", 
                       "Tom: You're wrong. I think cheddar is only good on burgers.", 
                       "Gina: But what about American on burgers?", 
                       "Gina: That's better.", 
                       "Bob: Yeah, I agree with Gina.", 
                       "Bob: American is better on burgers. Cheddar is for grating on nachos."))

I want to turn it into this data:

df <- data.frame(x = c("Tom: I like cheese. Cheese is good. Muenster is my favorite.", 
                       "Bob: No, I like Cheddar.", 
                       "Tom: You're wrong. I think cheddar is only good on burgers.", 
                       "Gina: But what about American on burgers? That's better.", 
                       "Bob: Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on nachos."))

Basically, whenever a line has the same speaker as the line immediately before it, I want to drop the name and colon and append the text to that previous line.

I am struggling to do this without grouping every "Tom:" line and every "Gina:" line together and keeping only the first instance of each name. When a name comes up again later, after someone else has spoken, it should start a new group.

Tom asked Sep 01 '25 15:09



2 Answers

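We can use tidyr to split each line into the speaker and what they say, then use dplyr to combine runs of the same speaker. consecutive_id() (dplyr 1.1.0 or later) gives each run of adjacent lines by the same speaker its own ID, so a later "Tom:" starts a new group instead of joining the first one. For example:

library(dplyr)

df |> 
  tidyr::separate_wider_delim(x, ": ", names = c("speaker", "words")) |>
  mutate(instance = consecutive_id(speaker)) |>
  summarize(speaker = first(speaker), text = paste(words, collapse = " "), .by = instance)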

returns

  instance speaker text                                                                                 
     <int> <chr>   <chr>                                                                                
1        1 Tom     I like cheese. Cheese is good. Muenster is my favorite.                              
2        2 Bob     No, I like Cheddar.                                                                  
3        3 Tom     You're wrong. I think cheddar is only good on burgers.                               
4        4 Gina    But what about American on burgers? That's better.                                   
5        5 Bob     Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on na…
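
To see why this does not lump all of Tom's lines together, here is a minimal base R sketch of the run numbering that consecutive_id() produces. The cumsum() comparison is an equivalent shown for illustration only, not a claim about how dplyr implements it, and the names speaker and run are made up for the example:

```r
# Speakers in the order they appear in the question's data
speaker <- c("Tom", "Tom", "Tom", "Bob", "Tom", "Gina", "Gina", "Bob", "Bob")

# TRUE wherever the speaker differs from the previous line (the first line always starts a run)
starts_run <- c(TRUE, speaker[-1] != speaker[-length(speaker)])

# Running count of run starts = one ID per block of adjacent identical speakers
run <- cumsum(starts_run)
run
# [1] 1 1 1 2 3 4 4 5 5
```

Tom's second turn gets ID 3 rather than 1, which is what keeps it separate from his opening lines.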
MrFlick answered Sep 04 '25 06:09



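Using data.table, split on ": ", group by rleid (the run-length ID of the speaker), then paste the text back together per group:

library(data.table)
setDT(df)  # df from the question is a data.frame; convert it by reference

df[, c("name", "text") := tstrsplit(x, ": ", fixed = TRUE) 
   ][, .(text = paste(text, collapse = " ")), by = .(name, rleid(name))
     ][, -2]  # drop the helper rleid column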

#      name                                                                                  text
#    <char>                                                                                 <char>
# 1:    Tom                                 I like cheese. Cheese is good. Muenster is my favorite.
# 2:    Bob                                                                     No, I like Cheddar.
# 3:    Tom                                   You're wrong. I think cheddar is only good on burgers.
# 4:   Gina                                       But what about American on burgers? That's better.
# 5:    Bob Yeah, I agree with Gina. American is better on burgers. Cheddar is for grating on nachos.
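
If the single "Name: text" column from the question is needed rather than separate columns, a package-free sketch along the same lines could look like the following. It assumes every line contains ": " after the speaker's name and splits only at the first occurrence; the helper names (lines, pos, speaker, words, run, merged, result) are made up for this example:

```r
df <- data.frame(x = c("Tom: I like cheese.",
                       "Tom: Cheese is good.",
                       "Tom: Muenster is my favorite.",
                       "Bob: No, I like Cheddar.",
                       "Tom: You're wrong. I think cheddar is only good on burgers.",
                       "Gina: But what about American on burgers?",
                       "Gina: That's better.",
                       "Bob: Yeah, I agree with Gina.",
                       "Bob: American is better on burgers. Cheddar is for grating on nachos."))

lines <- as.character(df$x)

# Position of the first ": " in each line; text before it is the speaker
pos     <- regexpr(": ", lines, fixed = TRUE)
speaker <- substr(lines, 1, pos - 1)
words   <- substr(lines, pos + 2, nchar(lines))

# One ID per run of adjacent lines by the same speaker
run <- cumsum(c(TRUE, speaker[-1] != speaker[-length(speaker)]))

# Paste the words of each run together, then put the speaker back in front
merged <- vapply(split(words, run), paste, character(1), collapse = " ")
result <- data.frame(x = paste0(speaker[!duplicated(run)], ": ", unname(merged)),
                     stringsAsFactors = FALSE)
result
```

On the example data this gives the five rows asked for in the question.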
zx8754 answered Sep 04 '25 04:09
