Map strings in one vector, which are substrings of longer strings in another vector, to the same substrings in the longer strings

Tags: r, dplyr, purrr

In my data, the strings in one column (TCU_clean) are substrings of the strings in another column (TCU_clean_collpsd). In a third column (w_c7_collpsd), which is the key target column, the same words appear but each is extended by a PoS tag. For example, the first word in TCU_clean and TCU_clean_collpsd is like; in w_c7_collpsd it is given as like_VV0.

My goal is to trim w_c7_collpsd (and also TCU_clean_collpsd, but that's of secondary importance) in such a way that the words in w_c7_collpsd map exactly onto the words in TCU_clean.

This is the desired output:

     Sequ   TCU_clean                   w_c7_collpsd                                        TCU_clean_collpsd                          
    <dbl>   <chr>                       <chr>                                               <chr>                                      
1     1     like I do n't understand    like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI       like I do n't understand
2     1     sorry                       sorry_JJ                                            sorry 
3     1     like how old 's your mom    like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1    like how old 's your mom
4     2     when was that               when_RRQ was_VBDZ that_DD1                          when was that 

Any help with this task is appreciated. Solutions from the tidyverse are preferred, but others are welcome too.

Reproducible data:

df <- structure(list(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
    "when_RRQ was_VBDZ that_DD1"),
  TCU_clean_collpsd = c(
    "like I do n't understand sorry like how old 's your mom",
    "like I do n't understand sorry like how old 's your mom",
    "like I do n't understand sorry like how old 's your mom",
    "when was that")
), row.names = c(NA, -4L), class = c("tbl_df", "tbl", "data.frame"))
Chris Ruehlemann asked Jan 18 '26 08:01


1 Answer

One approach is to count the number of words (n) in each TCU_clean entry and, within each Sequ group, take the next n words from w_c7_collpsd, consuming the collapsed string from left to right:

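trim_func <- function(clean, collpsd) {
  # Every row in a group holds the same collapsed string, so use the first
  fulltext <- strsplit(collpsd[1], " ")[[1]]
  output <- character(length(clean))
  pos <- 0  # number of tagged words already handed out
  for (i in seq_along(clean)) {
    # Word count of the i-th clean string
    n <- length(strsplit(clean[i], " ")[[1]])
    # Take the next n tagged words; seq_len() is safe when n is 0
    output[i] <- paste(fulltext[pos + seq_len(n)], collapse = " ")
    pos <- pos + n
  }
  output
}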

library(dplyr)  # the .by argument of mutate() requires dplyr >= 1.1.0

df %>%
  mutate(w_c7_collpsd = trim_func(TCU_clean, w_c7_collpsd), .by = Sequ)

# A tibble: 4 × 4
   Sequ TCU_clean                w_c7_collpsd                                     TCU_clean_collpsd 
  <dbl> <chr>                    <chr>                                            <chr>             
1     1 like I do n't understand like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI    like I do n't und…
2     1 sorry                    sorry_JJ                                         like I do n't und…
3     1 like how old 's your mom like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1 like I do n't und…
4     2 when was that            when_RRQ was_VBDZ that_DD1                       when was that  
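
The output above still shows the untrimmed TCU_clean_collpsd. As a minimal sketch, assuming both collapsed columns are split on single spaces and contain exactly as many words per group as the TCU_clean entries combined, the same chunking idea can be applied to both columns in base R. The helper name trim_to_clean is hypothetical, and the final check assumes that the words themselves contain no underscore.

```r
# Data with the same structure as in the question
long_tagged <- paste(
  "like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ",
  "like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1"
)
long_clean <- "like I do n't understand sorry like how old 's your mom"
df <- data.frame(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(rep(long_tagged, 3), "when_RRQ was_VBDZ that_DD1"),
  TCU_clean_collpsd = c(rep(long_clean, 3), "when was that"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cut the collapsed string of one group into
# consecutive chunks, one per row, sized by the word counts in `clean`.
trim_to_clean <- function(clean, collpsd) {
  words <- strsplit(collpsd[1], " ", fixed = TRUE)[[1]]
  n <- lengths(strsplit(clean, " ", fixed = TRUE))
  # Row index for every word, e.g. 1 1 1 1 1 2 3 3 3 3 3 3
  row_id <- factor(rep(seq_along(clean), times = n),
                   levels = seq_along(clean))
  # Fail loudly if the word counts do not line up (assumption stated above)
  stopifnot(length(words) == length(row_id))
  unname(vapply(split(words, row_id), paste, character(1), collapse = " "))
}

# Apply the helper per Sequ group to both collapsed columns
for (col in c("w_c7_collpsd", "TCU_clean_collpsd")) {
  df[[col]] <- unsplit(
    Map(trim_to_clean,
        split(df$TCU_clean, df$Sequ),
        split(df[[col]], df$Sequ)),
    df$Sequ
  )
}

# Sanity check: stripping the tags should give back TCU_clean
tags_removed <- gsub("_\\S+", "", df$w_c7_collpsd)
all_match <- identical(tags_removed, df$TCU_clean)
```

The stopifnot() inside the helper is a deliberate choice: if a group's collapsed string has more or fewer words than its TCU_clean entries, the positional mapping would silently drift, so the sketch stops instead.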
one answered Jan 20 '26 22:01
