In my data, the strings in one column (TCU_clean) are substrings in another column (TCU_clean_collpsd); in a third column (w_c7_collpsd), which is the key target column, the same substrings appear but are extended by PoS-tags. For example, the first word in TCU_clean and TCU_clean_collpsd is like; in w_c7_collpsd it is given as like_VV0.
My goal is to trim w_c7_collpsd (and also TCU_clean_collpsd, but that's of secondary importance) in such a way that the words in w_c7_collpsd map exactly onto the words in TCU_clean.
This is the desired output:
   Sequ TCU_clean                w_c7_collpsd                                     TCU_clean_collpsd
  <dbl> <chr>                    <chr>                                            <chr>
1     1 like I do n't understand like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI    like I do n't understand
2     1 sorry                    sorry_JJ                                         sorry
3     1 like how old 's your mom like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1 like how old 's your mom
4     2 when was that            when_RRQ was_VBDZ that_DD1                       when was that
Any help with this task is appreciated. Solutions from the tidyverse are preferred, but others are welcome too.
Reproducible data:
df <- structure(list(Sequ = c(1, 1, 1, 2), TCU_clean = c("like I do n't understand",
"sorry", "like how old 's your mom", "when was that"), w_c7_collpsd = c("like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
"like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
"like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1",
"when_RRQ was_VBDZ that_DD1"), TCU_clean_collpsd = c("like I do n't understand sorry like how old 's your mom",
"like I do n't understand sorry like how old 's your mom", "like I do n't understand sorry like how old 's your mom",
"when was that")), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"))
One approach is to count the number of words (n) in TCU_clean and sequentially extract the first n words in w_c7_collpsd:
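library(dplyr)  # the .by argument of mutate() needs dplyr >= 1.1.0

trim_func <- function(clean, collpsd) {
  # Every row of a group carries the same collapsed string, so split it once
  fulltext <- strsplit(collpsd[1], " ")[[1]]
  output <- c()
  for (sentence in clean) {
    n_words <- length(strsplit(sentence, " ")[[1]])
    # Take the next n_words tokens for this sentence
    output <- c(output, paste(head(fulltext, n_words), collapse = " "))
    # Drop the consumed tokens (also safe when n_words is 0)
    fulltext <- fulltext[seq_along(fulltext) > n_words]
  }
  return(output)
}

df %>%
  mutate(w_c7_collpsd = trim_func(TCU_clean, w_c7_collpsd), .by = Sequ)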
# A tibble: 4 × 4
   Sequ TCU_clean                w_c7_collpsd                                     TCU_clean_collpsd
  <dbl> <chr>                    <chr>                                            <chr>
1     1 like I do n't understand like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI    like I do n't und…
2     1 sorry                    sorry_JJ                                         like I do n't und…
3     1 like how old 's your mom like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1 like I do n't und…
4     2 when was that            when_RRQ was_VBDZ that_DD1                       when was that
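Note that the output above still shows TCU_clean_collpsd untrimmed. The same word-counting idea can be applied to that column as well. Below is a minimal base R sketch, not a definitive implementation: chunk_by_word_count and toy are hypothetical names introduced here for illustration, and the code assumes that words are separated by single spaces and that each group's collapsed string contains at least as many words as its TCU_clean rows combined. The final check strips the PoS tags, assuming that the words themselves contain no underscore.

```r
# Hypothetical helper: cut the collapsed string into consecutive chunks
# whose word counts match the entries of `clean`.
chunk_by_word_count <- function(clean, collpsd) {
  tokens  <- strsplit(collpsd[1], " ", fixed = TRUE)[[1]]
  n_words <- lengths(strsplit(clean, " ", fixed = TRUE))
  # Assumption: the collapsed string holds enough words for all sentences
  stopifnot(sum(n_words) <= length(tokens))
  tokens <- head(tokens, sum(n_words))
  # Chunk id per token: 1 repeated n_words[1] times, 2 repeated n_words[2] times, ...
  chunk <- factor(rep(seq_along(n_words), times = n_words),
                  levels = seq_along(n_words))
  vapply(split(tokens, chunk), paste, character(1),
         collapse = " ", USE.NAMES = FALSE)
}

# Toy data mirroring the structure of the question's data
tagged <- paste("like_VV0 I_PPIS1 do_VD0 n't_XX understand_VVI sorry_JJ",
                "like_II how_RGQ old_JJ 's_VBZ your_APPGE mom_NN1")
plain  <- "like I do n't understand sorry like how old 's your mom"
toy <- data.frame(
  Sequ = c(1, 1, 1, 2),
  TCU_clean = c("like I do n't understand", "sorry",
                "like how old 's your mom", "when was that"),
  w_c7_collpsd = c(rep(tagged, 3), "when_RRQ was_VBDZ that_DD1"),
  TCU_clean_collpsd = c(rep(plain, 3), "when was that"),
  stringsAsFactors = FALSE
)

# Trim both collapsed columns, one Sequ group at a time
for (rows in split(seq_len(nrow(toy)), toy$Sequ)) {
  for (col in c("w_c7_collpsd", "TCU_clean_collpsd")) {
    toy[rows, col] <- chunk_by_word_count(toy$TCU_clean[rows], toy[rows, col])
  }
}

# Sanity check: with the tags removed, the words should equal TCU_clean
words_only <- gsub("_[^ ]+", "", toy$w_c7_collpsd)
print(identical(words_only, toy$TCU_clean))
```

With dplyr 1.1.0 or later, the same helper should also work inside the pipeline, for example mutate(across(c(w_c7_collpsd, TCU_clean_collpsd), \(x) chunk_by_word_count(TCU_clean, x)), .by = Sequ), which keeps the solution in the tidyverse style the question asks for.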