
How to replace substring values by matching reference values

Tags: regex, r, match

I have phonetic transcriptions of utterances:

str <- c("aɪ nəʊ ɪts ɪts ðə sɪksθ əv ʤuːn",
       "wɛl ðə ʧæp nɛkst dɔːz ˈfaɪndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ",
       "lʌvli bu(ː)ˈkeɪ əv ˈflaʊəz fə mi wɛl ðæts ɪt",
       "ðeə raɪt ləʊ ɪn ðə liːg ɑːnt ðeɪ",
       "kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːlɪə naʊ",
       "aɪ nəʊ s ðə biː ðə bɪg bɔɪ ðeɪl",
       "jeə bət ɪt s ə məʊl aɪ kən əˈʃʊə juː",
       "ɑː ʤəst eə haʊ aɪ juːzd tə dʊ jɪəz əˈgəʊ",
       "jeə dəʊnt ˈwʌri əˈbaʊt mi æn aɪm ɔːlˈraɪt")

I want to replace all diphthongs with numbers; the diphthongs and their matching replacement numbers are stored in a reference dataframe:

ref <- data.frame(
  diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
  replacement = 1:8
)

I can replace each diphthong individually using gsub, store the result in a new vector, replace the next diphthong in that new vector, and so on:

a <- gsub("ɪə", "1", str)
b <- gsub("eɪ", "2", a)
c <- gsub("ʊə", "3", b)
d <- gsub("ɔɪ", "4", c)
e <- gsub("aɪ", "5", d)
f <- gsub("eə", "6", e)
g <- gsub("aʊ", "7", f)
h <- gsub("əʊ", "8", g)

While this gets me the desired result (see below), this method is repetitive and far from elegant. How can I achieve the replacements in one go?

Expected result:

[1] "5 n8 ɪts ɪts ðə sɪksθ əv ʤuːn"                    "wɛl ðə ʧæp nɛkst dɔːz ˈf5ndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ"
[3] "lʌvli bu(ː)ˈk2 əv ˈfla3z fə mi wɛl ðæts ɪt"       "ð6 r5t l8 ɪn ðə liːg ɑːnt ð2"                    
[5] "kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːl1 n7"           "5 n8 s ðə biː ðə bɪg b4 ð2l"                     
[7] "j6 bət ɪt s ə m8l 5 kən əˈʃ3 juː"                 "ɑː ʤəst 6 h7 5 juːzd tə dʊ j1z əˈg8"             
[9] "j6 d8nt ˈwʌri əˈb7t mi æn 5m ɔːlˈr5t"

Chris Ruehlemann asked Oct 21 '20 14:10


3 Answers

You may create a regex out of the diphthong data to match each separate diphthong and use a single pass over the data replacing each match with the corresponding value from the replacement column:

library(stringr)
str <- c("aɪ nəʊ ɪts ɪts ðə sɪksθ əv ʤuːn",
        "wɛl ðə ʧæp nɛkst dɔːz ˈfaɪndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ",
        "lʌvli bu(ː)ˈkeɪ əv ˈflaʊəz fə mi wɛl ðæts ɪt",
        "ðeə raɪt ləʊ ɪn ðə liːg ɑːnt ðeɪ",
        "kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːlɪə naʊ",
        "aɪ nəʊ s ðə biː ðə bɪg bɔɪ ðeɪl",
        "jeə bət ɪt s ə məʊl aɪ kən əˈʃʊə juː",
        "ɑː ʤəst eə haʊ aɪ juːzd tə dʊ jɪəz əˈgəʊ",
        "jeə dəʊnt ˈwʌri əˈbaʊt mi æn aɪm ɔːlˈraɪt")
 
ref <- data.frame(
   diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
   replacement = 1:8
)
# Build a single alternation pattern from the diphthong column
pat <- paste(ref$diphthong, collapse="|")
# Look up each match in ref. match() is vectorised, so this works whether the
# function receives one match or several; as.character() ensures the function
# returns a character vector.
str_replace_all(str, pat, function(x) as.character(ref$replacement[match(x, ref$diphthong)]))

Output:

[1] "5 n8 ɪts ɪts ðə sɪksθ əv ʤuːn"
[2] "wɛl ðə ʧæp nɛkst dɔːz ˈf5ndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ"
[3] "lʌvli bu(ː)ˈk2 əv ˈfl7əz fə mi wɛl ðæts ɪt"
[4] "ð6 r5t l8 ɪn ðə liːg ɑːnt ð2"
[5] "kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːl1 n7"
[6] "5 n8 s ðə biː ðə bɪg b4 ð2l"
[7] "j6 bət ɪt s ə m8l 5 kən əˈʃ3 juː"
[8] "ɑː ʤəst 6 h7 5 juːzd tə dʊ j1z əˈg8"
[9] "j6 d8nt ˈwʌri əˈb7t mi æn 5m ɔːlˈr5t"

In this case, the regex is built using paste(ref$diphthong, collapse="|") and is just an alternation-based pattern, ɪə|eɪ|ʊə|ɔɪ|aɪ|eə|aʊ|əʊ. The function passed as the replacement looks up each matched diphthong in ref$diphthong and returns the corresponding value from ref$replacement.
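
One thing to be aware of: element [3] above contains ˈfl7əz, while the expected result in the question has ˈfla3z. A single pass scans left to right, so aʊ is consumed before ʊə can match, whereas the chained gsub calls try ʊə first because it comes earlier in ref. The base R sketch below reproduces both behaviours on that one word; the variable names word, single and sequential are only illustrative.

```r
ref <- data.frame(
  diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
  replacement = 1:8
)
word <- "ˈflaʊəz"

# Single pass: locate every match of the alternation, then look each one up.
pat <- paste(ref$diphthong, collapse = "|")
m <- gregexpr(pat, word)
single <- word
regmatches(single, m) <- lapply(
  regmatches(word, m),
  function(x) as.character(ref$replacement[match(x, ref$diphthong)])
)

# Sequential: one gsub per row of ref, applied in row order.
sequential <- Reduce(
  function(s, i) gsub(ref$diphthong[i], as.character(ref$replacement[i]), s, fixed = TRUE),
  seq_len(nrow(ref)),
  word
)

single      # "ˈfl7əz"
sequential  # "ˈfla3z"
```

Which result is right depends on how overlapping sequences such as aʊə should be analysed; if the order in ref matters, a sequential approach is the safer choice.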

Wiktor Stribiżew answered Oct 16 '22 18:10


You can make a simple for loop:

for(i in seq_len(nrow(ref))) {
  str <- gsub(ref$diphthong[i], ref$replacement[i], str)
}
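
Note that the loop above overwrites str. If the original vector should be kept, the same idea can be wrapped in a function, as in this minimal sketch. The name replace_in_order is hypothetical, and fixed = TRUE is an added assumption that the patterns are literal strings rather than regular expressions.

```r
# Hypothetical helper: apply each (pattern, replacement) pair in order
# and return the result instead of overwriting the input.
replace_in_order <- function(x, patterns, replacements) {
  for (i in seq_along(patterns)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(patterns[i], as.character(replacements[i]), x, fixed = TRUE)
  }
  x
}

ref <- data.frame(
  diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
  replacement = 1:8
)
utterances <- c("aɪ nəʊ ɪts", "bɔɪ ðeɪl")
result <- replace_in_order(utterances, ref$diphthong, ref$replacement)
result  # "5 n8 ɪts" "b4 ð2l"
```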

GKi answered Oct 16 '22 18:10


A single call with a named vector is enough:

stringr::str_replace_all(str, ref)

where ref is defined as:

ref <- setNames(as.character(1:8), c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"))

If ref is already defined as a data frame, you can convert it to a named vector like this:

ref <- setNames(as.character(ref$replacement), ref$diphthong)

Edo answered Oct 16 '22 18:10