I have phonetic transcriptions of utterances:
str <- c("aɪ nəʊ ɪts ɪts ðə sɪksθ əv ʤuːn",
"wɛl ðə ʧæp nɛkst dɔːz ˈfaɪndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ",
"lʌvli bu(ː)ˈkeɪ əv ˈflaʊəz fə mi wɛl ðæts ɪt",
"ðeə raɪt ləʊ ɪn ðə liːg ɑːnt ðeɪ",
"kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːlɪə naʊ",
"aɪ nəʊ s ðə biː ðə bɪg bɔɪ ðeɪl",
"jeə bət ɪt s ə məʊl aɪ kən əˈʃʊə juː",
"ɑː ʤəst eə haʊ aɪ juːzd tə dʊ jɪəz əˈgəʊ",
"jeə dəʊnt ˈwʌri əˈbaʊt mi æn aɪm ɔːlˈraɪt")
I want to replace all diphthongs with numbers; the diphthongs and their matching replacement numbers are stored in a reference dataframe:
ref <- data.frame(
diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
replacement = 1:8
)
I can replace each diphthong individually using gsub, store the result in a new vector, replace the next diphthong in that new vector, and so on:
a <- gsub("ɪə", "1", str)
b <- gsub("eɪ", "2", a)
c <- gsub("ʊə", "3", b)
d <- gsub("ɔɪ", "4", c)
e <- gsub("aɪ", "5", d)
f <- gsub("eə", "6", e)
g <- gsub("aʊ", "7", f)
h <- gsub("əʊ", "8", g)
While this gets me the desired result (see below), this method is repetitive and far from elegant. How can I achieve the replacements in one go?
Expected result:
[1] "5 n8 ɪts ɪts ðə sɪksθ əv ʤuːn" "wɛl ðə ʧæp nɛkst dɔːz ˈf5ndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ"
[3] "lʌvli bu(ː)ˈk2 əv ˈfla3z fə mi wɛl ðæts ɪt" "ð6 r5t l8 ɪn ðə liːg ɑːnt ð2"
[5] "kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːl1 n7" "5 n8 s ðə biː ðə bɪg b4 ð2l"
[7] "j6 bət ɪt s ə m8l 5 kən əˈʃ3 juː" "ɑː ʤəst 6 h7 5 juːzd tə dʊ j1z əˈg8"
[9] "j6 d8nt ˈwʌri əˈb7t mi æn 5m ɔːlˈr5t"
You may create a regex out of the diphthong data to match each separate diphthong and use a single pass over the data replacing each match with the corresponding value from the replacement column:
library(stringr)
str <- c("aɪ nəʊ ɪts ɪts ðə sɪksθ əv ʤuːn",
"wɛl ðə ʧæp nɛkst dɔːz ˈfaɪndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ",
"lʌvli bu(ː)ˈkeɪ əv ˈflaʊəz fə mi wɛl ðæts ɪt",
"ðeə raɪt ləʊ ɪn ðə liːg ɑːnt ðeɪ",
"kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːlɪə naʊ",
"aɪ nəʊ s ðə biː ðə bɪg bɔɪ ðeɪl",
"jeə bət ɪt s ə məʊl aɪ kən əˈʃʊə juː",
"ɑː ʤəst eə haʊ aɪ juːzd tə dʊ jɪəz əˈgəʊ",
"jeə dəʊnt ˈwʌri əˈbaʊt mi æn aɪm ɔːlˈraɪt")
ref <- data.frame(
diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
replacement = 1:8
)
pat <- paste(ref$diphthong, collapse="|")
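# match() is vectorised, so the lookup works whether the function receives one match or several;
# as.character() makes sure the replacement is returned as a string
str_replace_all(str, pat, function(x) as.character(ref$replacement[match(x, ref$diphthong)]))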
Output:
[1] "5 n8 ɪts ɪts ðə sɪksθ əv ʤuːn"
[2] "wɛl ðə ʧæp nɛkst dɔːz ˈf5ndɪŋ ɪt ˈvɛri əˈmjuːzɪŋ"
[3] "lʌvli bu(ː)ˈk2 əv ˈfl7əz fə mi wɛl ðæts ɪt"
[4] "ð6 r5t l8 ɪn ðə liːg ɑːnt ð2"
[5] "kɔː wi θɔːt wi wɪʃt wiːd lɛft ˈɜːl1 n7"
[6] "5 n8 s ðə biː ðə bɪg b4 ð2l"
[7] "j6 bət ɪt s ə m8l 5 kən əˈʃ3 juː"
[8] "ɑː ʤəst 6 h7 5 juːzd tə dʊ j1z əˈg8"
[9] "j6 d8nt ˈwʌri əˈb7t mi æn 5m ɔːlˈr5t"
In this case, the regex is built using paste(ref$diphthong, collapse="|") and is just an alternation-based pattern, ɪə|eɪ|ʊə|ɔɪ|aɪ|eə|aʊ|əʊ. The function passed as the replacement looks up each matched diphthong in ref$diphthong and returns the corresponding value from ref$replacement.
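Note that a single-pass alternation and the chained gsub calls from the question can disagree where two diphthongs overlap. In "ˈflaʊəz" the regex engine reaches "a" first and consumes "aʊ", so "ʊə" is never matched; the chained calls replace "ʊə" first because it comes earlier in ref. That is why the output above shows ˈfl7əz while the expected result in the question has ˈfla3z. A minimal base R sketch of the difference (the variable names single_pass and sequential are only illustrative, and a UTF-8 session is assumed):

```r
ref <- data.frame(
  diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
  replacement = 1:8
)
x <- "ˈflaʊəz"

# Single pass: one alternation pattern, leftmost match wins
pat <- paste(ref$diphthong, collapse = "|")
m <- gregexpr(pat, x)
single_pass <- x
regmatches(single_pass, m) <- lapply(
  regmatches(x, m),
  function(hit) as.character(ref$replacement[match(hit, ref$diphthong)])
)

# Sequential: one literal replacement per row of ref, in row order
sequential <- x
for (i in seq_len(nrow(ref))) {
  sequential <- gsub(ref$diphthong[i], as.character(ref$replacement[i]),
                     sequential, fixed = TRUE)
}

single_pass  # "ˈfl7əz"
sequential   # "ˈfla3z"
```

If the question's result is the one you need, use one of the sequential approaches; if leftmost-match behaviour is what you want, use the single-pass version.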
You can make a simple for loop:
for(i in seq_len(nrow(ref))) {
str <- gsub(ref$diphthong[i], ref$replacement[i], str)
}
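The loop above overwrites str in place. If you would rather keep the original vector, the same sequential replacement can be wrapped in a small function. This is only a sketch: replace_seq is a hypothetical helper name, and fixed = TRUE is an added assumption that the diphthongs should be treated as literal strings rather than regular expressions.

```r
ref <- data.frame(
  diphthong = c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"),
  replacement = 1:8
)

# Apply each (pattern, replacement) pair in order, feeding the result
# of one gsub() call into the next; the input vector is left unchanged
replace_seq <- function(x, patterns, replacements) {
  Reduce(
    function(acc, i) gsub(patterns[i], as.character(replacements[i]), acc, fixed = TRUE),
    seq_along(patterns),
    x
  )
}

utt <- "aɪ nəʊ s ðə biː ðə bɪg bɔɪ ðeɪl"
res <- replace_seq(utt, ref$diphthong, ref$replacement)
res  # "5 n8 s ðə biː ðə bɪg b4 ð2l"
```

Because gsub is vectorised over its input, replace_seq(str, ref$diphthong, ref$replacement) handles the whole vector of utterances in one call.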
This is enough:
stringr::str_replace_all(str, ref)
where ref is defined as:
ref <- setNames(as.character(1:8), c("ɪə", "eɪ", "ʊə", "ɔɪ", "aɪ", "eə", "aʊ", "əʊ"))
In case ref is already defined as a data frame, you can convert it to a named vector this way:
ref <- setNames(as.character(ref$replacement), ref$diphthong)