
Replace multiple words in R easily; str_replace_all gives error that two objects are not equal lengths

I'm trying to use str_replace_all to replace many different values (e.g. "Mod", "M2", "M3", "Interviewer") with one consistent string (e.g. "Moderator:"). I'm doing this for several different categories, and I want to avoid writing out each unique value because there are a lot of them.

So I made a tibble of all the unique values that I want to standardize, read it in, and pulled out each column as a vector (there are 5 columns, but only 2 are shown for simplicity):

speak_names <- read_csv("speak_names.csv")
speak_namesMisc <- dplyr::pull(speak_names, Misc)
speak_namesMod <- dplyr::pull(speak_names, Moderator)

For the replacement value, I made a character vector of equal length to those above vectors because I know that the replacement and pattern must be equal lengths:

Misc <- rep("Misc:", 2)
Mod <- rep("Moderator:", 28)

When I run Misc through with this code, it works just fine:

atas_clean$speaker <- str_replace_all(atas_clean$speaker, speak_namesMisc, Misc)

But when I try the identical Moderator version (even if I attempt to run it before Misc), I get an error message:

atas_clean$speaker <- str_replace_all(atas_clean$speaker, speak_namesMod, 
Mod)

Warning message:
In stri_replace_all_regex(string, pattern, fix_replacement(replacement),  :
longer object length is not a multiple of shorter object length

I don't know why I'm getting this message, because the following check returns TRUE:

identical(length(speak_namesMod), length(Mod))

The dataframe that I'm working with is 16,244 lines long if that makes any difference to the pattern or replacement. I'm stuck and trying to find out why this isn't working and/or another solution that does not involve typing out each character element in the vectors.

Thank you!

J.Sabree asked Jan 02 '23 06:01



1 Answer

library('dplyr') # load the dplyr package
library('stringr') # load the stringr package

# Here is a sample of my own dataset to answer your question; dput() of my data gives:

abc<-as.data.frame(
structure(list(Name = c("ME-9_ 005", "ME-9_ 004", "ME-9_ 003", 
                        "ME-9_ 002", "ME-9_ 001", "ME-9_ 000", "ME-8_ 005", "ME-8_ 004", 
                        "ME-8_ 003", "ME-8_ 002", "ME-8_ 001", "ME-8_ 000", "ME-7_ 005", 
                        "ME-7_ 004", "ME-7_ 003", "ME-7_ 002", "ME-7_ 001", "ME-7_ 000"
), Mg = c(0.411058647473409, 0.361611969040526, 0.435757145931429, 
          0.36656632349025, 0.312782034685408, 0.357913661160629, 0.414639893651842, 
          0.460992875568015, 0.554803107534663, 0.418743792959099, 0.499114614445091, 
          0.475374442706501, 0.564660334010035, 0.502678818989733, 0.417617035801997, 
          0.488463005872639, 0.484776757286094, 0.424850010858818),
Al = c(0.575667101719941,  0.586351493923602, 0.574053324307634, 0.628497798862674, 0.552234153060378, 
       0.580547408629286, 1.05746950789483, 1.07094531357244, 1.11340157804305, 
       1.03043684466386, 1.02899468191215, 1.07222457991059, 1.5276908007952, 
       1.66549994904359, 1.43287302441973, 1.37434198093964, 1.55835986529032, 
       1.66902429579112), 
Si = c(0.495188340689301, 0.513374456164654, 
       0.51809643007659, 0.569128515813393, 0.542590350648068, 0.516673370168739, 
       1.72437228079744, 1.59076392020817, 1.77327433861292, 1.76671780355934, 
       1.60625706442694, 1.92449284567535, 3.27248599245035, 3.23739024834759, 
       2.84115179036218, 2.51112086010829, 2.98829002803169, 2.93347114563903
), 
P = c(0.222881184902066, 0.258237982165306, 0.230235867213535, 
      0.262379290809071, 0.230438623604524, 0.238615393939999, 0.260241811918024, 
      0.238785817517132, 0.248589968755681, 0.248270048794532, 0.272489046130942, 
      0.266707140244041, 0.25935282543278, 0.258801008935983, 0.250692297246152, 
      0.246890941447243, 0.277698144829677, 0.274197618349091)), 
row.names = c(NA, 
              -18L), class = c("tbl_df", "tbl", "data.frame")))

#here is how my data looked before cleaning

head(abc,10)

(Screenshot: data before cleaning)

But for your specific question, you should do

abc$Name <- str_replace_all(
  abc$Name, # column we want to search
  c("001" = "","002" = "","003" = "","004" = "","005" = "","000" = "",
    "-" = " ","_" = "") # each string should be matched with a replacement
)
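
As far as I can tell, the named-vector form matters because two unnamed vectors are paired with the strings position by position instead of every pattern being tried on every string. That would also explain the warning in the question: 16,244 rows is a multiple of 2 but not of 28, so the Misc call was recycled silently while the Moderator call complained. Here is a minimal sketch of the difference; the `speaker` values are made up for illustration, and the `^`/`$` anchors are my own addition so that an earlier replacement such as "Moderator:" is not matched again by a later pattern like "Mod".

```r
library(stringr)

# Hypothetical toy data standing in for atas_clean$speaker
speaker <- c("Mod", "M2", "Interviewer", "Guest")

# Unnamed pattern vector: element i of `speaker` is only checked against
# element i of the pattern, so none of these rows match
paired <- str_replace_all(speaker,
                          c("M2", "Mod", "Guest", "Interviewer"),
                          "Moderator:")

# Named vector (pattern = replacement): every pattern is applied to every element
lookup <- c("^Mod$" = "Moderator:",
            "^M2$" = "Moderator:",
            "^Interviewer$" = "Moderator:")
fixed <- str_replace_all(speaker, lookup)

print(paired)
print(fixed)
```

Because the patterns are applied one after another, unanchored patterns can rewrite text that an earlier pattern has just produced, so it is worth checking the result on a few rows.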

#here is how my data looked after cleaning

head(abc,10)

(Screenshot: data after cleaning)
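
To avoid typing each value by hand, the named vector can probably be built from the vectors you already pulled out of the CSV. The sketch below is only an assumption about your data: the sample values are invented, `make_lookup` is a hypothetical helper name, and it assumes the speaker column holds exactly the listed values with no regex metacharacters in them.

```r
library(stringr)

# Hypothetical stand-ins for the vectors pulled from speak_names.csv
speak_namesMod  <- c("Mod", "M2", "M3", "Interviewer", NA)
speak_namesMisc <- c("Unknown", "Other")
speaker <- c("Mod", "Interviewer", "Other", "P1", "M3")

# Hypothetical helper: one anchored pattern per value, all mapped to `label`
make_lookup <- function(values, label) {
  values <- values[!is.na(values)]  # drop padding NAs from uneven CSV columns
  setNames(rep(label, length(values)), paste0("^", values, "$"))
}

# Combine the categories into a single pattern = replacement vector
lookup <- c(make_lookup(speak_namesMod, "Moderator:"),
            make_lookup(speak_namesMisc, "Misc:"))

cleaned <- str_replace_all(speaker, lookup)
print(cleaned)
```

With your data the last step would be something like `atas_clean$speaker <- str_replace_all(atas_clean$speaker, lookup)`. If the cells contain more than the bare name, the anchors would need to be relaxed.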

I hope this helps

Hammao answered Jan 04 '23 19:01
