
Conditional replacement if the "correct" value exists

Tags: r, duplicates

My data consists of two variables: an id and a corresponding name. The name is either the id itself or a string of letters.

If a non-numeric name exists for an id, I need to replace any numeric names for that id with this value.

Data example

df <- data.frame(id = c("100", "100", "101", "102", "103", "104", "104", "105", "100", "106"), 
                 name = c("100", "A", "B", "C", "D", "104", "E", "F", "100", "106"), 
                 correct_name = c("A", "A", "B", "C", "D", "E", "E", "F", "A", "106"), stringsAsFactors = FALSE)

The third column gives the desired result.

I've been messing around with %in%, duplicated and group_by, but have been unable to get anywhere.

EDIT: I missed a crucial part - there can be instances of a character name not existing. Updated the example - sorry!

Thorst asked Apr 09 '26 18:04


2 Answers

EDIT

Since you mentioned that some ids have no non-numeric name to use as a replacement, we can modify the ave option to check the condition and replace the values in a single call.

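df$name <- with(df, ave(name, id, FUN = function(x) {
   # TRUE where the name contains a digit
   inds <- grepl("[0-9]+", x)
   if (any(!inds)) 
    # replace digit names with the first non-digit name in the group
    replace(x, inds, x[which.max(!inds)])
   else
    # no non-digit name for this id: keep the values unchanged
    x
}))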

df
#    id name correct_name
#1  100    A            A
#2  100    A            A
#3  101    B            B
#4  102    C            C
#5  103    D            D
#6  104    E            E
#7  104    E            E
#8  105    F            F
#9  100    A            A
#10 106  106          106
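As a quick sanity check, the same idea can be run against the question's data and compared with the correct_name column. This is only a sketch: the helper name fix_names and the column name_fixed are made up for illustration, and it assumes a "numeric" name consists entirely of digits (hence the anchored pattern "^[0-9]+$", which is slightly stricter than the pattern used above).

```r
# Data from the question
df <- data.frame(
  id = c("100", "100", "101", "102", "103", "104", "104", "105", "100", "106"),
  name = c("100", "A", "B", "C", "D", "104", "E", "F", "100", "106"),
  correct_name = c("A", "A", "B", "C", "D", "E", "E", "F", "A", "106"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within one id group, replace all-digit names
# with the first non-digit name, if the group has one
fix_names <- function(x) {
  is_num <- grepl("^[0-9]+$", x)
  if (any(!is_num)) replace(x, is_num, x[!is_num][1]) else x
}

df$name_fixed <- ave(df$name, df$id, FUN = fix_names)

# Compare with the desired result given in the question
identical(df$name_fixed, df$correct_name)
```

Writing to a new column instead of overwriting name keeps the original values available for comparison.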

Original Answer

Assuming every id has only one unique name, we can do a double replace with dplyr: first change the names that contain a number to NA, then replace those NAs with the first non-NA value in the group.

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(name = replace(name, grepl("[0-9]+", name), NA), 
         name = replace(name, is.na(name), name[!is.na(name)][1]))

#  id   name  correct_name
#  <chr> <chr> <chr>       
#1 100   A     A           
#2 100   A     A           
#3 101   B     B           
#4 102   C     C           
#5 103   D     D           
#6 104   E     E           
#7 104   E     E           
#8 105   F     F           
#9 100   A     A      

The same logic with base R ave:

# Replace the numbers with NA
df$name[grepl("[0-9]+", df$name)] <- NA

# Change the NAs to the first non-NA value in the group
df$name <- with(df, ave(name, id, FUN = function(x) x[!is.na(x)][1]))
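With the updated data from the question, an id such as 106 has no non-numeric name, so this two-step version would leave NA for it. One possible way to guard against that, sketched here under the assumption that the original value should be kept in such cases, is to save the names beforehand and restore them wherever the fill produced NA. The variable name orig is made up for illustration.

```r
# Data from the question (including id 106, which has no letter name)
df <- data.frame(
  id = c("100", "100", "101", "102", "103", "104", "104", "105", "100", "106"),
  name = c("100", "A", "B", "C", "D", "104", "E", "F", "100", "106"),
  stringsAsFactors = FALSE
)

# Keep a copy of the original names for the fallback
orig <- df$name

# Step 1: replace names containing digits with NA
df$name[grepl("[0-9]+", df$name)] <- NA

# Step 2: use the first non-NA name per id (NA if the group has none)
df$name <- with(df, ave(name, id, FUN = function(x) x[!is.na(x)][1]))

# Step 3: where no replacement was found, restore the original name
still_na <- is.na(df$name)
df$name[still_na] <- orig[still_na]

df$name
```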

Another option is to use tidyr's fill in both directions:

library(tidyverse)
df %>%
  mutate(name = replace(name, grepl("[0-9]+", name), NA)) %>%
  group_by(id) %>%
  fill(name) %>%  #default direction is "down"
  fill(name, .direction = "up")

#  id    name  correct_name
#  <chr> <chr> <chr>       
#1 100   A     A           
#2 100   A     A           
#3 100   A     A           
#4 101   B     B           
#5 102   C     C           
#6 103   D     D           
#7 104   E     E           
#8 104   E     E           
#9 105   F     F   
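In tidyr 1.0.0 and later, the two fill calls can be collapsed into one with .direction = "downup". The sketch below also falls back to the original name when an id has no non-numeric name at all; the column name filled is made up for illustration, and the fallback via coalesce is an assumption about the desired behaviour rather than part of the answer above.

```r
library(dplyr)
library(tidyr)

# Data from the question (including id 106, which has no letter name)
df <- data.frame(
  id = c("100", "100", "101", "102", "103", "104", "104", "105", "100", "106"),
  name = c("100", "A", "B", "C", "D", "104", "E", "F", "100", "106"),
  stringsAsFactors = FALSE
)

res <- df %>%
  # blank out names containing digits in a working column
  mutate(filled = replace(name, grepl("[0-9]+", name), NA)) %>%
  group_by(id) %>%
  # fill down first, then up, within each id
  fill(filled, .direction = "downup") %>%
  ungroup() %>%
  # groups with no letter name are still NA: keep their original name
  mutate(filled = coalesce(filled, name))

res$filled
```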

PS - I added stringsAsFactors = FALSE to your data.frame call so that the columns are character rather than factor.

Ronak Shah answered Apr 11 '26 06:04


A solution with dplyr that uses ifelse plus grepl with the pattern set to "\\d+" (i.e. one or more digits).

Edit: it's possible to have just one mutate:

df %>% 
  group_by(id) %>% 
  mutate(namenew = ifelse(
    grepl("\\d+", name),   # match for digits in the string
    name[!grepl("\\d+", name)][1], # if TRUE, substitute with the first non-digit
    name # if FALSE, keep it
  )) 
#    id name correct_name namenew
# 1 100  100            A       A
# 2 100    A            A       A
# 3 101    B            B       B
# 4 102    C            C       C
# 5 103    D            D       D
# 6 104  104            E       E
# 7 104    E            E       E
# 8 105    F            F       F
# 9 100  100            A       A

The following version may make it clearer what is happening than the single mutate above (it is similar to @Ronak Shah's approach):

library(dplyr)
df %>% 
  group_by(id) %>%
  mutate(namenew = ifelse(
    grepl("\\d+", name), 
    NA,
    name
  )) %>% 
  mutate(namenew = ifelse(
    is.na(namenew),
    namenew[!is.na(namenew)][1],
    namenew
  ))


#    id name correct_name namenew
# 1 100  100            A       A
# 2 100    A            A       A
# 3 101    B            B       B
# 4 102    C            C       C
# 5 103    D            D       D
# 6 104  104            E       E
# 7 104    E            E       E
# 8 105    F            F       F
# 9 100  100            A       A
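Neither version above was written for the updated example, where id 106 has no non-numeric name; there the lookup for the first non-digit name returns NA. A minimal sketch of one way to handle this, assuming the original name should be kept in that case, is shown below. The helper column first_alpha is made up for illustration.

```r
library(dplyr)

# Data from the question (including id 106, which has no letter name)
df <- data.frame(
  id = c("100", "100", "101", "102", "103", "104", "104", "105", "100", "106"),
  name = c("100", "A", "B", "C", "D", "104", "E", "F", "100", "106"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(id) %>%
  mutate(
    # first non-digit name in the group, or NA if the group has none
    first_alpha = name[!grepl("\\d+", name)][1],
    # substitute only when the name has digits AND a replacement exists
    namenew = if_else(grepl("\\d+", name) & !is.na(first_alpha),
                      first_alpha, name)
  ) %>%
  ungroup() %>%
  select(-first_alpha)

res$namenew
```

if_else is used here instead of ifelse because it checks that both branches have the same type, which avoids silent coercion.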

Data (stringsAsFactors is important):

df <- data.frame(id = c("100", "100", "101", "102", "103", "104", "104", "105", "100"), 
                 name = c("100", "A", "B", "C", "D", "104", "E", "F", "100"), 
                 correct_name = c("A", "A", "B", "C", "D", "E", "E", "F", "A"), stringsAsFactors = FALSE)

RLave answered Apr 11 '26 06:04


