
Replace specific chr values within groups for multiple variables in R

Tags:

r

1. Summarize the problem

Hi, I'm relatively new to R and this is my first question on Stack Overflow, but I've been learning from this site for a while. I found similar questions, but they explain how to remove missing values, work with numerical values, or only work for a small number of IDs.

I have a large data frame (200 000+ rows) where one variable is an alphanumeric ID that represents unique candidates and other variables represent different characteristics. Some candidates are included multiple times in the file, but have different values for the same characteristic. I want to resolve these discrepancies to be able to remove duplicates later. The data structure is similar to this:

library(dplyr)  # provides tibble(), %>%, group_by(), mutate(), distinct()

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

My goal is to first create subgroups based on ID, then check within each ID whether there is at least one value of “Yes” and, if so, change all of that ID's values to “Yes”. I want to repeat this for a few variables (var1, var2, var3). This is the result that I would like to have:

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
              var1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
              var2 = c("No", "No", "No", "Yes", "Yes", "No"),
              var3 = c("No", "No", "No", "No", "No", "Yes"))

After this, I will remove duplicate rows to only keep the data that I need.

# With no columns specified, distinct() compares all columns.
df <- distinct(df)

2. Describe what you’ve tried

I found partial solutions but I'm having difficulty putting them together. I can group my data by ID using group_by from dplyr, but I'm having issues applying my other functions to the groups:

df <- df %>% group_by(ID)

I can replace the “No” with “Yes” using if combined with any, but without the groups this changes every value in var1:

if (any(df$var1 == "Yes")) {
  df$var1 <- "Yes"
}

The solution I'm trying to create would be similar to the one in "Creating loop for slicing the data, loop through the duplicated positions": using for to loop over the IDs and then over the variables, but without replacing with random values.

Max asked Jun 17 '21 15:06



1 Answer

I've promoted my comment to an answer to explain more.

First, we need to decide if we want to use dplyr::summarise or dplyr::mutate. summarise makes a single row for every group, whereas mutate leaves the data the same dimensions.

In your example data, all of the rows within each group will be the same after the transformation, so do you really need the duplicates? Perhaps your real data has other variables, so mutate might make sense.

From here, we just need to use dplyr::across to do the same action on each column. The first argument is to select the columns, and the second is the function you want to apply.
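For the summarise route, a minimal sketch might look like the following. It assumes the columns hold only "Yes" or "No" with no missing values, and that no other columns need to be carried along; the name collapsed is only illustrative.

```r
library(dplyr)

df <- tibble(
  ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes")
)

# One row per ID: "Yes" if any row in the group says "Yes", otherwise "No".
# Assumption: values are only "Yes"/"No" and never NA.
collapsed <- df %>%
  group_by(ID) %>%
  summarise(across(var1:var3, ~ if (any(.x == "Yes")) "Yes" else "No"),
            .groups = "drop")
```

Because summarise already returns one row per ID, no separate deduplication step is needed in this variant.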

For mutate, we can use ifelse (from base R, not dplyr) to test whether any value in the group is "Yes". If it is, we can repeat "Yes" as many times as there are rows in that group. Otherwise, we can leave the data alone. Inside across, the current column is represented by a dot (.).

df %>% 
  group_by(ID) %>%
  mutate(across(var1:var3, ~ ifelse(any(. == "Yes"),rep("Yes",length(.)),.)))
# A tibble: 6 x 4
# Groups:   ID [3]
  ID     var1  var2  var3 
  <chr>  <chr> <chr> <chr>
1 123abc Yes   No    No   
2 123abc Yes   No    No   
3 123abc Yes   No    No   
4 456def No    Yes   No   
5 456def No    Yes   No   
6 789ghi No    No    Yes  
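
One caveat: ifelse returns a result the same length as its test, and any() yields a single value, so the "otherwise" branch actually returns only the first element of the column, which mutate then recycles across the group. That is harmless when a column holds only "Yes" and "No", but it would overwrite other values. A sketch of a variant using a plain if/else, followed by the deduplication step from the question, could look like this. The helper name fill_yes is hypothetical, and treating NA as "not Yes" via na.rm = TRUE is an assumption.

```r
library(dplyr)

df <- tibble(
  ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
  var1 = c("No", "Yes", "No", "No", "No", "No"),
  var2 = c("No", "No", "No", "Yes", "No", "No"),
  var3 = c("No", "No", "No", "No", "No", "Yes")
)

# Hypothetical helper: if any value in x is "Yes", return all "Yes";
# otherwise return x unchanged, element by element.
# Assumption: NA values should not prevent a group from becoming "Yes".
fill_yes <- function(x) {
  if (any(x == "Yes", na.rm = TRUE)) rep("Yes", length(x)) else x
}

resolved <- df %>%
  group_by(ID) %>%
  mutate(across(var1:var3, fill_yes)) %>%
  ungroup()

# Rows within each ID are now identical, so distinct() keeps one per ID.
deduplicated <- distinct(resolved)
```

With real data that has additional columns, distinct would keep more than one row per ID wherever those other columns still differ.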
Ian Campbell answered Sep 27 '22 15:09