Replace specific chr values within groups for multiple variables in R

Tags:

r

1. Summarize the problem

Hi, I'm relatively new to R and this is my first question on Stack Overflow, but I’ve been learning from this site for a while. I found similar questions, but they explain how to remove missing values, work with numerical values, or only work for a small number of IDs.

I have a large data frame (200 000+ rows) where one variable is an alphanumeric ID that represents unique candidates and other variables represent different characteristics. Some candidates are included multiple times in the file, but have different values for the same characteristic. I want to resolve these discrepancies to be able to remove duplicates later. The data structure is similar to this:

library(dplyr)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

My goal is to first create subgroups based on ID, then search within each ID to see if it has at least one value of “Yes”, and if so change all of its values to “Yes”. I want to repeat this for a few variables (var1, var2, var3). These are the results that I would like to have:

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "Yes", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

After this, I will remove duplicate rows to only keep the data that I need.

df <- distinct(df)

2. Describe what you’ve tried

I found partial solutions but I’m having difficulty putting them together. I can group my data by ID using group_by from dplyr, but I'm having issues applying my other functions to the groups:

df <- df %>% group_by(ID)

I can replace the “No” with “Yes” using if combined with any(), but without the groups it changes all the values in var1:

if (any(df$var1 == "Yes")) {
  df$var1 <- "Yes"
}

The solution I'm trying to create would be similar to the one in "Creating loop for slicing the data, loop through the duplicated positions": using for to loop over the IDs and then over the variables, but without replacing with random values.


asked Jun 17 '21 15:06

Max

1 Answer

I've promoted my comment to an answer to explain more.

First, we need to decide if we want to use dplyr::summarise or dplyr::mutate. summarise makes a single row for every group, whereas mutate leaves the data the same dimensions.

In your example data, all of the rows within each group will be the same after the transformation, so do you really need the duplicates? Perhaps your real data has other variables, so mutate might make sense.
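
If one row per ID is enough, a summarise version might look like the sketch below. It assumes the columns hold only "Yes" and "No" (so a group without any "Yes" collapses to "No") and that no other columns need to be kept; the name collapsed is only illustrative.

```r
library(dplyr)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

# One row per ID: "Yes" if any row in the group is "Yes", otherwise "No".
# Assumption: the columns contain only "Yes" and "No".
collapsed <- df %>%
  group_by(ID) %>%
  summarise(across(var1:var3, ~ if (any(. == "Yes")) "Yes" else "No"))

collapsed
```

Because summarise already returns a single row per group, no separate deduplication step is needed with this approach.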

From here, we just need to use dplyr::across to do the same action on each column. The first argument is to select the columns, and the second is the function you want to apply.
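
As a small illustration of those two arguments, the sketch below applies a function to var1:var3 without any grouping. The names flags and upper are only illustrative, and the functions applied are arbitrary examples rather than part of the solution.

```r
library(dplyr)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

# First argument: which columns. Second argument: what to do with each one.
# Here a formula-style lambda turns every selected column into TRUE/FALSE.
flags <- df %>%
  mutate(across(var1:var3, ~ . == "Yes"))

# The second argument can also be an existing function passed by name.
upper <- df %>%
  mutate(across(var1:var3, toupper))
```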

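For mutate, we can use a plain if/else to test whether any value in the group is "Yes". If one is, we repeat "Yes" as many times as there are rows in that group. Otherwise, we leave the data alone. Note that base R's ifelse is not a good fit here: with a single TRUE/FALSE test it returns only one element, so a group without a "Yes" would be overwritten with its own first value. Inside across, the current column is represented by a dot (.).

df %>% 
  group_by(ID) %>%
  mutate(across(var1:var3, ~ if (any(. == "Yes", na.rm = TRUE)) rep("Yes", length(.)) else .))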
# A tibble: 6 x 4
# Groups:   ID [3]
  ID     var1  var2  var3 
  <chr>  <chr> <chr> <chr>
1 123abc Yes   No    No   
2 123abc Yes   No    No   
3 123abc Yes   No    No   
4 456def No    Yes   No   
5 456def No    Yes   No   
6 789ghi No    No    Yes
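
To finish the workflow described in the question, the grouped mutate can be followed by ungroup() and distinct(). The sketch below is one way to do that on the example data; it uses a plain if/else for the per-group test, and the name result is only illustrative.

```r
library(dplyr)

df <- tibble(ID = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

result <- df %>%
  group_by(ID) %>%
  # Within each ID, set the whole column to "Yes" if any value is "Yes".
  mutate(across(var1:var3,
                ~ if (any(. == "Yes", na.rm = TRUE)) rep("Yes", length(.)) else .)) %>%
  # Drop the grouping, then keep only one copy of each identical row.
  ungroup() %>%
  distinct()

result
```

With real data that has additional columns, rows for the same ID stay separate wherever those other columns differ, since distinct() with no column arguments compares entire rows.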


answered Sep 27 '22 15:09

Ian Campbell
