1. Summarize the problem
Hi, I'm relatively new to R and this is my first question on Stack Overflow, but I’ve been learning from this site for a while. I found similar questions, but they explain how to remove missing values, work with numerical values, or only work for a small number of IDs.
I have a large data frame (200 000+ rows) where one variable is an alphanumeric ID that represents unique candidates and other variables represent different characteristics. Some candidates are included multiple times in the file, but have different values for the same characteristic. I want to resolve these discrepancies to be able to remove duplicates later. The data structure is similar to this:
library(dplyr)  # provides tibble(), group_by(), mutate(), across(), distinct()

df <- tibble(ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))
My goal is to first create subgroups based on ID, then check within each ID whether there is at least one value of “Yes”, and if so change all of that ID's values to “Yes”. I want to repeat this for a few variables (var1, var2, var3). This is the result that I would like to have:
df <- tibble(ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("Yes", "Yes", "Yes", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "Yes", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))
After this, I will remove duplicate rows to only keep the data that I need.
df <- distinct(df)  # with no columns named, distinct() compares whole rows
2. Describe what you’ve tried
I found partial solutions but I’m having difficulty putting them together. I can regroup my data by ID using group_by from dplyr, but I'm having issues applying my other functions to the groups:
df <- df %>% group_by(ID)
I can replace the “No” with “Yes” using if combined with any, but without the groups, it changes all the values in var1:
if (any(df$var1 == "Yes")) {
  df$var1 <- "Yes"
}
The solution I'm trying to create would be similar to "Creating loop for slicing the data, loop through the duplicated positions", by using for to loop over the IDs and then the variables, but without replacing with random values.
To replace a column value in R, use square bracket notation, df[]. With it you can update values in a single column or in all columns. To refer to a single column, use df$column_name.
Replacing column values based on a logical condition in an R data frame is straightforward: select the column vector you want to update and put the condition inside []. For example, a condition on a numeric column can decide which rows of another column are updated.
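As a minimal sketch of condition-based replacement in base R, assuming a small made-up data frame (the names scores, points, and passed are hypothetical and only for illustration):

```r
# Hypothetical data, for illustration only
scores <- data.frame(
  id     = c("a", "b", "c", "d"),
  points = c(35, 80, 52, 91),
  passed = c("No", "No", "No", "No"),
  stringsAsFactors = FALSE
)

# Logical vector marking the rows where the numeric condition holds
high <- scores$points >= 50

# Update a single column, but only in the rows selected by the condition
scores$passed[high] <- "Yes"

# The same idea written as df[rows, column]
scores[scores$points < 50, "passed"] <- "No"

print(scores)
```

Note that this works on the whole column at once; it has no notion of groups, which is why the question needs group_by as well.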
How do you group by multiple columns in an R data frame? Pass two or more columns to group_by() from the dplyr package, then use summarise() to aggregate one or more columns within each combination of groups.
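A short sketch of grouping by two columns and summarising two others, assuming made-up data (sales, region, product, units, and price are hypothetical names):

```r
library(dplyr)

# Hypothetical data, for illustration only
sales <- tibble(
  region  = c("East", "East", "East", "West", "West"),
  product = c("A", "A", "B", "A", "A"),
  units   = c(3, 5, 2, 4, 6),
  price   = c(10, 12, 20, 11, 9)
)

# One output row per (region, product) combination
totals <- sales %>%
  group_by(region, product) %>%
  summarise(
    total_units = sum(units),
    mean_price  = mean(price),
    .groups = "drop"  # return an ungrouped tibble
  )

print(totals)
```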
I've promoted my comment to an answer to explain more.
First, we need to decide if we want to use dplyr::summarise or dplyr::mutate. summarise makes a single row for every group, whereas mutate leaves the data the same dimensions.
In your example data, all of the rows within each group will be the same after the transformation, so do you really need the duplicates? Perhaps your real data has other variables, so mutate might make sense.
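If one row per ID is all you need, the summarise route could look roughly like this. It is a sketch that assumes the columns contain only "Yes" and "No" with no missing values; per_id is a name made up for the example:

```r
library(dplyr)

df <- tibble(ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

# Collapse each ID to a single row: "Yes" if any row in the group says "Yes"
per_id <- df %>%
  group_by(ID) %>%
  summarise(across(var1:var3, ~ if (any(.x == "Yes")) "Yes" else "No"))

print(per_id)
```

Because summarise returns one row per group, no separate de-duplication step is needed, but any columns not listed in across are dropped.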
From here, we just need to use dplyr::across to do the same action on each column. The first argument selects the columns, and the second is the function you want to apply.
For mutate, we can use ifelse (a base R function) to test whether any value in the column is "Yes". If so, we repeat "Yes" as many times as there are rows in that group. Otherwise, we leave the data alone. Inside across, the current column is represented by . (a single dot).
df %>%
  group_by(ID) %>%
  mutate(across(var1:var3, ~ ifelse(any(. == "Yes"), rep("Yes", length(.)), .)))
# A tibble: 6 x 4
# Groups:   ID [3]
  ID     var1  var2  var3
  <chr>  <chr> <chr> <chr>
1 123abc Yes   No    No
2 123abc Yes   No    No
3 123abc Yes   No    No
4 456def No    Yes   No
5 456def No    Yes   No
6 789ghi No    No    Yes
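One detail worth knowing about the ifelse() call above: ifelse() returns a result the same length as its test, and any(...) is a single value, so only the first element of the chosen branch comes back and mutate recycles it across the group. That gives the right answer here because a group without a "Yes" contains only "No". If your real columns can hold other values, a plain if/else keeps the untouched groups exactly as they were. The following is a sketch of that variant followed by the de-duplication step from the question; resolved and deduped are names made up for the example, and it assumes there are no missing values:

```r
library(dplyr)

df <- tibble(ID   = c("123abc", "123abc", "123abc", "456def", "456def", "789ghi"),
             var1 = c("No", "Yes", "No", "No", "No", "No"),
             var2 = c("No", "No", "No", "Yes", "No", "No"),
             var3 = c("No", "No", "No", "No", "No", "Yes"))

resolved <- df %>%
  group_by(ID) %>%
  # if/else evaluates one condition per group and returns a whole vector
  mutate(across(var1:var3,
                ~ if (any(.x == "Yes")) rep("Yes", length(.x)) else .x)) %>%
  ungroup()

# Rows within an ID are now identical, so distinct() keeps one copy of each
deduped <- distinct(resolved)

print(deduped)
```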