I want to remove duplicate rows from a data frame, considering specific columns only. This can be done with distinct:
library(dplyr)
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))
filtered_data <- data %>% distinct(a, b, .keep_all = TRUE)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. However, the column names I pass to distinct will change, so I have a character vector gen that contains the names of the columns I want to use with distinct. They need to be unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = TRUE)
dim(filtered_data)
# [1] 2 4
The eval seems to do something odd to the number of rows that get filtered (and it adds an extra column, though I could live with that). So, how do I obtain the same result as if I had written a, b, but by using a variable instead?
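A possible explanation for the 2 x 4 result, sketched in base R: parse(text = gen) produces two separate expressions, and eval() on such an expression vector returns only the value of the last one. If that is what happens inside distinct, the rows are deduplicated on b alone, and the unnamed expression presumably becomes the extra fourth column. The list cols below is a hypothetical stand-in for the tibble's columns.

```r
# Base R only; `cols` is a hypothetical stand-in for the tibble's columns.
cols <- list(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4))
gen <- c("a", "b")

# parse() yields an expression vector holding two expressions: a and b.
exprs <- parse(text = gen)
n_exprs <- length(exprs)

# eval() evaluates each expression in turn and returns only the last value,
# so the result is column b alone, not the pair (a, b).
last_value <- eval(exprs, cols)
same_as_b <- identical(last_value, cols$b)

# Column b has two unique values, which would match the 2 rows seen above.
n_unique <- length(unique(last_value))
```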
additional information
I actually obtain gen by reading the column names of a data frame: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect if I had a way to transform gen into c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquotes(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure whether every step is necessary, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
  out <- list()
  i <- 1
  for (s in string) {
    t <- ensym(s)
    out[i] <- dplyr::quos(!!t)
    i <- i + 1
  }
  return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...) {
  vars <- dplyr::quos(...)
  data %>% distinct(!!!vars, .keep_all = TRUE)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     2     3     5
3     2     4     5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     1     3     4
3     2     4     5
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars) {
  data %>% distinct(!!!vars, .keep_all = TRUE)
}
# quos your column name vector
gen <- quos(a, z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
      a     b     z
  <dbl> <dbl> <dbl>
1     1     3     5
2     1     3     4
3     2     3     5
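If the column names arrive as strings, as with gen <- colnames(data)[1:2] in the question, one possible shortcut is to convert the character vector to symbols with rlang::syms() and splice them into distinct with !!!. This is only a sketch: it assumes rlang is installed alongside dplyr, and the across(all_of(...)) variant assumes dplyr 1.0.0 or later. The names filtered_syms and filtered_across are made up for illustration.

```r
library(dplyr)

data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5, 4, 5, 5))

# Column names as plain strings, taken from the data itself.
gen <- colnames(data)[1:2]

# Turn each string into a symbol, then splice the symbols into distinct().
filtered_syms <- data %>% distinct(!!!rlang::syms(gen), .keep_all = TRUE)

# Alternative (assumes dplyr >= 1.0.0): select the columns by name via tidyselect.
filtered_across <- data %>% distinct(across(all_of(gen)), .keep_all = TRUE)

dim(filtered_syms)
```

Both variants should deduplicate on a and b without hardcoding the names, so no helper loop is needed to build the quosures.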