I have a data frame in R that contains the gene IDs of paralogous genes in Arabidopsis, looking something like this:
gene_x gene_y
AT1 AT2
AT3 AT4
AT1 AT2
AT1 AT3
AT2 AT1
with the 'ATx' corresponding to the gene names.
Now, for downstream analysis, I want to continue only with the unique pairs. Some pairs are simple duplicates and can be removed easily with the duplicated() function.
However, the fifth row in the artificial data frame above is also a duplicate, but in reversed order, so it is not picked up by duplicated(), nor by unique().
Any ideas on how to remove these rows?
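For reference, the problem can be reproduced with a small sketch like the one below. It assumes the table is stored in a data frame called mydf (the name used in the answers that follow) and uses only base R.

```r
# Rebuild the example table; `mydf` is the name the answers assume.
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# duplicated() compares whole rows as written, so only the exact
# repeat (row 3) is flagged.
exact_dup <- duplicated(mydf)
print(exact_dup)

# Dropping the flagged rows still leaves row 5 (AT2, AT1), because its
# column order differs from row 1 (AT1, AT2).
deduped <- mydf[!exact_dup, ]
print(deduped)
```

Four rows remain after this step, one more than the three unique pairs the question is after.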
The unique() function removes duplicate rows from an R data frame, or from the columns you select.
The distinct() function from dplyr can also filter out duplicate rows: pass the data frame and, optionally, the column names to compare.
To find the rows with duplicated values in a particular column, use duplicated() inside subset(). This returns only the repeated rows for the chosen column, so the first occurrence of each value is not in the output.
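As a rough illustration of the subset() and duplicated() combination described above, here is a base R sketch on the example data. The data frame name mydf is an assumption carried over from the answers below.

```r
# Example table from the question (assumed to be called `mydf`).
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Rows whose gene_x value has already appeared earlier in that column.
# The first occurrence of each value is not returned.
repeats <- subset(mydf, duplicated(gene_x))
print(repeats)

# unique() on a single selected column keeps one row per distinct value.
unique_x <- unique(mydf["gene_x"])
print(unique_x)
```

Both calls compare values exactly as stored, so neither treats the reversed pair in row 5 as a duplicate of row 1.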
A dplyr possibility could be:
mydf %>%
  group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-grp)
gene_x gene_y
<chr> <chr>
1 AT1 AT2
2 AT1 AT3
3 AT3 AT4
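The idea behind the grouping key can be seen more directly in base R. The sketch below is not the answer's own code: it builds the same kind of order-independent key with pmin() and pmax() and then filters with duplicated(). The names key and unique_pairs are chosen here for illustration.

```r
# Example table from the question (assumed to be called `mydf`).
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() work element-wise on character vectors, so each row gets
# the alphabetically smaller ID first, regardless of column order.
key <- paste(pmin(mydf$gene_x, mydf$gene_y),
             pmax(mydf$gene_x, mydf$gene_y),
             sep = "_")
print(key)

# Keep the first row for each key; reversed pairs share a key and are dropped.
unique_pairs <- mydf[!duplicated(key), ]
print(unique_pairs)
```

Unlike the grouped dplyr version, this keeps the surviving rows in their original order (rows 1, 2 and 4) rather than sorted by the key.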
Or:
mydf %>%
  group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  select(-grp)
Or:
mydf %>%
  group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
  distinct(grp, .keep_all = TRUE) %>%
  ungroup() %>%
  select(-grp)
Or using dplyr and purrr:
mydf %>%
  group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-grp)
As of purrr 0.3.0, invoke() is retired and exec() should be used instead:
mydf %>%
  group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
  slice(1) %>%
  ungroup() %>%
  select(-grp)
Or:
mydf %>%
  rowwise() %>%
  mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
  group_by(grp) %>%
  slice(1) %>%
  ungroup() %>%
  select(-grp)
Another tidyverse-centric approach, this time using purrr:
library(tidyverse)
# Sort the two IDs and join them into one order-independent key
c_sort_collapse <- function(...){
  c(...) %>%
    sort() %>%
    str_c(collapse = ".")
}
mydf %>%
  mutate(x_y = map2_chr(gene_x, gene_y, c_sort_collapse)) %>%
  distinct(x_y, .keep_all = TRUE) %>%
  select(-x_y)
#> gene_x gene_y
#> 1 AT1 AT2
#> 2 AT3 AT4
#> 3 AT1 AT3