
Deleting reversed duplicates with R

I have a data frame in R that contains the gene ids of paralogous genes in Arabidopsis, looking something like this:

gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1

where each 'ATx' stands for a gene name.

For downstream analysis, I want to continue with only the unique pairs. Some pairs are exact duplicates and can be removed easily with the duplicated() function. However, the fifth row in the example data frame above is also a duplicate, just in reversed order, so neither duplicated() nor unique() picks it up.
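To make the problem reproducible, here is a minimal sketch of the example data and of what duplicated() and unique() do with it. The data frame name mydf is an assumption (it matches the name used in the answers), and the ids are stored as character strings rather than factors.

```r
# Example data from the table above; stringsAsFactors = FALSE keeps ids as character.
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# duplicated() compares whole rows column by column, so only row 3
# (an exact copy of row 1) is flagged; row 5 (AT2, AT1) is not.
dup_flags <- duplicated(mydf)
print(dup_flags)

# unique() drops only the exact copy, so the reversed pair survives.
print(unique(mydf))
```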

Any ideas on how to remove these rows?

KoenVdB asked Mar 31 '14 08:03




2 Answers

A dplyr possibility could be:

library(dplyr)

# Build an order-independent key per pair, then keep the first row of each key
mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

  gene_x gene_y
  <chr>  <chr> 
1 AT1    AT2   
2 AT1    AT3   
3 AT3    AT4  
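
The grouping key works because pmin() and pmax() put each pair into a fixed order, so (AT2, AT1) and (AT1, AT2) produce the same string. A base R sketch of the same idea, without dplyr, might look like the following. It assumes both columns are character vectors (not factors) and that the separator "_" never occurs inside an id; the names key and dedup are illustrative only.

```r
# Example data, rebuilt here so the sketch is self-contained.
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Order-independent key: larger id first, smaller id second.
key <- paste(pmax(mydf$gene_x, mydf$gene_y),
             pmin(mydf$gene_x, mydf$gene_y), sep = "_")

# Keep the first row seen for each key; the original row order is preserved.
dedup <- mydf[!duplicated(key), ]
print(dedup)
```

Unlike the group_by() version, which returns the rows ordered by the grouping key, this keeps the surviving rows in their original order.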

Or:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 filter(row_number() == 1) %>%
 ungroup() %>%
 select(-grp)

Or:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 distinct(grp, .keep_all = TRUE) %>%
 ungroup() %>%
 select(-grp)

Or using dplyr and purrr:

library(purrr)

# `.` is the piped data frame, so its columns become the arguments of pmax()/pmin()
mydf %>%
 group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

As of purrr 0.3.0, invoke() is retired; use exec() instead:

mydf %>%
 group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

Or:

mydf %>%
 rowwise() %>%
 mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
 group_by(grp) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

tmfmnk answered Oct 05 '22 00:10


Another tidyverse-centric approach but using purrr:

library(tidyverse)

c_sort_collapse <- function(...){
  c(...) %>% 
    sort() %>% 
    str_c(collapse = ".")
}

mydf %>% 
  mutate(x_y = map2_chr(gene_x, gene_y, c_sort_collapse)) %>% 
  distinct(x_y, .keep_all = TRUE) %>% 
  select(-x_y)
#>   gene_x gene_y
#> 1    AT1    AT2
#> 2    AT3    AT4
#> 3    AT1    AT3
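
The same sort-within-row idea can also be sketched in base R without building a string key at all, which sidesteps any worry about the separator appearing inside an id. This is only a sketch under stated assumptions: the columns hold character ids with no missing values (sort() drops NA by default, which would break the row-wise result), and the names sorted_pairs and dedup are illustrative.

```r
# Example data, rebuilt here so the sketch is self-contained.
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Sort the two ids within each row; apply() returns one column per row,
# so t() turns the result back into one row per pair.
sorted_pairs <- t(apply(mydf[, c("gene_x", "gene_y")], 1, sort))

# duplicated() on a matrix compares rows, so reversed pairs now match.
dedup <- mydf[!duplicated(sorted_pairs), ]
print(dedup)
```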

Bryan Shalloway answered Oct 05 '22 01:10