Deleting reversed duplicates with R

I have a data frame in R that contains the gene IDs of paralogous genes in Arabidopsis. It looks something like this:

gene_x    gene_y
AT1       AT2
AT3       AT4
AT1       AT2
AT1       AT3
AT2       AT1

where each 'ATx' stands for a gene name.

For downstream analysis, I want to continue with only the unique pairs. Some pairs are exact duplicates and can be removed easily with the duplicated() function. However, the fifth row in the artificial data frame above is also a duplicate, just in reversed order, so neither duplicated() nor unique() picks it up.

Any ideas on how to remove these rows?
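To make the problem concrete, here is a minimal sketch of the example data. The name mydf is an assumption (it is the name the answers use), and the columns are assumed to be character vectors rather than factors.

```r
# Example data from the question; `mydf` is an assumed name.
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# duplicated() compares whole rows in column order, so it flags the
# exact repeat in row 3 but not the reversed pair in row 5.
dup_flags <- duplicated(mydf)
dup_flags
#> [1] FALSE FALSE  TRUE FALSE FALSE
```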

218

asked Mar 31 '14 08:03

KoenVdB

2 Answers

A dplyr possibility could be:

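library(dplyr)

mydf %>%
 # Build an order-independent key so that (AT1, AT2) and (AT2, AT1) share a group
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 # Keep the first row of each group
 slice(1) %>%
 ungroup() %>%
 select(-grp)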

  gene_x gene_y
  <chr>  <chr> 
1 AT1    AT2   
2 AT1    AT3   
3 AT3    AT4
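The same order-independent key can also be built without dplyr. The following is a minimal base R sketch, not part of the original answer; it assumes character columns, and the names key and dedup are illustrative only.

```r
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# pmin()/pmax() work element-wise, so each row gets the key "smaller_larger"
# regardless of which column held which gene.
key <- paste(pmin(mydf$gene_x, mydf$gene_y),
             pmax(mydf$gene_x, mydf$gene_y),
             sep = "_")

# Keep only the first row for each key.
dedup <- mydf[!duplicated(key), ]
dedup
```

Unlike the grouped dplyr version, this keeps the rows in their original order (rows 1, 2 and 4 of the example).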

Or:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 filter(row_number() == 1) %>%
 ungroup() %>%
 select(-grp)

Or:

mydf %>%
 group_by(grp = paste(pmax(gene_x, gene_y), pmin(gene_x, gene_y), sep = "_")) %>%
 distinct(grp, .keep_all = TRUE) %>%
 ungroup() %>%
 select(-grp)

Or using dplyr and purrr:

library(purrr)

mydf %>%
 group_by(grp = paste(invoke(pmax, .), invoke(pmin, .), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

As of purrr 0.3.0, invoke() is retired, so exec() should be used instead:

mydf %>%
 group_by(grp = paste(exec(pmax, !!!.), exec(pmin, !!!.), sep = "_")) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)

Or:

mydf %>%
 rowwise() %>%
 mutate(grp = paste(sort(c(gene_x, gene_y)), collapse = "_")) %>%
 group_by(grp) %>%
 slice(1) %>%
 ungroup() %>%
 select(-grp)
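The row-wise sorting idea also has a compact base R form. This is a sketch under the assumption that all columns are character vectors (apply() converts the data frame to a matrix first); sorted_pairs and dedup are illustrative names, not from the original answer.

```r
mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Sort the values within each row; apply() returns one column per input row,
# so transpose to get one sorted pair per row again.
sorted_pairs <- t(apply(mydf, 1, sort))

# duplicated() on a matrix compares rows, so reversed pairs now match.
dedup <- mydf[!duplicated(sorted_pairs), ]
dedup
```

Because the rows are sorted rather than compared pairwise, the same pattern should carry over to more than two columns.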

198

answered Oct 05 '22 00:10

tmfmnk

Another tidyverse-centric approach, this time using purrr:

library(tidyverse)

# Combine the inputs, sort them, and join them into one order-independent key
c_sort_collapse <- function(...){
  c(...) %>% 
    sort() %>% 
    str_c(collapse = ".")
}

mydf %>% 
  mutate(x_y = map2_chr(gene_x, gene_y, c_sort_collapse)) %>% 
  distinct(x_y, .keep_all = TRUE) %>% 
  select(-x_y)
#>   gene_x gene_y
#> 1    AT1    AT2
#> 2    AT3    AT4
#> 3    AT1    AT3
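Whichever variant is used, the result can be sanity-checked by testing whether any unordered pair still occurs more than once. The helper below is hypothetical (has_unordered_duplicates is not from any package) and assumes character columns named gene_x and gene_y.

```r
# Hypothetical helper: TRUE if any pair appears more than once,
# ignoring the order of gene_x and gene_y.
has_unordered_duplicates <- function(df) {
  key <- paste(pmin(df$gene_x, df$gene_y),
               pmax(df$gene_x, df$gene_y),
               sep = "_")
  anyDuplicated(key) > 0
}

mydf <- data.frame(
  gene_x = c("AT1", "AT3", "AT1", "AT1", "AT2"),
  gene_y = c("AT2", "AT4", "AT2", "AT3", "AT1"),
  stringsAsFactors = FALSE
)

# Rows 1, 2 and 4 are the ones kept in the output above.
cleaned <- mydf[c(1, 2, 4), ]

before <- has_unordered_duplicates(mydf)
after  <- has_unordered_duplicates(cleaned)
```

Here before is TRUE and after is FALSE.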

answered Oct 05 '22 01:10

Bryan Shalloway

Tags: string, dataframe, r
