
Keep common rows among groups based on a column in dplyr

My data frame looks like this

df <- data.frame(gene=c("A","B","C","A","B","D"), 
                 origin=rep(c("old","new"),each=3),
                 value=sample(rnorm(10,2),6))

  gene origin     value
1    A    old 1.5566908
2    B    old 1.3000358
3    C    old 0.7668213
4    A    new 2.5274712
5    B    new 2.2434525
6    D    new 2.0758326

I want to find the common genes between the two different groups of origin (old and new)

I want my data to look like this

  gene origin     value
1    A    old 1.5566908
2    B    old 1.3000358
4    A    new 2.5274712
5    B    new 2.2434525

Any help is appreciated. Ideally, I would like to find common rows among groups using multiple columns.

LDT asked Aug 02 '21 13:08




4 Answers

A base R option using ave + subset

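subset(
  df,
  # ave() runs the test once per gene and recycles the result to every row of
  # that gene; because origin is character, the result comes back as
  # "TRUE"/"FALSE" strings, hence the as.logical()
  as.logical(ave(origin, gene, FUN = function(x) all(c("old", "new") %in% x)))
)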

gives

  gene origin     value
1    A    old 0.5994593
2    B    old 4.0449345
4    A    new 3.2478612
5    B    new 0.2673525

ThomasIsCoding answered Oct 22 '22 09:10


You can use split and reduce to get the genes common to every origin group, and then use them in filter.

library(dplyr)
library(purrr)

df %>% filter(gene %in% (split(df$gene, df$origin) %>% reduce(intersect)))

#  gene origin value
#1    A    old 1.271
#2    B    old 2.838
#3    A    new 0.974
#4    B    new 1.375

Or, staying in base R:

subset(df, gene %in% Reduce(intersect, split(df$gene, df$origin)))
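
For the multi-column case mentioned in the question, one possible extension of the same idea is to build a single key from the columns of interest and intersect the keys across groups. This is only a sketch: the data frame `df2`, its `tissue` column, and the `"|"` separator are assumptions made for illustration, and the separator must not occur in the data itself.

```r
# Hypothetical data: gene and tissue together identify a row
df2 <- data.frame(
  gene   = c("A", "B", "C", "A", "B", "D"),
  tissue = c("x", "y", "x", "x", "z", "x"),
  origin = rep(c("old", "new"), each = 3),
  value  = 1:6
)

# Collapse the key columns into one string per row
key <- paste(df2$gene, df2$tissue, sep = "|")

# Keys that occur in every origin group
common <- Reduce(intersect, split(key, df2$origin))

# Keep only the rows whose key is shared by all groups
res <- df2[key %in% common, ]
res
```

Here only the combination A/x appears in both `old` and `new`, so rows 1 and 4 are kept; B is dropped because its tissue differs between the two groups.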

Ronak Shah answered Oct 22 '22 08:10


One possibility could be:

library(dplyr)

df %>%
    group_by(gene) %>%
    # keep a gene's rows only if the gene occurs in both origins
    filter(all(c("old", "new") %in% origin))

  gene  origin value
  <chr> <chr>  <dbl>
1 A     old    1.63 
2 B     old    0.904
3 A     new    2.18 
4 B     new    1.24 
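
If the group labels are not known in advance, a variant of the same approach could compare the number of distinct origins per gene with the total number of origins. The following is a minimal sketch under that assumption; `df3` and its third origin `"mid"` are made up for illustration.

```r
library(dplyr)

# Hypothetical data with three origin groups
df3 <- data.frame(
  gene   = c("A", "B", "C", "A", "B", "D", "A", "C"),
  origin = c(rep("old", 3), rep("new", 3), rep("mid", 2)),
  value  = 1:8
)

# Total number of distinct groups in the data
n_groups <- n_distinct(df3$origin)

res <- df3 %>%
  group_by(gene) %>%
  # keep genes that are present in every origin group
  filter(n_distinct(origin) == n_groups) %>%
  ungroup()
res
```

In this example only gene A occurs in all three groups, so its three rows are the ones retained.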

tmfmnk answered Oct 22 '22 09:10


I would keep every row whose gene is duplicated, scanning from both the first and the last occurrence.

library(tidyverse)

df %>% filter(
        duplicated(gene, fromLast = TRUE) | duplicated(gene, fromLast = FALSE)
)
  gene origin    value
1    A    old 2.665606
2    B    old 1.565466
3    A    new 4.025450
4    B    new 2.647110

Note: I can't replicate your data as you didn't provide a seed!
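
One caveat worth checking: the duplicated() approach appears to rely on each gene occurring at most once per origin. The sketch below uses a made-up data frame `df4` in which gene A is repeated within `old` and is absent from `new`, and compares the duplicate-based mask with an intersection-based one.

```r
# Hypothetical data: A appears twice in "old" and never in "new"
df4 <- data.frame(
  gene   = c("A", "A", "B", "C", "B"),
  origin = c("old", "old", "old", "new", "new")
)

# Duplicate-based mask: flags any gene that occurs more than once
dup <- duplicated(df4$gene) | duplicated(df4$gene, fromLast = TRUE)

# Intersection-based mask: flags genes present in every origin group
both <- df4$gene %in% Reduce(intersect, split(df4$gene, df4$origin))

df4[dup, ]   # still contains the two A rows
df4[both, ]  # contains only the B rows
```

With data shaped like the question's, where a gene appears once per origin, the two masks agree.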

Serkan answered Oct 22 '22 10:10