
Subset a column in a data frame based on another data frame/list

Tags: r, apply, subset

I have a data frame, table1, with 6 columns and 8083 rows. Its head is shown below:

|gene ID        |   prom_65|   prom_66|  amast_69|  amast_70|   p_value|
|:--------------|---------:|---------:|---------:|---------:|---------:|
|LdBPK_321470.1 |   24.7361|   25.2550|   31.2974|   45.4209| 0.2997430|
|LdBPK_251900.1 |  107.3580|  112.9870|   77.4182|   86.3211| 0.0367792|
|LdBPK_331430.1 |   72.0639|   86.1486|   68.5747|   77.8383| 0.2469355|
|LdBPK_100640.1 |   43.8766|   53.4004|   34.0255|   38.4038| 0.1299948|
|LdBPK_330360.1 | 2382.8700| 1871.9300| 2013.4200| 2482.0600| 0.8466225|
|LdBPK_090870.1 |   49.6488|   53.7134|   59.1175|   66.0931| 0.0843242|

I have another data frame, accessions40, which is a list of 510 gene IDs. It is a subset of the first column of table1, i.e. all 510 of its values are contained in the first column of table1 (8083 values). The head of accessions40 is shown below:

|V1             |
|:--------------|
|LdBPK_330360.1 |
|LdBPK_283000.1 |
|LdBPK_360210.1 |
|LdBPK_261550.1 |
|LdBPK_367320.1 |
|LdBPK_361420.1 |

I want to produce a new table2 that contains, in its first column (gene ID), only the values present in accessions40, along with the corresponding values from the other five columns of table1. In other words, I want to subset the rows of table1 based on the values in accessions40.

BCArg asked Aug 09 '16 12:08



2 Answers

We can use %in% to get a logical vector and subset the rows of table1 based on it:

    subset(table1, gene_ID %in% accessions40$V1)
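To make the %in% approach concrete, here is a minimal sketch on toy data. The values are copied from the head of table1 in the question, only two of the numeric columns are kept for brevity, and the second ID in accessions40 is chosen purely for illustration. It assumes the ID column is named gene_ID, as in the answer; if the column is literally named "gene ID" (with a space), as in the printed header, it would have to be referenced as table1[["gene ID"]] instead.

```r
# Toy version of table1 (values taken from the head shown in the question).
# Assumption: the ID column is called gene_ID.
table1 <- data.frame(
  gene_ID = c("LdBPK_321470.1", "LdBPK_251900.1", "LdBPK_331430.1",
              "LdBPK_100640.1", "LdBPK_330360.1", "LdBPK_090870.1"),
  prom_65 = c(24.7361, 107.3580, 72.0639, 43.8766, 2382.8700, 49.6488),
  p_value = c(0.2997430, 0.0367792, 0.2469355, 0.1299948, 0.8466225, 0.0843242),
  stringsAsFactors = FALSE
)

# Toy version of accessions40; the second ID is picked only for illustration.
accessions40 <- data.frame(
  V1 = c("LdBPK_330360.1", "LdBPK_251900.1"),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row of table1: is this gene_ID listed in accessions40?
keep <- table1$gene_ID %in% accessions40$V1

# Keep only the matching rows, with all of their columns.
table2 <- subset(table1, gene_ID %in% accessions40$V1)
print(table2)
```

The rows of table2 come back in the order they appear in table1, not in the order of accessions40.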

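Another option is data.table, where %chin% is a faster version of %in% for character vectors (note that setDT converts table1 to a data.table by reference):

    library(data.table)
    setDT(table1)[gene_ID %chin% accessions40$V1]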

Or use filter from dplyr:

    library(dplyr)
    table1 %>%
      filter(gene_ID %in% accessions40$V1)
akrun answered Oct 25 '22 21:10


There are many ways to do this. One is to keep the rows of table1 whose gene_ID is present in the V1 column of accessions40:

    table1[table1$gene_ID %in% accessions40$V1, ]

Or you can use match, which returns the rows in the order of accessions40$V1:

    table1[match(accessions40$V1, table1$gene_ID), ]
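The two approaches are not interchangeable in every case, which a small sketch can show. The data frame and IDs below are hypothetical stand-ins, not the question's data, and one ID is deliberately absent from the table to show what match does with it. In the question all 510 IDs are said to be present in table1, so that case should not arise there.

```r
# Hypothetical stand-in for table1 (IDs and values are made up).
table1 <- data.frame(
  gene_ID = c("g1", "g2", "g3", "g4"),
  value   = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Hypothetical lookup IDs; "g9" does not occur in table1.
ids <- c("g4", "g2", "g9")

# %in%: rows come back in table1 order; missing IDs are silently ignored.
by_in <- table1[table1$gene_ID %in% ids, ]

# match: rows come back in the order of ids; a missing ID yields an all-NA row.
by_match <- table1[match(ids, table1$gene_ID), ]

# Dropping the unmatched positions first avoids the NA row.
idx <- match(ids, table1$gene_ID)
by_match_clean <- table1[idx[!is.na(idx)], ]
```

Note also that match returns only the first matching position, so if gene_ID contained duplicates, %in% would keep all of them while match would keep one row per looked-up ID.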
Ronak Shah answered Oct 25 '22 21:10