I am looking for a fast way to extract a large number of rows from an even larger table. The top of my table is as follows:
> head(dbsnp)
snp gene distance
rs5 rs5 KRIT1 1
rs6 rs6 CYP51A1 1
rs7 rs7 LOC401387 1
rs8 rs8 CDK6 1
rs9 rs9 CDK6 1
rs10 rs10 CDK6 1
And the dimensions:
> dim(dbsnp)
[1] 11934948 3
I want to select the rows that have the rownames contained in a list:
> head(features)
[1] "rs1367830" "rs5915027" "rs2060113" "rs1594503" "rs1116848" "rs1835693"
> length(features)
[1] 915635
Not surprisingly, the straightforward way of doing this temptable = dbsnp[features,]
takes quite a long time.
I've been looking into ways to do this through the sqldf package in R. I thought that that might be faster. Unfortunately, I can't figure out how to select rows with certain rownames in SQL.
Thanks.
In this article, we will learn how to select columns and rows from a data frame in R. We start by selecting a specific column. Similar to lists, we can use the double bracket [ []] operator to select a column. This will return a vector data type. If we want to select a column and return a data frame, we can use the single bracket notation.
In order to get the multiple rows by name use the vector with the values, you wanted to return. Let’s see some examples of how to select rows by conditions in R, for example, conditions include equal, not equal. And also some examples to get rows based on multiple conditions. To get rows based on column value use %in% operator. 3.
The following code shows how to select rows where the value in a certain column belongs to a list of values: #select rows where team is equal to 'A' or 'C' df [df$team %in% c ('A', 'C'), ]
If we want to select a column and return a data frame, we can use the single bracket notation. We can also pass a vector of positions to select multiple columns. Since a data frame is a super powered matrix, R also let's us use matrix selection notation.
The data.table
solution:
library(data.table)
dbsnp <- structure(list(snp = c("rs5", "rs6", "rs7", "rs8", "rs9", "rs10"
), gene = c("KRIT1", "CYP51A1", "LOC401387", "CDK6", "CDK6",
"CDK6"), distance = c(1L, 1L, 1L, 1L, 1L, 1L)), .Names = c("snp",
"gene", "distance"), class = "data.frame", row.names = c("rs5",
"rs6", "rs7", "rs8", "rs9", "rs10"))
DT <- data.table(dbsnp, key='snp')
features <- c('rs5', 'rs7', 'rs9')
DT[features]
snp gene distance
1: rs5 KRIT1 1
2: rs7 LOC401387 1
3: rs9 CDK6 1
Using sqldf
you will need rownames = TRUE
then you can query on the rownames using row_names
:
library(sqldf)
## input
test<-read.table(header=T,text=" snp gene distance
rs5 rs5 KRIT1 1
rs6 rs6 CYP51A1 1
rs7 rs7 LOC401387 1
rs8 rs8 CDK6 1
rs9 rs9 CDK6 1
rs10 rs10 CDK6 1
")
features<-c("rs5","rs7","rs10")
## calculate
inVar <- toString(shQuote(features, type = "csh")) # 'rs5','rs7','rs10'
fn$sqldf("SELECT * FROM test t
WHERE t.row_names IN ($inVar)"
, row.names = TRUE)
## result
# snp gene distance
#rs5 rs5 KRIT1 1
#rs7 rs7 LOC401387 1
#rs10 rs10 CDK6 1
UPDATE: Alternately if fet
is a data frame whose features
column contains the required items to find:
fet <- data.frame(features)
sqldf("SELECT t.* FROM test t
WHERE t.row_names IN (SELECT features FROM fet)"
, row.names = TRUE)
Also if the data were sufficiently large we could speed it up using indexes. See the sqldf home page for this and other details.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With