I have a big CSV file of doubles (10 million rows by 500 columns) and I only want to read in a few thousand of those rows (at various locations between 1 and 10 million), defined by a binary vector V of length 10 million, which takes the value 0 if I don't want to read the row and 1 if I do. How do I get the io function fread from the data.table package to do this? I ask because fread is so much faster than all other io approaches I have tried.
The best solution to this question, Reading specific rows of large matrix data file, gives the following:

read.csv( pipe( paste0("sed -n '" , paste0( c( 1 , which( V == 1 ) + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , header=TRUE)

where C:/Data/target.csv is the large CSV file and V is the vector of 0s and 1s.
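For concreteness: with V equal to 1 at rows 1, 5 and 29, the command this builds would be the following (line 1 is the header, so each selected data row sits one line below its row number):

sed -n '1p; 2p; 6p; 30p' C:/Data/target.csv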
However, I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even if V will only be equal to 1 for a small subset of the total number of rows.
Thus, since fread on the whole matrix dominates the above solution, how do I combine fread (and specifically fread) with row sampling? This is not a duplicate because it is only about the function fread.
Here's my problem setup:
#create csv
csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
#my csv has a header:
colnames(csv) <- LETTERS[1:5]
#save csv
write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
#create vector of 0s and 1s that I want to read the CSV from
read_vec <- rep(0,50)
read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
#the following is the effect that I want, but I want an efficient approach to it:
csv <- read.csv("/home/user/test_csv.csv") #inefficient!
csv <- csv[which(read_vec==1),] #inefficient!
#the alternative approach, too slow when scaled up!
csv <- fread(cmd=paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" ), header=TRUE) #fread takes the shell command via cmd=, not a pipe() connection
#the fastest approach so far, but still not optimal because it reads every row
require(data.table)
csv <- data.matrix(fread('/home/user/test_csv.csv'))
csv <- csv[which(read_vec==1),]
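Note that fread itself exposes skip= and nrows= arguments that read one contiguous block of rows, which looks like the right building block; a minimal sketch on the toy file above:

#fread can start at an arbitrary line and read a fixed number of rows,
#but only one contiguous block per call
block <- fread("/home/user/test_csv.csv", skip=5, nrows=1) #skips header + data rows 1-4, reads data row 5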
This approach takes a vector v (corresponding to your read_vec), identifies the contiguous sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the results together.

If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
# create sample dataset
set.seed(1)
m <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)
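# rle() encodes v as runs: lengths gives each run's size, values says whether the run is wanted
# e.g. rle(c(T,T,F,T)) has lengths c(2,1,1) and values c(T,F,T)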
runs <- rle(v)
# starting row of each TRUE run: cumulative run lengths give the run boundaries
idx <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of each block, length = number of rows in it (compare to s)
indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])
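# for the s above, indx has blocks starting at 1, 53, 65, 77, 90, 100, 350, 5000
# with lengths 50, 1, 1, 1, 1, 101, 151, 1001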
library(data.table)
# read each block with one fread() call; skip= jumps past the header and any earlier rows,
# and apply() returns a list of data.tables here, so do.call(rbind, ...) stacks them
result <- do.call(rbind, apply(indx, 1, function(x) fread("test.csv", nrows=x[2], skip=x[1])))
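One wrinkle: because every call skips past the header line, the blocks come back with fread's default column names (V1, V2, ...). A small sketch to restore the real names afterwards, assuming the header sits on the first line of test.csv:

hdr <- names(fread("test.csv", nrows=0)) #nrows=0 reads just the header
setnames(result, hdr)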