Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to read specific rows of CSV file with fread function

I have a big CSV file of doubles (10 million by 500) and I only want to read in a few thousand rows of this file (at various locations between 1 and 10 million), defined by a binary vector V of length 10 million, which assumes value 0 if I don't want to read the row and 1 if I do want to read the row.

How do I get the io function fread from the data.table package to do this? I ask because fread is so so fast compared to all other io approaches.

The best solution this question, Reading specific rows of large matrix data file, gives the following solution:

read.csv( pipe( paste0("sed -n '" , paste0( c( 1 , which( V == 1 ) + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , head=TRUE)

where C:/Data/target.csv is the large CSV file and V is the vector of 0 or 1.

However I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even if the V will only be equal to 1 for a small subset of the total number of rows.

Thus, since fread on the whole matrix will dominate the above solution, how do I combine fread (and specifically fread) with row sampling?

This is not a duplicate because it is only about the function fread.

Here's my problem setup:

 #create csv
 csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
 #my csv has a header:
 colnames(csv) <- LETTERS[1:5]
 #save csv
 write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
 #create vector of 0s and 1s that I want to read the CSV from
 read_vec <- rep(0,50)
 read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
 #the following is the effect that I want, but I want an efficient approach to it:
 csv <- read.csv("/home/user/test_csv.csv") #inefficient!
 csv <- csv[which(read_vec==1),] #inefficient!
 #the alternative approach, too slow when scaled up!
 csv <- fread( pipe( paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" , collapse = "" ) ) , head=TRUE)
 #the fastest approach yet still not optimal because it needs to read all rows
 require(data.table)
 csv <- data.matrix(fread('/home/user/test_csv.csv'))
 csv <- csv[which(read_vec==1),] 
like image 282
user2763361 Avatar asked Feb 15 '14 14:02

user2763361


People also ask

Can fread read CSV?

The benefit of reading the csv file with fread function is that there will be a variable added to the original csv file which contains the id as integers starting from 1 to the length of column values.

What is fread function in R?

Use the fread() Function in R As mentioned above, fread() is a faster way to read files, particularly large files. The good thing about this function is that it automatically detects column types and separators, which can also be specified manually.

How many bytes fread read?

fread reads exactly as many bytes as you tell it to (unless there are fewer bytes than that remaining in the file.) The middle two parameters of fread are the size of the object you want to read and the number of objects of that size you want to read.

Can fread read Xlsx?

The short answer is that you can't read xlsx files with fread , xlsx files have a very different format to the text files that fread is designed for.


1 Answers

This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.

If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.

# create sample dataset
set.seed(1)
m   <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, T means read this row (equivalent to your read_vec)
v <- (1:1e4 %in% s)

seq  <- rle(v)
idx  <- c(0, cumsum(seq$lengths))[which(seq$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=seq$length[which(seq$values)])

library(data.table)
result <- do.call(rbind,apply(indx,1, function(x) return(fread("test.csv",nrows=x[2],skip=x[1]))))
like image 73
jlhoward Avatar answered Nov 01 '22 14:11

jlhoward