How to read specific rows of CSV file with fread function

Tags:

performance

io

r

csv

I have a big CSV file of doubles (10 million rows by 500 columns) and I only want to read a few thousand rows of it (at various positions between 1 and 10 million). The rows are defined by a binary vector V of length 10 million, which takes the value 0 if I don't want to read the row and 1 if I do.

How do I get the I/O function fread from the data.table package to do this? I ask because fread is so fast compared to all other I/O approaches.

The best answer to the question "Reading specific rows of large matrix data file" gives the following solution:

read.csv( pipe( paste0("sed -n '" , paste0( c( 1 , which( V == 1 ) + 1 ) , collapse = "p; " ) , "p' C:/Data/target.csv" , collapse = "" ) ) , head=TRUE)

where C:/Data/target.csv is the large CSV file and V is the vector of 0 or 1.
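
To make the sed call concrete, here is a small sketch of the command string that paste0 builds. The six-element V below is a toy vector chosen for illustration, not the real 10-million-element one, and sed_cmd and lines_to_print are names introduced only for this example.

```r
# Toy selection vector (an assumption for illustration; the real V has 10 million entries)
V <- c(1, 0, 0, 0, 1, 0)

# File line 1 is the header, so data row i sits on file line i + 1
lines_to_print <- c(1, which(V == 1) + 1)

# One "Np" print command per wanted line, joined with "; "
sed_cmd <- paste0("sed -n '",
                  paste0(lines_to_print, collapse = "p; "),
                  "p' C:/Data/target.csv")
print(sed_cmd)
```

The string contains one print command per selected row plus one for the header, so with a few thousand selected rows the sed script itself has a few thousand entries.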

However, I have noticed that this is orders of magnitude slower than simply using fread on the entire matrix, even when V equals 1 for only a small subset of the rows.

Thus, since fread on the whole matrix beats the above solution, how do I combine fread (and specifically fread) with row sampling?

This is not a duplicate because it is only about the function fread.

Here's my problem setup:

 library(data.table)
 #create csv
 csv <- do.call(rbind,lapply(1:50,function(i) { rnorm(5) }))
 #my csv has a header:
 colnames(csv) <- LETTERS[1:5]
 #save csv
 write.csv(csv,"/home/user/test_csv.csv",quote=FALSE,row.names=FALSE)
 #create vector of 0s and 1s that I want to read the CSV from
 read_vec <- rep(0,50)
 read_vec[c(1,5,29)] <- 1 #I only want to read in 1st,5th,29th rows
 #the following is the effect that I want, but I want an efficient approach to it:
 csv <- read.csv("/home/user/test_csv.csv") #inefficient!
 csv <- csv[which(read_vec==1),] #inefficient!
 #the alternative approach, too slow when scaled up!
 #(fread does not accept a connection such as pipe(), so the shell command goes in cmd=)
 csv <- fread( cmd = paste0("sed -n '" , paste0( c( 1 , which( read_vec == 1 ) + 1 ) , collapse = "p; " ) , "p' /home/user/test_csv.csv" , collapse = "" ) , header=TRUE)
 #the fastest approach yet still not optimal because it needs to read all rows
 csv <- data.matrix(fread('/home/user/test_csv.csv'))
 csv <- csv[which(read_vec==1),]

282

asked Feb 15 '14 14:02

user2763361

1 Answer

This approach takes a vector v (corresponding to your read_vec), identifies sequences of rows to read, feeds those to sequential calls to fread(...), and rbinds the result together.

If the rows you want are randomly distributed throughout the file, this may not be faster. However, if the rows are in blocks (e.g., c(1:50, 55, 70, 100:500, 700:1500)) then there will be few calls to fread(...) and you may see a significant improvement.
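
As a small illustration of the block-finding step, here is how rle turns a logical vector into block starts and sizes. The seven-element v is a toy vector made up for this example, and blocks, starts and sizes are illustrative names only.

```r
# Toy selection: rows 2, 3 and 6 are wanted
v <- c(FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# Run-length encoding: values F,T,F,T,F with lengths 1,2,2,1,1
runs <- rle(v)
keep <- which(runs$values)              # positions of the TRUE runs

# A run starts one row after the total length of all runs before it
starts <- c(0, cumsum(runs$lengths))[keep] + 1
sizes  <- runs$lengths[keep]

blocks <- data.frame(start = starts, length = sizes)
print(blocks)
```

Each row of blocks corresponds to one fread(...) call, which is why a selection made of long contiguous runs needs far fewer calls than a scattered one.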

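# create sample dataset
set.seed(1)
m   <- matrix(rnorm(1e5),ncol=10)
csv <- data.frame(x=1:1e4,m)
write.csv(csv,"test.csv")   # note: this also writes the row names as a first column
# s: rows we want to read
s <- c(1:50,53, 65,77,90,100:200,350:500, 5000:6000)
# v: logical, TRUE means read this row (equivalent to your read_vec == 1)
v <- (1:1e4 %in% s)

# run-length encode v to find the contiguous blocks of wanted rows
# (named runs rather than seq to avoid masking base::seq)
runs <- rle(v)
idx  <- c(0, cumsum(runs$lengths))[which(runs$values)] + 1
# indx: start = starting row of sequence, length = length of sequence (compare to s)
indx <- data.frame(start=idx, length=runs$lengths[which(runs$values)])

library(data.table)
# skip = start drops the header line plus the (start - 1) data rows before the block;
# header=FALSE keeps the first row of each block from being taken as column names
result <- do.call(rbind,apply(indx,1, function(x) fread("test.csv",nrows=x[2],skip=x[1],header=FALSE)))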
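
One possible variation on this, sketched under assumptions rather than offered as the definitive way to do it: wrap the same idea in a function that reads the header once so the column names survive, and bind the chunks with rbindlist. The name read_rows is hypothetical (it is not part of data.table), and the sketch assumes the file has exactly one header line, no row-name column, and one entry in v per data row.

```r
library(data.table)

# Hypothetical helper (not part of data.table): read only the rows flagged in v.
# Assumes `file` has exactly one header line and v has one entry per data row.
read_rows <- function(file, v) {
  runs   <- rle(as.logical(v))
  keep   <- which(runs$values)
  starts <- c(0, cumsum(runs$lengths))[keep] + 1   # first data row of each block
  sizes  <- runs$lengths[keep]                     # number of rows in each block
  cols   <- names(fread(file, nrows = 0))          # read the header only
  chunks <- Map(function(s, n) {
    # skip = s drops the header line plus the s - 1 data rows before the block
    fread(file, skip = s, nrows = n, header = FALSE, col.names = cols)
  }, starts, sizes)
  rbindlist(chunks)
}

# Small self-check against reading the whole file and subsetting
tmp <- tempfile(fileext = ".csv")
set.seed(1)
write.csv(matrix(rnorm(250), ncol = 5, dimnames = list(NULL, LETTERS[1:5])),
          tmp, quote = FALSE, row.names = FALSE)
read_vec <- rep(0, 50)
read_vec[c(1, 5, 29)] <- 1

subset_fast <- read_rows(tmp, read_vec)
subset_full <- fread(tmp)[which(read_vec == 1)]
print(all.equal(as.matrix(subset_fast), as.matrix(subset_full)))
```

The same caveat applies as above: each block costs one fread(...) call, so this is only likely to help when the wanted rows cluster into a modest number of blocks.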

answered Nov 01 '22 14:11

jlhoward
