
R: Is there a way to subset a file while reading

Tags:

r

csv

I have a huge .csv file (about 1.4 GB), and reading it with read.csv takes a long time. The file contains several variables, and all I want is to extract the rows for a few of the variables listed in one column.

For example, suppose ABC.csv is my file and it looks something like this:

   ABC.csv
     Date       Variables   Val
   2017-11-01   X           23  
   2017-11-01   A           2
   2017-11-01   B           0.5
   ............................
   2017-11-02   X           20
   2017-11-02   C           40
   ............................
   2017-11-03   D           33
   2017-11-03   X           22   
   ............................
   ............................

So, here the variable of interest is X. While reading the file, I want the Variables column to be scanned so that only the rows containing the string X in that column are read. My new data frame would then look something like this:

 > df 
  Date    Variables   Val
2017-11-01    X       23
2017-11-02    X       20
.........................
......................... 

Any help will be appreciated. Thank you in advance.

Shreta Ghimire Avatar asked Nov 17 '25 22:11



2 Answers

Check out the LaF package. It lets you read very large text files in blocks, so you don't have to read the entire file into memory.

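library(LaF)

# header = TRUE takes the column names from the first line of the file,
# so the Variables column can be referenced by name below
data_model <- detect_dm_csv("yourFile.csv", header = TRUE) # detects the file structure
dat <- laf_open(data_model) # opens connection to the file

block_size <- 1000
# one block start per 1000 rows, covering the whole file
block_list <- lapply(seq(1, nrow(dat), block_size), function(row_num){
    goto(dat, row_num) # jump to the first row of this block
    data_block <- next_block(dat, nrows = block_size) # reads a block of up to 1000 rows
    data_block[data_block$Variables == "X", ] # keep only the rows of interest
})
your_df <- do.call("rbind", block_list)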

Admittedly, the package sometimes feels a bit clunky, and in some situations I had to find small hacks to get my results (you might have to adapt my solution to your data). Nevertheless, I found it an immensely useful solution for dealing with files that exceeded my RAM.
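If installing another package is not an option, the same block-wise idea can be sketched in base R. This is not the LaF approach above, only a minimal illustration of reading in chunks and keeping the matching rows. It assumes a plain comma-separated file with an unquoted header line; the helper name read_csv_subset, the chunk size, and the small sample file are made up for the example.

```r
# Hypothetical helper: read a CSV in chunks and keep only matching rows.
read_csv_subset <- function(path, column, value, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # Assumes a simple, unquoted, comma-separated header line.
  col_names <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]

  kept <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)  # next chunk of raw lines
    if (length(lines) == 0L) break           # end of file reached
    chunk <- read.csv(text = lines, header = FALSE, col.names = col_names,
                      stringsAsFactors = FALSE)
    # which() drops rows where the comparison is NA
    kept[[length(kept) + 1L]] <- chunk[which(chunk[[column]] == value), ,
                                       drop = FALSE]
  }

  out <- do.call(rbind, kept)
  if (!is.null(out)) rownames(out) <- NULL
  out
}

# Small sample file mirroring the layout in the question (illustration only).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Variables,Val",
  "2017-11-01,X,23",
  "2017-11-01,A,2",
  "2017-11-01,B,0.5",
  "2017-11-02,X,20",
  "2017-11-02,C,40",
  "2017-11-03,D,33",
  "2017-11-03,X,22"
), tmp)

df <- read_csv_subset(tmp, "Variables", "X", chunk_size = 3L)
```

Only one chunk plus the rows kept so far are in memory at any time, so the chunk size is a trade-off between memory use and the number of read.csv calls.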

tobiasegli_te Avatar answered Nov 19 '25 13:11



Just wondering whether this works. It worked for my code, but I am not sure whether it first reads the entire data set and then subsets it, or only reads the part of the file where Variables == 'X'.

temp <- fread('dat.csv')[Variables == 'X']
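On the open question: in the line above, fread('dat.csv') is evaluated first and returns the full table, and only then is the [Variables == 'X'] subset applied, so the whole file does get read. One way to filter before the data reaches R is fread's cmd argument, which parses the output of a shell command. The sketch below assumes awk is available on the system (it usually is on Linux and macOS, not on a stock Windows install), that the file is comma-separated, and that Variables is the second column. The sample file is made up for the example.

```r
library(data.table)

# Small sample file mirroring the layout in the question (illustration only).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Date,Variables,Val",
  "2017-11-01,X,23",
  "2017-11-01,A,2",
  "2017-11-01,B,0.5",
  "2017-11-02,X,20",
  "2017-11-02,C,40",
  "2017-11-03,D,33",
  "2017-11-03,X,22"
), path)

# awk keeps the header (NR == 1) plus every row whose 2nd field is exactly "X";
# fread then parses only what awk prints.
cmd <- paste("awk -F, 'NR == 1 || $2 == \"X\"'", shQuote(path))
temp <- fread(cmd = cmd)
```

The file is still scanned once from disk, but the non-matching rows never get parsed or stored by R.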
89_Simple Avatar answered Nov 19 '25 12:11



