
Reading in chunks at a time using fread in package data.table

I'm trying to input a large tab-delimited file (around 2GB) using the fread function in package data.table. However, because it's so large, it doesn't fit completely in memory. I tried to input it in chunks by using the skip and nrow arguments such as:

chunk.size = 1e6
done = FALSE
chunk = 1
while(!done)
{
    temp = fread("myfile.txt",skip=(chunk-1)*chunk.size,nrow=chunk.size-1)
    #do something to temp
    chunk = chunk + 1
    if(nrow(temp)<2) done = TRUE
}

In the case above, I'm reading in 1 million rows at a time, performing a calculation on them, and then getting the next million, etc. The problem with this code is that after every chunk is retrieved, fread needs to start scanning the file from the very beginning since after every loop iteration, skip increases by a million. As a result, after every chunk, fread takes longer and longer to actually get to the next chunk making this very inefficient.

Is there a way to tell fread to pause every say 1 million lines, and then continue reading from that point on without having to restart at the beginning? Any solutions, or should this be a new feature request?

asked Nov 10 '13 by FBC

4 Answers

You should use the LaF package. It introduces a kind of pointer into your data, which avoids the behaviour (annoying for very large files) of re-reading the whole file. As far as I understand it, fread() in the data.table package has to scan from the start of the file on every call, which takes time for multi-gigabyte data. With LaF's pointer you can jump to whatever line you want, read a chunk of data, apply your function to it, and then move on to the next chunk. On my small PC I ran through a 25 GB csv file in steps of 10e6 lines and extracted the roughly 5e6 observations I needed; each 10e6-line chunk took about 30 seconds.

UPDATE:

library('LaF')
huge_file <- 'C:/datasets/protein.links.v9.1.txt'

# First detect a data model for your file:
model <- detect_dm_csv(huge_file, sep=" ", header=TRUE)

Then create a connection to your file using the model:

df.laf <- laf_open(model) 

Once that is done you can do all sorts of things without needing to know the size of the file, unlike with the data.table package. For instance, place the pointer at line 100e6 and read 1e6 lines of data from there:

goto(df.laf, 100e6)
data <- next_block(df.laf, nrows = 1e6)

Now data contains 1e6 lines of your CSV file (starting from line 100e6).

You can read chunks of data (with a size that depends on your memory) and keep only what you need. For example, the huge_file in my example points to a file with all known protein sequences and is over 27 GB, far too big for my PC. To get only the human sequences I filtered on the organism id, which is 9606 for human and appears at the start of the variable protein1. A quick and dirty way is to put it into a simple for-loop and read one chunk of data at a time:

library('dplyr')
library('stringr')

res <- df.laf[1, ][0, ]   # empty data frame with the right columns
for(i in 1:10){
  raw <- next_block(df.laf, nrows = 100e6) %>%
    filter(str_detect(protein1, "^9606\\."))
  res <- rbind(res, raw)
}

Now res contains the filtered human data. Better still, for more complex operations such as calculating on the data on-the-fly, the function process_blocks() takes a function as an argument, so inside that function you can do whatever you want with each block of data; see the sketch below and read the documentation.
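For instance, a minimal sketch of that callback style (the count_human helper and n_human name are just illustrative; the callback contract assumed here, per the LaF documentation, is that the function receives the current block plus the previous result, with a final empty block marking the end of the file):

# Count the human (9606) rows block by block instead of holding them all in memory.
count_human <- function(block, result) {
  if (is.null(result)) result <- 0           # first call: initialise the accumulator
  if (nrow(block) == 0) return(result)       # empty block marks the end of the file
  result + sum(grepl("^9606\\.", block$protein1))
}

n_human <- process_blocks(df.laf, count_human, nrows = 1e6)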

answered Oct 01 '22 by user3375672


You can use readr's read_*_chunked to read in data and e.g. filter it chunkwise. See here and here for an example:

# Cars with 3 gears
f <- function(x, pos) subset(x, gear == 3)
read_csv_chunked(readr_example("mtcars.csv"), DataFrameCallback$new(f),
                 chunk_size = 5)
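
For the asker's tab-delimited file, the analogous readr function is read_tsv_chunked. A sketch under the assumption that "myfile.txt" is the file from the question and that the body of f is replaced by your own per-chunk calculation:

library(readr)

f <- function(x, pos) {
  # x is the current chunk; pos is its starting position in the file.
  # Do something to each chunk of up to 1e6 rows here and return what you want to keep.
  x
}

res <- read_tsv_chunked("myfile.txt", DataFrameCallback$new(f), chunk_size = 1e6)
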
answered Oct 01 '22 by Rentrop


A related option is the chunked package. Here is an example with a 3.5 GB text file:

library(chunked)
library(tidyverse)

# I want to look at the daily page views of Wikipedia articles
# before 2015... I can get zipped log files
# from here: https://dumps.wikimedia.org/other/pagecounts-ez/merged/2012/2012-12/
# I get bz file, unzip to get this: 

my_file <- 'pagecounts-2012-12-14/pagecounts-2012-12-14'

# How big is my file?
print(paste(round(file.info(my_file)$size  / 2^30,3), 'gigabytes'))
# [1] "3.493 gigabytes" too big to open in Notepad++ !
# But can read with 010 Editor

# look at the top of the file 
readLines(my_file, n = 100)

# to find where the content starts, vary the skip value, 
read.table(my_file, nrows = 10, skip = 25)

This is where we start working on the file in chunks; we can use most dplyr verbs in the usual way:

# Let the chunked pkg work its magic! We only want the lines containing 
# "Gun_control". The main challenge here was identifying the column
# header
df <- 
read_chunkwise(my_file, 
               chunk_size=5000,
               skip = 30,
               format = "table",
               header = TRUE) %>% 
  filter(stringr::str_detect(De.mw.De.5.J3M1O1, "Gun_control"))

# this line does the evaluation, 
# and takes a few moments...
system.time(out <- collect(df))

And here we can work on the output as usual, since it's much smaller than the input file:

# clean up the output to separate into cols, 
# and get the number of page views as a numeric
out_df <- 
out %>% 
  separate(De.mw.De.5.J3M1O1, 
           into = str_glue("V{1:4}"),
           sep = " ") %>% 
  mutate(V3 = as.numeric(V3))

 head(out_df)
    V1                                                        V2   V3
1 en.z                                               Gun_control 7961
2 en.z Category:Gun_control_advocacy_groups_in_the_United_States 1396
3 en.z          Gun_control_policy_of_the_Clinton_Administration  223
4 en.z                            Category:Gun_control_advocates   80
5 en.z                         Gun_control_in_the_United_Kingdom   68
6 en.z                                    Gun_control_in_america   59
                                                                                 V4
1 A34B55C32D38E32F32G32H20I22J9K12L10M9N15O34P38Q37R83S197T1207U1643V1523W1528X1319
2                                     B1C5D2E1F3H3J1O1P3Q9R9S23T197U327V245W271X295
3                                     A3B2C4D2E3F3G1J3K1L1O3P2Q2R4S2T24U39V41W43X40
4                                                            D2H1M1S4T8U22V10W18X14
5                                                             B1C1S1T11U12V13W16X13
6                                                         B1H1M1N2P1S1T6U5V17W12X12

answered Oct 01 '22 by Ben


fread() can definitely help you read the data in chunks.

The issue in your code is that nrow should be kept constant at the chunk size, while only the skip parameter changes on each iteration of the loop.

Something like this is what I wrote for my data:

# Read 21 chunks of 10,000 rows each; skip advances by the chunk size every time.
data <- list()

for (i in 0:20){
  data[[i + 1]] <- fread("my_data.csv", nrows = 10000, select = c(1, 2:100),
                         skip = 10000 * i)
}

And you may insert the following code inside your loop:

start_time <- Sys.time()

##### your per-chunk calculation here

end_time <- Sys.time()
end_time - start_time

to check that each iteration takes a similar amount of time on average.

Then you could use another loop to combine the chunks by rows with the default rbind function in R.

The sample code could be something like this:

new_data <- data[[1]]

for (i in 1:20){
  new_data <- rbind(new_data, data[[i + 1]], use.names = FALSE)
}

to unify into a large dataset.
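As an aside beyond the original answer, data.table's rbindlist() combines the whole list in one call and avoids re-copying new_data on every iteration; it should produce the same result here:

# Combine all chunks at once instead of growing new_data inside a loop.
new_data <- rbindlist(data, use.names = FALSE)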

Hope my answer helps with your question.

I loaded an 18 GB dataset with 2k+ columns and 200k rows in about 8 minutes using this method.

answered Oct 01 '22 by Michael Miau