 

Long lag time importing large .csv files in R with the header in the second row

Tags:

r

csv

bigdata

I am working on developing an application that ingests data from .csv files and then performs some calculations on it. The challenge is that the .csv files can be very large. I have reviewed a number of posts here discussing the import of large .csv files using various functions and libraries. Some benchmark examples are below:

### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###

system.time(read.csv('../data/2008.csv', header = T))
#   user  system elapsed 
# 88.301   2.416  90.716

library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ',')) 
#   user  system elapsed 
#  4.740   0.048   4.785

library(bigmemory)
system.time(read.big.matrix('../data/2008.csv', header = T))
#   user  system elapsed 
# 59.544   0.764  60.308

library(ff)
system.time(read.csv.ffdf(file = '../data/2008.csv', header = T))
#   user  system elapsed 
# 60.028   1.280  61.335 

library(sqldf)
system.time(read.csv.sql('../data/2008.csv'))
#   user  system elapsed 
# 87.461   3.880  91.447

The challenge I am having is this: the .csv in question has its headers in the second row, and the first row is filled with useless information. My initial approach, which worked on smaller files (under 5 MB), was to remove the first row and import with the following code:

# Read the whole file into memory, drop the useless first row, then parse
report_query_X_all_content = readLines("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv")
skip_first = report_query_X_all_content[-1]  # drop row 1
report_query_X = read.csv(textConnection(skip_first), header = TRUE, stringsAsFactors = FALSE)

Unfortunately, once the base file passes 70 or 80 MB, the import time seems to increase exponentially. Most of the functions I have been looking at, like fread(), require you to pass in the .csv file directly; as you can see in my implementation, I instead pass skip_first through textConnection() after removing the unwanted row. For 70-80 MB files the lag is disproportionate: I started an import of a 79 MB file nearly 55 minutes ago and it is still running. For context, skip_first occupies about 95 MB of internal memory. My next import is about 785 MB, and eventually this solution will be applied to .csv files as large as 1-4 GB, so I am worried that the textConnection() step is causing a bottleneck. Does anyone have suggestions or recommendations on how to accomplish this with larger data files?
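One idea I have considered but not yet benchmarked (a sketch only; the path is the same placeholder as above) is to cut textConnection() out of the pipeline by writing the trimmed lines to a temporary file, so that fread() reads straight from disk:

library(data.table)

# Read all lines, drop the useless first row, write the rest to a temp file
all_content <- readLines("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv")
tmp_csv <- tempfile(fileext = ".csv")
writeLines(all_content[-1], tmp_csv)

# fread() now reads from disk rather than from an in-memory text connection
report_query_X <- fread(tmp_csv, header = TRUE, sep = ",")
unlink(tmp_csv)  # clean up the temporary file

This still reads the file into memory once via readLines(), but it avoids parsing through a text connection.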

asked Jul 23 '14 by Nathaniel Payne




1 Answer

Here is the solution that I ended up going with, and it worked nicely:

start_time <- Sys.time()  # start the timer for the big file

library(data.table)  # provides fread()
library(bit64)       # provides the integer64 print method (see the warning below)

report_query_X <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv', skip = 1, sep = ",")

end_time <- Sys.time()              # stop the timer
time_diff <- end_time - start_time  # total import time
# time_diff = 1.068 seconds
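The skip = 1 argument is what deals with the header sitting in the second row: fread() discards the skipped line entirely and auto-detects the header on the next one. A self-contained illustration (made-up data in a temporary file):

library(data.table)

# Hypothetical file: row 1 is junk, row 2 holds the real headers
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("useless information row",
             "id,value",
             "1,3.14",
             "2,2.72"), demo_csv)

fread(demo_csv, skip = 1, sep = ",")
#    id value
# 1:  1  3.14
# 2:  2  2.72

unlink(demo_csv)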

The total time taken for this implementation was 1.068 seconds for a 78.9 MB file, which is excellent; skip with fread() made a huge difference. I did get a warning message when I originally used fread(), noting that:

Warning message:
In fread("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv",  :
  Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.

This is why I ended up installing bit64 with install.packages("bit64") and then loading it with library(bit64).
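As a quick illustration of what the warning is about (hypothetical value): integers beyond 2^53 cannot be represented exactly as doubles, and bit64 supplies the print method that displays integer64 columns correctly:

library(bit64)

x <- as.integer64("9007199254740993")  # 2^53 + 1, not exactly representable as a double
print(x)
# integer64
# [1] 9007199254740993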


Edit: I just tried the same call on a 251 MB file, and the total import time was 1.844106 seconds.

answered Sep 29 '22 by Nathaniel Payne