I am developing an application that ingests data from .csv files and then performs some calculations on it. The challenge is that the .csv files can be very large. I have reviewed a number of posts here discussing the import of large .csv files using various functions & libraries. Some examples are below:
### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###
system.time(read.csv('../data/2008.csv', header = T))
# user system elapsed
# 88.301 2.416 90.716
library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ','))
# user system elapsed
# 4.740 0.048 4.785
library(bigmemory)
system.time(read.big.matrix('../data/2008.csv', header = T))
# user system elapsed
# 59.544 0.764 60.308
library(ff)
system.time(read.csv.ffdf(file = '../data/2008.csv', header = T))
# user system elapsed
# 60.028 1.280 61.335
library(sqldf)
system.time(read.csv.sql('../data/2008.csv'))
# user system elapsed
# 87.461 3.880 91.447
The challenge I am having is this. The .csv in question has headers in the second row, and a first row that is filled with useless information. My initial approach, applied successfully to smaller files (less than 5MB), was to use the following code to strip the first row and then import:
report_query_X_all_content = readLines("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv")  # read the entire file into a character vector
skip_first = report_query_X_all_content[-1]  # drop the useless first row
report_query_X = read.csv(textConnection(skip_first), header = TRUE, stringsAsFactors = FALSE)  # parse the remainder
Unfortunately, once the base file exceeds 70 or 80MB in size, the import time seems to increase exponentially. Most of the functions I have been looking at, like fread(), require you to pass in the .csv directly. As you can see in my implementation, I passed skip_first through textConnection() after removing the unwanted row. The problem is that, for 70 or 80MB files, there is a disproportionate lag: I started one import nearly 55 minutes ago for a 79MB file and it is still running. For context, skip_first occupies about 95MB of internal memory. My next import is about 785MB. Does anyone have suggestions or recommendations on how to accomplish this with larger data files? Eventually, this solution will be applied to .csv files that are as large as 1 - 4GB in size, & I am worried that the textConnection() step is causing a bottleneck.
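For what it's worth, one idea I have not been able to verify yet is streaming the file through a shell command that drops the first line, so the raw text never has to sit in R's memory twice. Something like the sketch below, assuming a data.table version that supports fread()'s cmd argument and a tail utility on the PATH (not a given on Windows without, e.g., Rtools or Git Bash):

library(data.table)
# Untested sketch: let a shell command drop the junk first row and
# stream the rest straight into fread(), avoiding the in-memory copy
# that readLines() + textConnection() creates.
report_query_X <- fread(
  cmd = "tail -n +2 C:/Users/.../report_queryX_XXX-XXX-XXXX.csv",
  header = TRUE, sep = ","
)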
Here is the solution that I ended up going with & which worked nicely:
start_time <- Sys.time()  # Calculate time diff on the big files
library(data.table)  # provides fread()
library(bit64)
report_query_X <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv', skip = 1, sep = ",")
end_time <- Sys.time()  # Calculate time diff on the big files
time_diff <- end_time - start_time  # Calculate the time difference
# time_diff = 1.068 seconds
The total time taken for this implementation was 1.068 seconds for a 78.9MB file, which is excellent. Using skip with fread() made a huge difference. I did get a warning message when I originally used fread(), noting that:
Warning message:
In fread("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv", :
Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.
This is why I ended up installing bit64 with install.packages("bit64") and then loading it with library(bit64).
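As an aside, if you'd rather avoid the bit64 dependency entirely, fread() also takes an integer64 argument; if I'm reading the documentation right, something like the following returns those columns as plain doubles instead:

library(data.table)
# Hedged alternative to loading bit64: have fread() return the large
# integer columns as doubles rather than integer64. This is only safe
# while the values fit within double precision (below 2^53).
report_query_X <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv',
                        skip = 1, sep = ",", integer64 = "double")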
Edit: Note that I just tried using this call on a 251MB file and the total import time was 1.844106 secs.
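For the upcoming 1 - 4GB files, a pattern I plan to try (untested; the column names below are hypothetical placeholders) is reading only the columns the calculations actually need via fread()'s select argument, which should cut both time and memory:

library(data.table)
library(bit64)
start_time <- Sys.time()
# Sketch for the multi-GB files: read only the required columns and
# show a progress bar on long reads. "col_a"/"col_b"/"col_c" stand in
# for the real column names from the header on row 2.
big_report <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv',
                    skip = 1, sep = ",",
                    select = c("col_a", "col_b", "col_c"),
                    showProgress = TRUE)
Sys.time() - start_time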