 

Long lag time importing large .csv files in R with the header in the second row

Tags:

r

csv

bigdata

I am working on developing an application that ingests data from .csv files and then performs some calculations on it. The challenge is that the .csv files can be very large. I have reviewed a number of posts here discussing the import of large .csv files using various functions and libraries. Some benchmark examples are below:

### size of csv file: 689.4MB (7,009,728 rows * 29 columns) ###

system.time(read.csv('../data/2008.csv', header = T))
#   user  system elapsed 
# 88.301   2.416  90.716

library(data.table)
system.time(fread('../data/2008.csv', header = T, sep = ',')) 
#   user  system elapsed 
#  4.740   0.048   4.785

library(bigmemory)
system.time(read.big.matrix('../data/2008.csv', header = T))
#   user  system elapsed 
# 59.544   0.764  60.308

library(ff)
system.time(read.csv.ffdf(file = '../data/2008.csv', header = T))
#   user  system elapsed 
# 60.028   1.280  61.335 

library(sqldf)
system.time(read.csv.sql('../data/2008.csv'))
#   user  system elapsed 
# 87.461   3.880  91.447

The challenge I am having is this: the .csv in question has its headers in the second row, and the first row is filled with useless information. My initial approach, which worked on smaller files (under 5 MB), was to remove the first row and import with the following code:

# Read the whole file into memory, drop the useless first row, then parse
report_query_X_all_content = readLines("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv")
skip_first = report_query_X_all_content[-1]  # drop row 1
report_query_X = read.csv(textConnection(skip_first), header = TRUE, stringsAsFactors = FALSE)

Unfortunately, once the base file passes 70 or 80 MB, the import time seems to increase exponentially. Most of the functions I have been looking at, like fread(), require you to pass in the .csv file directly; as you can see in my implementation, I instead pass skip_first through textConnection() after removing the unwanted row. For 70-80 MB files the lag is disproportionate: I started an import of a 79 MB file nearly 55 minutes ago and it is still running. For context, skip_first occupies about 95 MB of internal memory. My next import is about 785 MB, and eventually this solution will be applied to .csv files as large as 1-4 GB, so I am worried that the textConnection() step is causing a bottleneck. Does anyone have suggestions or recommendations on how to accomplish this with larger data files?
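One idea I have considered but not yet benchmarked (a sketch only; the path is the same placeholder as above) is to cut textConnection() out of the pipeline by writing the trimmed lines to a temporary file, so that fread() reads straight from disk:

library(data.table)

# Read all lines, drop the useless first row, write the rest to a temp file
all_content <- readLines("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv")
tmp_csv <- tempfile(fileext = ".csv")
writeLines(all_content[-1], tmp_csv)

# fread() now reads from disk rather than from an in-memory text connection
report_query_X <- fread(tmp_csv, header = TRUE, sep = ",")
unlink(tmp_csv)  # clean up the temporary file

This still reads the file into memory once via readLines(), but it avoids parsing through a text connection.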

asked Jul 23 '14 by Nathaniel Payne




1 Answer

Here is the solution that I ended up going with, and it worked nicely:

start_time <- Sys.time()  # start the timer for the big file

library(data.table)  # provides fread()
library(bit64)       # provides the integer64 print method (see the warning below)

report_query_X <- fread('C:/Users/.../report_queryX_XXX-XXX-XXXX.csv', skip = 1, sep = ",")

end_time <- Sys.time()              # stop the timer
time_diff <- end_time - start_time  # total import time
# time_diff = 1.068 seconds
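The skip = 1 argument is what deals with the header sitting in the second row: fread() discards the skipped line entirely and auto-detects the header on the next one. A self-contained illustration (made-up data in a temporary file):

library(data.table)

# Hypothetical file: row 1 is junk, row 2 holds the real headers
demo_csv <- tempfile(fileext = ".csv")
writeLines(c("useless information row",
             "id,value",
             "1,3.14",
             "2,2.72"), demo_csv)

fread(demo_csv, skip = 1, sep = ",")
#    id value
# 1:  1  3.14
# 2:  2  2.72

unlink(demo_csv)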

The total time taken for this implementation was 1.068 seconds for a 78.9 MB file, which is excellent; skip with fread() made a huge difference. I did get a warning message when I originally used fread(), noting that:

Warning message:
In fread("C:/Users/.../report_queryX_XXX-XXX-XXXX.csv",  :
  Some columns have been read as type 'integer64' but package bit64 isn't loaded. Those columns will display as strange looking floating point data. There is no need to reload the data. Just require(bit64) to obtain the integer64 print method and print the data again.

This is why I ended up installing bit64 with install.packages("bit64") and then loading it with library(bit64).
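As a quick illustration of what the warning is about (hypothetical value): integers beyond 2^53 cannot be represented exactly as doubles, and bit64 supplies the print method that displays integer64 columns correctly:

library(bit64)

x <- as.integer64("9007199254740993")  # 2^53 + 1, not exactly representable as a double
print(x)
# integer64
# [1] 9007199254740993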


Edit: I just tried the same call on a 251 MB file, and the total import time was 1.844106 seconds.

answered Sep 29 '22 by Nathaniel Payne