Row limit for data.table in R using fread

Tags:

r

data.table

rcpp

I wanted to know if there is a limit to the number of rows that can be read using the data.table fread function. I am working with a table with 4 billion rows and 4 columns, about 40 GB in size. It appears that fread reads only the first ~840 million rows. It does not give any errors, but returns to the R prompt as if it had read all the data!
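For scale, the sizes quoted above can be compared against R's 32-bit integer limit. This is only a back-of-the-envelope sanity check under the figures given in the question; it is not a confirmed explanation of why the read stopped at ~840 million rows.

```r
# Back-of-the-envelope check of the sizes mentioned in the question.
n_rows  <- 4e9                      # rows in the file (figure from the question)
n_cols  <- 4                        # columns in the file (figure from the question)
int_max <- .Machine$integer.max     # 2147483647, i.e. 2^31 - 1

n_items <- n_rows * n_cols          # total number of values in the file

rows_exceed  <- n_rows  > int_max   # does the row count alone pass the 32-bit limit?
items_exceed <- n_items > int_max   # does the total number of values pass it?

print(c(rows = rows_exceed, items = items_exceed))
```

If both comparisons come out TRUE, any approach that indexes rows or items with 32-bit integers would need the data to be processed in pieces, whatever the actual cause of the truncation turns out to be.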

I understand that fread is not for "prod use" at the moment, and wanted to find out if there was any timeframe for implementation of a prod-release.

The reason I am using data.table is that, for files of such sizes, it is extremely efficient at processing the data compared to loading the file in a data.frame, etc.

At the moment, I am trying 2 other alternatives -

1) Using scan and passing on to a data.table

data.table(matrix(scan("file.csv",what="integer",sep=","),ncol=4))

Resulted in --
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  too many items

2) Breaking the file up into segments of approximately 500 million rows each using Unix split, then looping over the segments and reading each one with fread. This is a bit cumbersome, but appears to be the only workable solution.
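The second alternative could look roughly like the sketch below. It is a minimal illustration, not the code actually used: the chunk files, their names (part_1.csv and so on), and the per-chunk summary are all hypothetical stand-ins, and the tiny files created here only mimic headerless pieces produced by split.

```r
library(data.table)

# Hypothetical setup: three small headerless chunk files standing in for
# the pieces produced by Unix split (the real pieces would be far larger).
chunk_dir <- tempfile("chunks_")
dir.create(chunk_dir)
for (k in 1:3) {
  fwrite(data.table(a = 1:5, b = 6:10, c = 11:15, d = k),
         file.path(chunk_dir, sprintf("part_%d.csv", k)),
         col.names = FALSE)
}

chunk_files <- sort(list.files(chunk_dir, full.names = TRUE))

# Read each piece with fread and reduce it straight away, so that only the
# (much smaller) per-chunk result is kept in memory.
per_chunk <- lapply(chunk_files, function(f) {
  dt <- fread(f, header = FALSE)            # columns are named V1..V4
  dt[, .(n = .N, sum_a = sum(V1)), by = V4] # hypothetical per-chunk summary
})

# Combine the per-chunk results into one data.table.
result <- rbindlist(per_chunk)
print(result)
```

Reducing each chunk before combining keeps the final table small; whether that is possible depends on the analysis being done.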

I think there may be an Rcpp way to do this even faster, but am not sure how it is generally implemented.

Thanks in advance.

asked Jul 11 '13 14:07

xbsd

1 Answer

I was able to accomplish this using feedback from another post on Stack Overflow. The process was very fast: 40 GB of data was read in about 10 minutes by calling fread iteratively. Using foreach with %dopar% on its own to read the files into new data.tables failed because of limitations that are also described in the post referenced below.

Note: The file list (file_map) was prepared by simply running --

file_map <- list.files(pattern="test.$")  # Replace pattern to suit your requirement
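Because the pattern argument of list.files is a regular expression, "test.$" matches names that end in "test" followed by exactly one more character. The sketch below checks this against a few made-up file names (all of them hypothetical) using grepl, which applies the same kind of pattern.

```r
# Hypothetical file names, used only to illustrate what the pattern matches.
candidates <- c("test1", "testa", "testaa", "mytest.csv", "test")

# "test" followed by exactly one arbitrary character at the end of the name.
matches <- grepl("test.$", candidates)

matched_names <- candidates[matches]
print(matched_names)
```

If the segments carry longer suffixes, the pattern needs to be adjusted accordingly.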

The post referred to above is: mclapply with big objects - "serialization is too large to store in a raw vector"

Quoting --

library(data.table)  # fread, rbindlist
library(parallel)    # mclapply

collector <- vector("list", length(file_map)) # more complex than normal for speed

for (index in seq_along(file_map)) {
  # Each call handles a single file, so every result is a one-element list
  reduced_set <- mclapply(file_map[[index]], function(x) {
    on.exit(message(sprintf("Completed: %s", x)))
    message(sprintf("Started: '%s'", x))
    fread(x)             # <----- CHANGED THIS LINE to fread
  }, mc.cores = 10)
  collector[[index]] <- reduced_set
}

# Additional lines (in place of rbind as in the post referenced above)

finalList <- data.table()  # start empty; each iteration appends one processed segment
for (i in seq_along(collector)) {
  finalList <- rbindlist(list(finalList, yourFunction(collector[[i]][[1]])))
}
# Replace yourFunction as needed; in my case it was an operation I performed on each segment, and the results were joined with rbindlist at the end.
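The combining step can also be written as a single rbindlist call over a list of processed segments, which avoids rebuilding the growing table on every iteration. The sketch below is an illustration under assumptions: collector here is a small hand-made stand-in for the list built by the loop, and yourFunction is a placeholder filter, not the operation actually used.

```r
library(data.table)

# Hypothetical stand-ins: collector mimics the list of one-element lists
# built by the mclapply loop; yourFunction is a placeholder operation.
collector <- list(
  list(data.table(id = 1:3, value = c(10, 20, 30))),
  list(data.table(id = 4:5, value = c(40, 50)))
)
yourFunction <- function(dt) dt[value > 15]

# Process every segment first, then bind all results in one call.
processed <- lapply(collector, function(x) yourFunction(x[[1]]))
finalList <- rbindlist(processed)
print(finalList)
```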

My function included a loop using foreach with %dopar% that executed across several cores for each file listed in file_map. This allowed me to use %dopar% without encountering the "serialization is too large" error that occurred when running on the combined file.

Another helpful post is at -- loading files in parallel not working with foreach + data.table

answered Oct 03 '22 23:10

xbsd
