Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue, etc. to play around with the data in the CSV. Reading it in, however, is quite the time sink. I realise it would be better to use MySQL or similar, but assume the use of an AWS 8xl cluster compute instance running R 2.13, with specs as follows:
Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (2 x eight-core Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Any thoughts / ideas much appreciated.
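(For context, a common base-R mitigation is to give read.csv the column types and an over-estimate of the row count up front, so it can skip type detection and repeated reallocation. A minimal sketch; the file name, column classes, and row count below are illustrative assumptions, not taken from the question:)

# Illustrative only: adjust classes and nrows to the actual file
classes <- c("integer", "numeric", "numeric", "character", "numeric")
df <- read.csv("myFile.csv",
               colClasses = classes,       # skip per-column type detection
               nrows = 1100000,            # slight over-estimate of the row count
               comment.char = "",          # disable comment scanning
               stringsAsFactors = FALSE)   # keep character columns as character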
Method 3: Using fread(). If the CSV files are extremely large, the best way to import them into R is with the fread() method from the data.table package; the result comes back as a data.table rather than a data.frame.
Loading a large dataset: use fread() or functions from readr instead of read.xxx(). If you really need to read an entire CSV into memory, R users by default reach for read.table or variations thereof (such as read.csv).
Comparing the read times, fread() from the data.table package is around 40 times faster than the base functions and 8.5 times faster than read_csv from the readr package.
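As a rough way to reproduce that comparison on your own data, you can time the three readers side by side with system.time(); the file name is a placeholder and the actual ratios will depend on the file:

library(data.table)
library(readr)

system.time(df_base  <- read.csv("myFile.csv"))           # base R
system.time(df_readr <- readr::read_csv("myFile.csv"))    # readr
system.time(dt       <- data.table::fread("myFile.csv"))  # data.table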
Going parallel might not be needed if you use fread in data.table.
library(data.table)
dt <- fread("myFile.csv")  # fread() auto-detects the separator and column types
A comment on this question illustrates its power. Also, here's an example from my own experience:
d1 <- fread('Tr1PointData_ByTime_new.csv')
Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09
I was able to read in 1.04 million rows in under 10s!
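If the file is still slow to read, recent versions of data.table expose a few fread() arguments worth trying; the column names and thread count below are illustrative assumptions, not taken from the example above:

library(data.table)
dt <- fread("Tr1PointData_ByTime_new.csv",
            select = c("Time", "Value"),  # hypothetical column names: read only what you need
            nThread = 8,                  # parallel parsing (data.table >= 1.10.5)
            showProgress = FALSE)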