
R: Is it possible to parallelize / speed up the reading in of a 20 million plus row CSV into R?

Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue etc to play around with the data in the CSV. Reading it in, however, is quite the time sink.
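For illustration, a minimal sketch of that post-load step, using the parallel package that superseded multicore (the data frame d and its columns are placeholders, not real data):

library(parallel)   # successor to the multicore package; forks, so Unix-alike only

# Toy stand-in for the loaded CSV; column names are illustrative
d <- data.frame(group_id = rep(1:8, each = 1e5), value = rnorm(8e5))

# Split by group and farm the per-group work out across cores
chunks  <- split(d, d$group_id)
results <- mclapply(chunks, function(chunk) mean(chunk$value),
                    mc.cores = detectCores())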

I realise it would be better to use MySQL etc., but assume the data stays in a CSV for now.

Assume the use of an AWS 8xl cluster compute instance running R 2.13.

Specs as follows:

Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (2 x eight-core Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)

Any thoughts / ideas much appreciated.

n.e.w, asked Jan 30 '12


People also ask

How do I read a large CSV file in R?

If the CSV file is extremely large, the best way to import it into R is with fread() from the data.table package. The data is returned as a data.table in this case.

How do I read a large file in R?

To load a large dataset, use fread() or functions from readr instead of read.xxx(). If you really need to read an entire CSV into memory, R users tend by default to use read.table() or variations thereof (such as read.csv()).

Is read.table faster than read.csv?

Comparing read times, the data.table package is around 40 times faster than base R and 8.5 times faster than read_csv from the readr package.
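Exact figures like those depend on the file and the machine, but a minimal way to reproduce such a comparison on your own file (myFile.csv is a placeholder) is:

library(readr)
library(data.table)

f <- "myFile.csv"   # placeholder; substitute the real 20-million-row file

system.time(base_df  <- read.csv(f))    # base R
system.time(readr_df <- read_csv(f))    # readr
system.time(dt       <- fread(f))       # data.table, typically the fastest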


1 Answer

Going parallel might not be needed if you use fread in data.table.

library(data.table)
dt <- fread("myFile.csv")

A comment on this question illustrates its power. Also, here's an example from my own experience:

d1 <- fread('Tr1PointData_ByTime_new.csv')
Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09

I was able to read in over a million rows in under 10 seconds!
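As a side note on the parallel part of the question: data.table releases from well after R 2.13 parallelise fread internally. The nThread argument (default getDTthreads()) controls the thread count, and supplying colClasses can avoid a second pass when column types are guessed wrong. A sketch, with the file name and column names assumed rather than taken from the question:

library(data.table)

# nThread defaults to getDTthreads(); shown explicitly for clarity.
# "myFile.csv", "id" and "value" are assumptions, not from the answer.
dt <- fread("myFile.csv",
            nThread      = getDTthreads(),
            colClasses   = list(character = "id", numeric = "value"),
            showProgress = TRUE)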

Richard Erickson, answered Sep 25 '22