Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue, etc. to play around with the data in the CSV. Reading it in, however, is quite the time sink. I realise it would be better to use MySQL or similar, but assume the use of an AWS 8xl cluster compute instance running R 2.13, with specs as follows:
Cluster Compute Eight Extra Large specifications:
88 EC2 Compute Units (2 x eight-core Intel Xeon)
60.5 GB of memory
3370 GB of instance storage
64-bit platform
I/O Performance: Very High (10 Gigabit Ethernet)
Any thoughts / ideas much appreciated.
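(For context, a common base-R mitigation is to give read.csv the column types and an over-estimate of the row count up front, so it can skip type detection and repeated reallocation. A minimal sketch; the file name, column classes, and row count below are illustrative assumptions, not taken from the question:)

# Illustrative only: adjust classes and nrows to the actual file
classes <- c("integer", "numeric", "numeric", "character", "numeric")
df <- read.csv("myFile.csv",
               colClasses = classes,       # skip per-column type detection
               nrows = 1100000,            # slight over-estimate of the row count
               comment.char = "",          # disable comment scanning
               stringsAsFactors = FALSE)   # keep character columns as character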
Method 3: Using fread(). If the CSV files are extremely large, the best way to import them into R is with the fread() method from the data.table package; the result comes back as a data.table rather than a data.frame.
Loading a large dataset: use fread() or functions from readr instead of read.xxx(). If you really need to read an entire CSV into memory, R users by default reach for read.table or variations thereof (such as read.csv).
Comparing the read times, fread() from the data.table package is around 40 times faster than the base functions and 8.5 times faster than read_csv from the readr package.
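As a rough way to reproduce that comparison on your own data, you can time the three readers side by side with system.time(); the file name is a placeholder and the actual ratios will depend on the file:

library(data.table)
library(readr)

system.time(df_base  <- read.csv("myFile.csv"))           # base R
system.time(df_readr <- readr::read_csv("myFile.csv"))    # readr
system.time(dt       <- data.table::fread("myFile.csv"))  # data.table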
Going parallel might not be needed if you use fread in data.table.
library(data.table)
dt <- fread("myFile.csv")  # fread() auto-detects the separator and column types
A comment on this question illustrates its power. Also, here's an example from my own experience:
d1 <- fread('Tr1PointData_ByTime_new.csv')
Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09
I was able to read in 1.04 million rows in under 10s!
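If the file is still slow to read, recent versions of data.table expose a few fread() arguments worth trying; the column names and thread count below are illustrative assumptions, not taken from the example above:

library(data.table)
dt <- fread("Tr1PointData_ByTime_new.csv",
            select = c("Time", "Value"),  # hypothetical column names: read only what you need
            nThread = 8,                  # parallel parsing (data.table >= 1.10.5)
            showProgress = FALSE)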