
Read large csv file from S3 into R

Tags: r, csv, amazon-s3

I need to load a 3 GB CSV file with about 18 million rows and 7 columns from S3 into R (or RStudio). My code for reading data from S3 usually works like this:

library("aws.s3")
obj <-get_object("s3://myBucketName/aFolder/fileName.csv")  
csvcharobj <- rawToChar(obj)  
con <- textConnection(csvcharobj)  
data <- read.csv(file = con)

Now, with the file being much bigger than usual, I receive an error:

> csvcharobj <- rawToChar(obj)  
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68

Reading this post, I understand that the vector is too long, but how would I subset the data in this case? Are there any other suggestions for how to deal with larger files read from S3?
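For context, one way to subset while reading is to download the object to a local temp file first and then read it in chunks. This is only a sketch: it assumes the readr package is available, and someColumn is a placeholder for whatever filter is actually needed.

library(aws.s3)
library(readr)

# Download to a temp file first, so the 3 GB of text never has to pass
# through rawToChar(), which cannot handle long raw vectors.
tmp <- save_object("s3://myBucketName/aFolder/fileName.csv",
                   file = tempfile(fileext = ".csv"))

# Keep only the rows you need from each 1-million-row chunk;
# "someColumn" is a placeholder column name.
keep_rows <- function(chunk, pos) subset(chunk, someColumn > 0)

data <- read_csv_chunked(tmp,
                         callback = DataFrameCallback$new(keep_rows),
                         chunk_size = 1e6)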

asked Oct 10 '17 by Tom

People also ask

How do I import a large CSV file into R?

If the CSV files are extremely large, the best way to import them into R is to use the fread() function from the data.table package. The data will be returned as a data.table in this case.

How do I import a large dataset into R?

Loading a large dataset: use fread() or functions from readr instead of read.xxx(). If you really need to read an entire CSV into memory, by default R users use the read.table() method or variations thereof (such as read.csv()).
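As a concrete illustration of the fread() advice above (a minimal sketch; the file name mirrors the question):

library(data.table)

# fread() detects the separator and column types automatically and reads
# in parallel, which makes it much faster than read.csv() on multi-GB files.
dt <- fread("fileName.csv")                      # returns a data.table
df <- fread("fileName.csv", data.table = FALSE)  # or a plain data.frame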


2 Answers

Building on Hugh's comment on the OP, and adding an answer for those wishing to load regular-sized CSVs from S3.

At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.

Thus

data <- 
    aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")

will do the trick. However, if you want your work to run faster and cleaner, I prefer this:

library(data.table)   # provides fread()
library(magrittr)     # provides %>%

data <- 
    aws.s3::s3read_using(fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
    janitor::clean_names()
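Since s3read_using() passes extra arguments on to the reader function (per the aws.s3 documentation), fread() options can also be used to trim a very large file while it is being read. A sketch, with hypothetical column names and data.table/magrittr loaded as above:

# select= and nrows= are forwarded through s3read_using() to fread(),
# so only the named columns (and, here, the first million rows) are kept.
data <- 
  aws.s3::s3read_using(fread,
                       select = c("col_a", "col_b", "col_c"),
                       nrows  = 1e6,
                       object = "s3://your_bucketname/your_object_name.csv.gz") %>%
  janitor::clean_names()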

Previously, the more verbose method below was required:

library(aws.s3)

# save_object() downloads the file into the working directory and returns its
# local path, which fread() then reads
data <- 
  save_object("s3://myBucketName/directoryName/fileName.csv") %>%
  data.table::fread()

It works for files up to at least 305 MB.

A better alternative to filling up your working directory with a copy of every csv you load:

# download to a throwaway temporary file instead of the working directory
data <- 
  save_object("s3://myBucketName/directoryName/fileName.csv",
              file = tempfile(fileext = ".csv")
             ) %>%
  fread()

If you are curious about where the tempfile is located, Sys.getenv() can give some insight: see TMPDIR, TEMP, or TMP. More information can be found in the base R tempfile docs.
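For example (a small sketch; the bucket and file names mirror the answer above):

library(aws.s3)
library(data.table)
library(magrittr)

# Where this R session keeps its temporary files
tempdir()
Sys.getenv(c("TMPDIR", "TEMP", "TMP"))

# The temporary copy can also be removed explicitly once it has been read;
# otherwise it disappears when the R session ends.
tmp  <- tempfile(fileext = ".csv")
data <- fread(save_object("s3://myBucketName/directoryName/fileName.csv", file = tmp))
unlink(tmp)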

answered Oct 07 '22 by leerssej


If you are on Spark or similar, another workaround would be to read/load the CSV into a DataTable and continue processing it with R Server / sparklyr, as sketched below.
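A rough sketch of that approach with sparklyr. It assumes a Spark installation whose Hadoop libraries include the S3A connector and that AWS credentials are already configured on the cluster; the bucket and file names are placeholders taken from the question.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# The CSV is read and held by Spark; only results are collected back into R.
big_tbl <- spark_read_csv(sc, name = "big_csv",
                          path = "s3a://myBucketName/aFolder/fileName.csv")

big_tbl %>%
  count() %>%
  collect()

spark_disconnect(sc)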

answered Oct 07 '22 by Ulrich Beck