
Read large csv file from S3 into R

Tags: r, csv, amazon-s3

I need to load a 3 GB CSV file with about 18 million rows and 7 columns from S3 into R (or RStudio). My code for reading data from S3 usually works like this:

library("aws.s3")
obj <-get_object("s3://myBucketName/aFolder/fileName.csv")  
csvcharobj <- rawToChar(obj)  
con <- textConnection(csvcharobj)  
data <- read.csv(file = con)

Now, with the file being much bigger than usual, I receive an error:

> csvcharobj <- rawToChar(obj)  
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68

Reading this post, I understand that the vector is too long, but how would I subset the data in this case? Are there any other suggestions for how to deal with larger files read from S3?
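For context, one way to subset while reading is to download the object to a local temp file first and then read it in chunks. This is only a sketch: it assumes the readr package is available, and someColumn is a placeholder for whatever filter is actually needed.

library(aws.s3)
library(readr)

# Download to a temp file first, so the 3 GB of text never has to pass
# through rawToChar(), which cannot handle long raw vectors.
tmp <- save_object("s3://myBucketName/aFolder/fileName.csv",
                   file = tempfile(fileext = ".csv"))

# Keep only the rows you need from each 1-million-row chunk;
# "someColumn" is a placeholder column name.
keep_rows <- function(chunk, pos) subset(chunk, someColumn > 0)

data <- read_csv_chunked(tmp,
                         callback = DataFrameCallback$new(keep_rows),
                         chunk_size = 1e6)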

asked Oct 10 '17 by Tom

People also ask

How do I import a large CSV file into R?

If the CSV files are extremely large, the best way to import them into R is to use the fread() function from the data.table package. The data will be returned as a data.table in this case.

How do I import a large dataset into R?

Loading a large dataset: use fread() or functions from readr instead of read.xxx(). If you really need to read an entire CSV into memory, by default R users use the read.table() method or variations thereof (such as read.csv()).
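As a concrete illustration of the fread() advice above (a minimal sketch; the file name mirrors the question):

library(data.table)

# fread() detects the separator and column types automatically and reads
# in parallel, which makes it much faster than read.csv() on multi-GB files.
dt <- fread("fileName.csv")                      # returns a data.table
df <- fread("fileName.csv", data.table = FALSE)  # or a plain data.frame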


2 Answers

Building on Hugh's comment on the OP, and adding an answer for those wishing to load regular-sized CSVs from S3.

At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.

Thus

data <- 
    aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")

will do the trick. However, if you want your work to run faster and cleaner, I prefer this:

library(data.table)   # provides fread()
library(magrittr)     # provides %>%

data <- 
    aws.s3::s3read_using(fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
    janitor::clean_names()
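Since s3read_using() passes extra arguments on to the reader function (per the aws.s3 documentation), fread() options can also be used to trim a very large file while it is being read. A sketch, with hypothetical column names and data.table/magrittr loaded as above:

# select= and nrows= are forwarded through s3read_using() to fread(),
# so only the named columns (and, here, the first million rows) are kept.
data <- 
  aws.s3::s3read_using(fread,
                       select = c("col_a", "col_b", "col_c"),
                       nrows  = 1e6,
                       object = "s3://your_bucketname/your_object_name.csv.gz") %>%
  janitor::clean_names()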

Previously, the more verbose method below was required:

library(aws.s3)

# save_object() downloads the file into the working directory and returns its
# local path, which fread() then reads
data <- 
  save_object("s3://myBucketName/directoryName/fileName.csv") %>%
  data.table::fread()

It works for files up to at least 305 MB.

A better alternative to filling up your working directory with a copy of every csv you load:

# download to a throwaway temporary file instead of the working directory
data <- 
  save_object("s3://myBucketName/directoryName/fileName.csv",
              file = tempfile(fileext = ".csv")
             ) %>%
  fread()

If you are curious about where the tempfile is located, Sys.getenv() can give some insight: see TMPDIR, TEMP, or TMP. More information can be found in the base R tempfile docs.
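For example (a small sketch; the bucket and file names mirror the answer above):

library(aws.s3)
library(data.table)
library(magrittr)

# Where this R session keeps its temporary files
tempdir()
Sys.getenv(c("TMPDIR", "TEMP", "TMP"))

# The temporary copy can also be removed explicitly once it has been read;
# otherwise it disappears when the R session ends.
tmp  <- tempfile(fileext = ".csv")
data <- fread(save_object("s3://myBucketName/directoryName/fileName.csv", file = tmp))
unlink(tmp)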

answered Oct 07 '22 by leerssej


If you are on Spark or similar, another workaround would be to read/load the CSV into a DataTable and continue processing it with R Server / sparklyr, as sketched below.
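A rough sketch of that approach with sparklyr. It assumes a Spark installation whose Hadoop libraries include the S3A connector and that AWS credentials are already configured on the cluster; the bucket and file names are placeholders taken from the question.

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# The CSV is read and held by Spark; only results are collected back into R.
big_tbl <- spark_read_csv(sc, name = "big_csv",
                          path = "s3a://myBucketName/aFolder/fileName.csv")

big_tbl %>%
  count() %>%
  collect()

spark_disconnect(sc)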

answered Oct 07 '22 by Ulrich Beck