I need to load a 3 GB csv file with about 18 million rows and 7 columns from S3 into R (or RStudio). My code for reading data from S3 usually works like this:
library("aws.s3")
obj <- get_object("s3://myBucketName/aFolder/fileName.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
Now, with the file being much bigger than usual, I receive an error:
> csvcharobj <- rawToChar(obj)
Error in rawToChar(obj) : long vectors not supported yet: raw.c:68
Reading this post, I understand that the vector is too long, but how would I subset the data in this case? Any other suggestions on how to read larger files from S3?
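One way to sidestep the rawToChar() limit entirely (my suggestion, not from the answers below) is to never convert the raw vector to a character string: write the raw bytes straight to a temp file with writeBin() and read that file instead. A minimal local sketch, using a small in-memory CSV to stand in for the S3 object:

```r
# Stand-in for the raw vector returned by aws.s3::get_object();
# in practice obj would be the downloaded S3 object.
obj <- charToRaw("a,b,c\n1,2,3\n4,5,6\n")

# Write the bytes to disk instead of calling rawToChar(),
# which fails once the vector exceeds 2^31 - 1 bytes.
tmp <- tempfile(fileext = ".csv")
writeBin(obj, tmp)

data <- read.csv(tmp)
unlink(tmp)
```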
If the CSV files are extremely large, the best way to import them into R is with fread() from the data.table package; the data is then returned as a data.table.
When loading a large dataset, use fread() or functions from readr instead of read.xxx(). If you really need to read an entire csv into memory, by default R users reach for read.table or variations thereof (such as read.csv).
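To address the subsetting part of the question: once the file is on disk rather than in one raw vector, it can be read in chunks so the full 18 million rows never have to pass through a single character vector. A minimal local sketch of chunked reading with base read.csv() (the sample file and chunk size are illustrative):

```r
# Build a small sample file standing in for the 3 GB csv.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = letters[1:10]), tmp, row.names = FALSE)

chunk_size <- 4
header <- names(read.csv(tmp, nrows = 1))  # grab the column names once
chunks <- list()
skip <- 1  # always skip the header line

repeat {
  # Read the next chunk; an out-of-range skip raises an error, so trap it.
  chunk <- tryCatch(
    read.csv(tmp, skip = skip, nrows = chunk_size,
             header = FALSE, col.names = header),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  chunks[[length(chunks) + 1]] <- chunk
  if (nrow(chunk) < chunk_size) break  # last, partial chunk
  skip <- skip + chunk_size
}

data <- do.call(rbind, chunks)
unlink(tmp)
```

Each chunk could also be filtered or aggregated before being stored, so only the subset you care about stays in memory.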
Building on Hugh's comment on the OP, here is an answer for those wishing to load regular-size csvs from S3.
At least as of May 1, 2019, there is an s3read_using() function that allows you to read the object directly out of your bucket.
Thus
data <-
aws.s3::s3read_using(read.csv, object = "s3://your_bucketname/your_object_name.csv.gz")
will do the trick. However, if you want to make your work run faster and cleaner, I prefer this:
data <-
aws.s3::s3read_using(fread, object = "s3://your_bucketname/your_object_name.csv.gz") %>%
janitor::clean_names()
Previously the more verbose method below was required:
library(aws.s3)
data <-
save_object("s3://myBucketName/directoryName/fileName.csv") %>%
data.table::fread()
It works for files up to at least 305 MB.
A better alternative to filling up your working directory with a copy of every csv you load:
data <-
save_object("s3://myBucketName/directoryName/fileName.csv",
file = tempfile(fileext = ".csv")
) %>%
fread()
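One thing the snippet above leaves out (my addition, not part of the original answer): the tempfile is only cleaned up when the R session ends, so for repeated loads it can be worth deleting it explicitly once the data is in memory. A self-contained sketch of the pattern, with a local write and base read.csv standing in for save_object() and fread():

```r
tmp <- tempfile(fileext = ".csv")

# In practice this would be save_object("s3://...", file = tmp);
# a local write stands in for the download here.
write.csv(data.frame(a = 1:3), tmp, row.names = FALSE)

data <- read.csv(tmp)
unlink(tmp)       # remove the temp copy once it is loaded
file.exists(tmp)  # FALSE
```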
If you are curious about where the tempfile is located, Sys.getenv() can give some insight; see TMPDIR, TEMP, or TMP. More information can be found in the base R tempfile docs.
If you are on Spark or similar, another workaround is to read/load the csv into a DataTable and continue processing it with R Server / sparklyr.