How can I download multiple objects from S3 simultaneously?

I have lots (millions) of small log files in S3, each named by server and date/time, i.e. servername-yyyy-mm-dd-HH-MM, e.g.

s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
...
s3://my_bucket/uk4339-2015-05-07-19-23.csv
s3://my_bucket/uk4339-2015-05-07-19-24.csv
...
etc

From EC2, using the AWS CLI, I would like to simultaneously download all files from 2015 whose minute equals 16, for servers uk4339 and uk4338 only.

Is there a clever way to do this?

Also, if this is a terrible file structure in S3 for querying data, I would be extremely grateful for any advice on how to set it up better.

I can put a relevant aws s3 cp ... command into a loop in a shell/bash script to sequentially download the relevant files, but I was wondering if there was something more efficient.

As an added bonus I would like to row bind the results together too as one csv.

A mock CSV file can be generated in R with this line of code:

R> write.csv(data.frame(cbind(a1=rnorm(100),b1=rnorm(100),c1=rnorm(100))),file='uk4339-2015-05-07-19-24.csv',row.names=FALSE)

The csv that is created is uk4339-2015-05-07-19-24.csv. FYI, I will be importing the combined data into R at the end.


asked May 07 '15 17:05

h.l.m

1 Answer

As you didn't answer my questions, nor indicate what OS you use, it is somewhat hard to make any concrete suggestions, so I will briefly suggest you use GNU Parallel to parallelise your S3 fetch requests to get around the latency.

Suppose you somehow generate a list of all the S3 files you want and put the resulting list in a file called GrabMe.txt like this

s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
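One way the list might be generated is sketched below. It assumes the naming scheme from the question (servername-yyyy-mm-dd-HH-MM.csv), the bucket my_bucket, and key names without spaces; the function name build_list is made up for illustration. It keeps only servers uk4338 and uk4339, year 2015, minute 16.

```shell
#!/usr/bin/env bash
# Sketch only: bucket name, server names and key pattern come from the
# question; build_list is a hypothetical helper name.

# Read bare key names on stdin, keep servers uk4338/uk4339, year 2015,
# minute 16, and print each match as a full S3 URI.
build_list() {
  grep -E '^uk433[89]-2015-[0-9]{2}-[0-9]{2}-[0-9]{2}-16\.csv$' |
    sed 's|^|s3://my_bucket/|'
}

# Assumed usage (not run here): "aws s3 ls" prints date, time, size and key,
# so awk keeps the fourth column before filtering.
#   aws s3 ls s3://my_bucket/ | awk '{print $4}' | build_list > GrabMe.txt
```

With millions of objects the listing itself will take a while. Narrowing it with a key prefix such as s3://my_bucket/uk433 should reduce the work, but that is worth checking against your own bucket.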

Then you can get them in parallel, say 32 at a time, like this:

parallel -j 32 echo aws s3 cp {} . < GrabMe.txt

Or, if you prefer reading left to right:

cat GrabMe.txt | parallel -j 32 echo aws s3 cp {} .

You can obviously alter the number of parallel requests from 32 to any other number. At the moment, it just echoes the command it would run, but you can remove the word echo when you see how it works.
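Once the downloads have finished, the row-binding step from the question could be sketched as follows. This assumes every file starts with the same single header line (as write.csv produces) and that no field contains an embedded newline; the function name combine_csv is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: combine_csv is a hypothetical helper name.

# Usage: combine_csv OUTPUT INPUT...
# Concatenate the input CSV files into OUTPUT, keeping the header line
# from the first file and dropping the first line of every later file.
combine_csv() {
  local out=$1
  shift
  # FNR is the line number within the current file, NR the overall count,
  # so "FNR == 1 && NR != 1" matches the header of every file but the first.
  awk 'FNR == 1 && NR != 1 { next } { print }' "$@" > "$out"
}

# Assumed usage (not run here):
#   combine_csv combined.csv uk433*-2015-*-16.csv
```

With a very large number of files the glob may exceed the shell's argument-length limit, in which case the files would need to be fed in batches. The combined file can then be read into R with read.csv.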

There is a good tutorial here, and Ole Tange (the author of GNU Parallel) is on SO, so we are in good company.

answered Oct 18 '22 08:10

Mark Setchell

Tags: r, amazon-web-services, amazon-s3, aws-cli, amazon-ec2
