
How can I download multiple objects from S3 simultaneously?

I have lots (millions) of small log files in S3, each named so that the server and date/time identify it, i.e. servername-yyyy-mm-dd-HH-MM, e.g.

s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
...
s3://my_bucket/uk4339-2015-05-07-19-23.csv
s3://my_bucket/uk4339-2015-05-07-19-24.csv
...
etc

From EC2, using the AWS CLI, I would like to simultaneously download all files that have the minute equal to 16 for 2015, for only servers uk4339 and uk4338.

Is there a clever way to do this?

Also, if this is a terrible file structure in S3 for querying data, I would be extremely grateful for any advice on how to set it up better.

I can put a relevant aws s3 cp ... command into a loop in a shell/bash script to sequentially download the relevant files, but I was wondering if there is something more efficient.
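For instance, something like this is what I have at the moment (a rough sketch, not tested; it leans on the AWS CLI's --exclude/--include filters, with the bucket and server names from the examples above):

for server in uk4338 uk4339; do
    # the trailing "-16.csv" pins the MM field of servername-yyyy-mm-dd-HH-MM.csv to 16
    aws s3 cp s3://my_bucket/ . --recursive --exclude "*" --include "${server}-2015-*-16.csv"
done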

As an added bonus, I would like to row-bind the results together into one CSV.
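Something like this shell sketch is the kind of row bind I have in mind (assuming every file shares the same single header row, which is kept once from the first file):

# keep the header from the first file, then append the data rows of every file
head -n 1 "$(ls uk43*.csv | head -n 1)" > combined.csv
for f in uk43*.csv; do
    tail -n +2 "$f" >> combined.csv
done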

A quick example of a mock CSV file can be generated with this line of R code:

R> write.csv(data.frame(cbind(a1=rnorm(100),b1=rnorm(100),c1=rnorm(100))),file='uk4339-2015-05-07-19-24.csv',row.names=FALSE)

The CSV that is created is uk4339-2015-05-07-19-24.csv. FYI, I will be importing the combined data into R at the end.

asked May 07 '15 by h.l.m



1 Answer

As you didn't answer my questions, nor indicate what OS you use, it is somewhat hard to make any concrete suggestions, so I will briefly suggest you use GNU Parallel to parallelise your S3 fetch requests to get around the latency.

Suppose you somehow generate a list of all the S3 files you want (one way is sketched just after this list) and put the resulting list in a file called GrabMe.txt, like this:

s3://my_bucket/uk4039-2015-05-07-18-15.csv
s3://my_bucket/uk4039-2015-05-07-18-16.csv
s3://my_bucket/uk4039-2015-05-07-18-17.csv
s3://my_bucket/uk4039-2015-05-07-18-18.csv
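One possible way to build that list, matching the minute-16 filter for uk4338/uk4339 described in the question (a sketch that assumes the naming scheme above, and that aws s3 ls prints the key name in its fourth column):

# list the bucket, keep only 2015 keys for uk4338/uk4339 whose minute is 16,
# then prefix each key with the bucket URI
aws s3 ls s3://my_bucket/ | awk '{print $4}' | grep -E '^uk433[89]-2015-.*-16\.csv$' | sed 's|^|s3://my_bucket/|' > GrabMe.txt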

Then you can get them in parallel, say 32 at a time, like this:

parallel -j 32 echo aws s3 cp {} . < GrabMe.txt

or, if you prefer reading left-to-right:

cat GrabMe.txt | parallel -j 32 echo aws s3 cp {} . 

You can obviously alter the number of parallel requests from 32 to any other number. At the moment, it just echoes the command it would run, but you can remove the word echo when you see how it works.
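So, once the echoed commands look right, the actual run is just the same line without echo:

parallel -j 32 aws s3 cp {} . < GrabMe.txt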

There is a good tutorial here, and Ole Tange (the author of GNU Parallel) is on SO, so we are in good company.

answered Oct 18 '22 by Mark Setchell