How to use AWS CLI to only copy files in S3 bucket that match a given string pattern

I'm using the AWS CLI to copy files from an S3 bucket to my R machine using a command like below:

  system(
    "aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
    )

This works as expected, i.e. it copies all files in my_bucket_location that have "trans" in the filename at that location.

The problem that I am facing is that I have other files with similar naming conventions that I don't want to import in this step. As an example, in the list below I only want to copy the first two files, not the last two:

File list
trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv

If I were using regex I could make it more specific, like "^trans_\\d+", to bring in just the first two files, but this doesn't seem possible with the AWS CLI. So my question is: is there a way to do more complex pattern matching with the AWS CLI, like below?

  system(
    "aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '^trans_\\d+' --region us-east-1"
    )

Please note that I can only use information about the files I actually want, i.e. that they match the pattern "^trans_\\d+". I can't rely on the unwanted files starting with sum_, because this is only an example; there could be other files with similar names, like "check_trans_120215.csv".

I have considered the alternatives below, but I'm hoping there is a way to adjust the copy command so I can avoid going down either of these routes:

  • Listing all items in the bucket > using regex in R to specify the files that I want > only importing those files (a rough sketch of this is shown after this list)
  • Keeping the copy command as it is > deleting unwanted files on the R machine after the copy
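
For reference, the first alternative would look something like this (just a rough sketch, not what I'm running; it assumes the AWS CLI is configured for the account and that the object keys contain no spaces):

  # List the bucket contents, then filter the keys in R and copy the matches
  listing <- system(
    "aws s3 ls s3://my_bucket_location/ --region us-east-1",
    intern = TRUE
    )
  # `aws s3 ls` prints date, time, size and key; keep the last field of each line
  keys <- sapply(strsplit(trimws(listing), "\\s+"), tail, 1)
  keys <- keys[grepl("^trans_\\d+", keys, perl = TRUE)]
  for (key in keys) {
    system(
      sprintf(
        "aws s3 cp s3://my_bucket_location/%s ~/my_r_location/ --region us-east-1",
        key
        )
      )
  }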
asked Mar 25 '16 by Sam Gilbert



2 Answers

The alternatives that you have listed are the best options, because the S3 CLI doesn't support regex.

Use of Exclude and Include Filters:

Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.

*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence
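
Depending on how regular your filenames are, those symbols may already get you close enough. For example, the following should copy trans_120215.csv but skip sum_trans_120215.csv, because the include pattern has to match the key from its start (an untested sketch based on the filenames in your question):

  system(
    "aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include 'trans_[0-9]*' --region us-east-1"
    )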
answered Sep 28 '22 by helloV


Putting this here for others to find, since I just had to figure this out. Here's what I came up with:

s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '.*s3://[BUCKET]/[FILENAME]' | cut -c 41-)

You can put the regex in the grep search string. For instance, I was searching for specific files to delete (hence the s3cmd del). My regex looked like: '2016-11-04.*s3.*[DN][RS].*'. You may have to adjust the cut offset for your use case. It should also work with s3cmd get.
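
Adapted to the filenames in the question, that could look something like the following (an untested sketch: the grep pattern is an assumption, and the cut offset depends on how your s3cmd formats its listing, so it may need adjusting):

  system(
    "s3cmd get $(s3cmd ls s3://my_bucket_location/ | grep 's3://my_bucket_location/trans_[0-9]' | cut -c 41-) ~/my_r_location/"
    )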

answered Sep 28 '22 by crc32