I'm using the AWS CLI to copy files from an S3 bucket to my R machine using a command like below:
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
)
This works as expected, i.e. it copies all files in my_bucket_location whose filenames contain "trans".
The problem that I am facing is that I have other files with similar naming conventions that I don't want to import in this step. As an example, in the list below I only want to copy the first two files, not the last two:
File list
trans_120215.csv
trans_130215.csv
sum_trans_120215.csv
sum_trans_130215.csv
If I were using regex I could make the pattern more specific, e.g. "^trans_\\d+", to bring in just the first two files, but this doesn't seem possible using the AWS CLI. So my question is: is there a way to do more complex pattern matching with the AWS CLI, like below?
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '^trans_\\d+' --region us-east-1"
)
Please note that I can only use information about the files I do want, i.e. that they match the pattern "^trans_\\d+". I can't rely on the unwanted files starting with sum_, because this is only an example; there could be other files with similar names, like "check_trans_120215.csv".
I have considered other alternatives, such as copying everything that matches '*trans*' and then deleting the unwanted files afterwards (sketched below), but I'm hoping there is a way to adjust the copy command itself to avoid going down that route:
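For completeness, here is a minimal sketch of that copy-then-delete fallback. It assumes ~/my_r_location/ holds nothing but this download, since it deletes every local file that doesn't match the regex:

# Broad copy as before, then prune locally with a real regex
system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include '*trans*' --region us-east-1"
)
files <- list.files("~/my_r_location", full.names = TRUE)
# Keep only names matching ^trans_\d+ ; remove the rest of the download
unwanted <- files[!grepl("^trans_\\d+", basename(files), perl = TRUE)]
file.remove(unwanted)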
The alternatives that you have listed are the best options, because the S3 CLI doesn't support regex.
Use of Exclude and Include Filters:
Currently, there is no support for the use of UNIX style wildcards in a command's path arguments. However, most commands have --exclude "<value>" and --include "<value>" parameters that can achieve the desired result. These parameters perform pattern matching to either exclude or include a particular file or object. The following pattern symbols are supported.
*: Matches everything
?: Matches any single character
[sequence]: Matches any character in sequence
[!sequence]: Matches any character not in sequence
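Those wildcards are weaker than regex, but for the naming scheme in the question they may be enough. Filters are matched against the whole key relative to the source directory, so an include pattern of trans_[0-9]* only matches keys that begin with trans_ followed by a digit; sum_trans_120215.csv and check_trans_120215.csv fall through. Note this only checks the first character after the underscore (unlike ^trans_\\d+), so treat it as a sketch for the example layout rather than a general solution:

system(
"aws s3 cp s3://my_bucket_location/ ~/my_r_location/ --recursive --exclude '*' --include 'trans_[0-9]*' --region us-east-1"
)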
Putting this here for others to find, since I just had to figure this out. Here's what I came up with:
s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep '.*s3://[BUCKET]/[FILENAME]' | cut -c 41-)
You can put the regex in the grep search string. For instance, I was searching for specific files to delete (hence the s3cmd del); my regex looked like '2016-11-04.*s3.*[DN][RS].*'. You may have to adjust the cut for your use. It should also work with s3cmd get.
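One variant worth noting: the fixed cut -c 41- offset depends on the column widths in s3cmd ls output, so it can break. Taking the last whitespace-separated field with awk avoids that, assuming the object keys contain no spaces. In the same R-via-system() form used in the question, with [BUCKET] as a placeholder and the trans_ pattern from the example:

# Select the s3:// URL column with awk instead of a fixed character offset
system(
"s3cmd del $(s3cmd ls s3://[BUCKET]/ | grep 's3://[BUCKET]/trans_[0-9]' | awk '{print $NF}')"
)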