
How to grep into files stored in S3

Tags: grep, amazon-s3

Does anybody know how to run grep on files stored in S3, directly in the bucket, using the AWS CLI? For example, I have FILE1.csv and FILE2.csv, each with many rows, and I want to find the rows that contain the string JZZ.

aws s3 ls --recursive s3://mybucket/loaded/*.csv.gz | grep 'JZZ'
Msordi asked Dec 16 '16 07:12



2 Answers

The aws s3 cp command can send output to stdout:

aws s3 cp s3://mybucket/foo.csv - | grep 'JZZ'

The dash (-) signals the command to send output to stdout.
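The question's example path ends in .csv.gz, so the streamed bytes may need decompressing before grep sees them. Here is a minimal sketch of a wrapper around the command above; the function name s3grep is made up for illustration, and it assumes the AWS CLI is configured and gunzip is installed:

```shell
# Hypothetical helper (not part of the AWS CLI): stream one S3 object
# to stdout and grep it, decompressing first if the key ends in .gz.
# Usage: s3grep PATTERN S3_URI
s3grep() {
  case "$2" in
    *.gz) aws s3 cp "$2" - | gunzip -c | grep -e "$1" ;;
    *)    aws s3 cp "$2" - | grep -e "$1" ;;
  esac
}
```

For example, s3grep 'JZZ' s3://mybucket/loaded/FILE1.csv.gz should print the matching rows. The object is still downloaded in full; only the filtering happens locally.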

See: How to use AWS S3 CLI to dump files to stdout in BASH?

John Rotenstein answered Sep 20 '22 15:09


You can do it locally with the following command:

aws s3 ls --recursive s3://<bucket_name>/<path>/ | awk '{print $4}' | xargs -I FNAME sh -c "echo FNAME; aws s3 cp s3://<bucket_name>/FNAME - | grep --color=always '<regex_pattern>'"

Explanation: the ls command lists the objects under the path, awk picks the object key (the fourth column) out of each listing line, and xargs runs one shell per key that prints the key, streams the object from S3 to stdout, and greps it.
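One caveat with the one-liner: awk '{print $4}' cuts off any key that contains spaces. The same idea can be sketched as a shell function that strips the date, time, and size columns instead and prefixes each match with its key. The function name s3grep_prefix is hypothetical, and the sketch assumes uncompressed text objects and the usual "date time size key" layout of the ls output:

```shell
# Hypothetical helper: grep every object under a prefix.
# Usage: s3grep_prefix PATTERN BUCKET PREFIX
s3grep_prefix() {
  aws s3 ls --recursive "s3://$2/$3" |
    # Drop the date, time and size columns; keep the full key, spaces included.
    sed -E 's/^[^ ]+ +[^ ]+ +[0-9]+ //' |
    while IFS= read -r key; do
      # Stream the object to stdout and prefix each matching line with its key.
      aws s3 cp "s3://$2/$key" - </dev/null | grep -e "$1" |
        while IFS= read -r line; do
          printf '%s: %s\n' "$key" "$line"
        done
    done
}
```

For example, s3grep_prefix 'JZZ' mybucket loaded/ would scan every object under loaded/. Each object is fetched in a separate aws s3 cp call, so this is slow for prefixes with many objects.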

I don't recommend this approach if you have to download a lot of data from S3, because of the transfer costs. You can avoid the internet transfer costs, though, by running the command on an EC2 instance in a VPC that has an S3 VPC endpoint attached.

Eugen answered Sep 21 '22 15:09