How to extract files in S3 on the fly with boto3?


I'm trying to find a way to extract .gz files in S3 on the fly, that is, without needing to download them locally, extract them, and then push the extracted files back to S3.

With boto3 + Lambda, how can I achieve my goal?

I didn't see any extract functionality in the boto3 documentation.

asked Jul 11 '18 by The One


2 Answers

You can use BytesIO to stream the file from S3, run it through gzip to decompress it, then write it back up to S3 with upload_fileobj.

# python imports
import boto3
from io import BytesIO
import gzip

# setup constants
bucket = '<bucket_name>'
gzipped_key = '<key_name.gz>'
uncompressed_key = '<key_name>'

# initialize s3 client, this is dependent upon your aws config being done
s3 = boto3.client('s3', use_ssl=False)  # use_ssl=False is optional

s3.upload_fileobj(                      # upload a new obj to s3
    Fileobj=gzip.GzipFile(              # read in the output of gzip -d
        None,                           # just return output as BytesIO
        'rb',                           # read binary
        fileobj=BytesIO(s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read())),
    Bucket=bucket,                      # target bucket, writing to
    Key=uncompressed_key)               # target key, writing to
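For smaller objects, an equivalent and arguably simpler variant (my own sketch, not part of the original answer) is to decompress the bytes with gzip.decompress and write them back with put_object, assuming the same bucket and key constants as above:

# sketch: one-shot decompress-and-reupload for objects that fit in memory
import boto3
import gzip

s3 = boto3.client('s3')
raw = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()   # compressed bytes
s3.put_object(Bucket=bucket,
              Key=uncompressed_key,
              Body=gzip.decompress(raw))                             # uncompressed bytes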

Ensure that your key is reading in correctly:

# read the body of the s3 key object into bytes to ensure the download works
s = s3.get_object(Bucket=bucket, Key=gzipped_key)['Body'].read()
print(len(s))  # check to ensure some data was returned
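For the Lambda part of the question, here is a rough sketch of how this could be wired up as a handler triggered by S3 ObjectCreated events; it is an assumption about the setup, not part of the original answer, and names such as UNCOMPRESSED_PREFIX are hypothetical:

# sketch: Lambda handler that decompresses *.gz objects on upload
import gzip
from io import BytesIO
from urllib.parse import unquote_plus

import boto3

s3 = boto3.client('s3')
UNCOMPRESSED_PREFIX = 'uncompressed/'   # hypothetical destination prefix

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'])   # e.g. 'incoming/data.csv.gz'
        if not key.endswith('.gz'):
            continue                                         # ideally scope the trigger to .gz keys
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        s3.upload_fileobj(
            Fileobj=gzip.GzipFile(None, 'rb', fileobj=BytesIO(body)),
            Bucket=bucket,
            Key=UNCOMPRESSED_PREFIX + key[:-len('.gz')])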
answered Sep 21 '22 by Todd Jones

The above answer is for gzip files; for zip files, you may try:

import boto3
import zipfile
from io import BytesIO

bucket = 'bucket1'

s3 = boto3.client('s3', use_ssl=False)
Key_unzip = 'result_files/'

prefix = "folder_name/"
zipped_keys = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, Delimiter="/")
file_list = []
for key in zipped_keys['Contents']:
    file_list.append(key['Key'])
# This will give you the list of files in the folder you mentioned as prefix

s3_resource = boto3.resource('s3')
# Now create a zip object one by one; the below handles the 1st file in file_list
zip_obj = s3_resource.Object(bucket_name=bucket, key=file_list[0])
print(zip_obj)
buffer = BytesIO(zip_obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket=bucket,
        Key='result_files/' + f'{filename}')

This will work for your zip file, and the resulting unzipped data will be in the result_files folder. If you need to handle every archive under the prefix rather than only the first one, see the loop sketch below. Make sure to increase the memory and timeout on AWS Lambda to the maximum, since some files are pretty large and need time to write.
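The snippet above only unzips the first archive in file_list; a small loop along the same lines should cover all of them. This is a sketch, assuming the bucket, file_list, and s3_resource variables from the code above are already defined:

# sketch: unzip every archive found under the prefix, reusing bucket,
# file_list and s3_resource from the snippet above
for key in file_list:
    zip_obj = s3_resource.Object(bucket_name=bucket, key=key)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    with zipfile.ZipFile(buffer) as z:
        for filename in z.namelist():
            s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=bucket,
                Key='result_files/' + filename)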

answered Sep 20 '22 by Hari_pb