 

AWS Lambda (Python) Fails to unzip and store files in S3

The project currently maintains an S3 bucket which holds a large zip file (1.5 GB) containing .xpt and .sas7bdat files. The unzipped size is 20 GB.

I am trying to unzip the file and push the same folder structure to S3.

The following code works for small zip files but fails for the large zip file (1.5 GB):

import io
import zipfile

import boto3

client = boto3.client('s3')
bucket = boto3.resource('s3').Bucket('my-zip-bucket')
putObjects = []

for obj in bucket.objects.all():
    zipped = client.get_object(Bucket='my-zip-bucket', Key=obj.key)

    # Read the entire ZIP into memory -- this is where the Lambda runs out of RAM
    with io.BytesIO(zipped["Body"].read()) as tf:
        # rewind the buffer
        tf.seek(0)

        with zipfile.ZipFile(tf, mode='r') as zipf:
            for file in zipf.infolist():
                fileName = file.filename
                putFile = client.put_object(Bucket='my-un-zip-bucket-', Key=fileName, Body=zipf.read(file))
                putObjects.append(putFile)

Error: Memory Size: 3008 MB Max Memory Used: 3008 MB

I would like to validate:

  1. Is AWS Lambda not a suitable solution for large files?
  2. Should I use a different library/approach rather than reading everything into memory?
asked May 11 '18 by K.Pil



2 Answers

There is a serverless solution using AWS Glue! (I nearly died figuring this out)

This solution has two parts:

  1. A Lambda function that is triggered by S3 upon upload of a ZIP file and creates a GlueJobRun, passing the S3 object key as an argument to Glue.
  2. A Glue job that unzips the files (in memory!) and uploads them back to S3.

See my code below which unzips the ZIP file and places the contents back into the same bucket (configurable).

Please upvote if helpful :)

Lambda Script (python3) that calls a Glue Job called YourGlueJob

import boto3
import urllib.parse

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and object key of the uploaded ZIP from the S3 event
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    print(key)

    try:
        # Pass the bucket/key on to the Glue job as job arguments
        newJobRun = glue.start_job_run(
            JobName='YourGlueJob',
            Arguments={
                '--bucket': bucket,
                '--key': key,
            }
        )
        print("Successfully created unzip job")
        return key
    except Exception as e:
        print(e)
        print('Error starting unzip job for ' + key)
        raise e
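
To wire up part 1, the bucket needs an event notification that invokes this Lambda whenever a ZIP object is created. The snippet below is my own minimal sketch rather than part of the original answer; the function ARN is hypothetical, and S3 must already have been granted lambda:InvokeFunction permission on the function (for example via the Lambda add_permission API), otherwise this call is rejected.

import boto3

s3 = boto3.client('s3')

# Hypothetical function ARN -- replace with the ARN of the Lambda above
lambda_arn = 'arn:aws:lambda:us-east-1:123456789012:function:StartUnzipGlueJob'

s3.put_bucket_notification_configuration(
    Bucket='my-zip-bucket',
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': lambda_arn,
            'Events': ['s3:ObjectCreated:*'],
            # only fire for ZIP uploads
            'Filter': {'Key': {'FilterRules': [{'Name': 'suffix', 'Value': '.zip'}]}},
        }]
    }
)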

AWS Glue Job Script to unzip the files

import sys
import io
import zipfile

import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME, bucket, key]
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'bucket', 'key'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

bucket = args["bucket"]
key = args["key"]

# Fetch the ZIP object and load it into memory
# (a Glue worker has far more memory available than the Lambda did)
obj = s3r.Object(bucket_name=bucket, key=key)
buffer = io.BytesIO(obj.get()["Body"].read())

z = zipfile.ZipFile(buffer)
names = z.namelist()
for member in names:
    print(member)
    # Write each archive member back to S3, prefixing it with the ZIP's key
    y = z.open(member)
    arcname = key + member
    x = io.BytesIO(y.read())
    s3.upload_fileobj(x, bucket, arcname)
    y.close()
print(names)

job.commit()
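
For reference, the Glue job named YourGlueJob can be created through the console or via boto3; the sketch below is my own assumption (hypothetical IAM role and script location, standard Spark ETL command) and is not part of the original answer.

import boto3

glue = boto3.client('glue')

# Hypothetical role ARN and script location -- adjust to your account
glue.create_job(
    Name='YourGlueJob',
    Role='arn:aws:iam::123456789012:role/GlueUnzipJobRole',
    Command={
        'Name': 'glueetl',                         # Spark ETL job
        'ScriptLocation': 's3://my-glue-scripts/unzip_job.py',
        'PythonVersion': '3',
    },
    GlueVersion='2.0',
    DefaultArguments={'--job-language': 'python'},
)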
answered Oct 17 '22 by Ganondorfz


As described in the AWS Lambda limits documentation:

But there are limits that AWS Lambda imposes that include, for example, the size of your deployment package or the amount of memory your Lambda function is allocated per invocation.

Here, the issue you are having is because of the "amount of memory your Lambda function is allocated per invocation". Unfortunately, Lambda is not a suitable solution for this case; you need to go with an EC2-based approach.

When your overall memory requirement is high, I don't think Lambda is a great solution. I am not sure how the specified file types work, but in general, reading/processing large files is done with a chunked approach to avoid large memory requirements. Whether a chunked approach works or not depends on your business requirements.
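
As a rough illustration of the chunked idea (my own sketch with a hypothetical bucket and key, not part of this answer), boto3's streaming body can be consumed in fixed-size pieces instead of calling .read() on the whole object. Note that standard ZIP extraction still needs the central directory at the end of the archive, so this only addresses the raw read side.

import boto3

s3 = boto3.client('s3')

# Hypothetical bucket/key, for illustration only
resp = s3.get_object(Bucket='my-zip-bucket', Key='large-archive.zip')

total = 0
for chunk in resp['Body'].iter_chunks(chunk_size=8 * 1024 * 1024):  # 8 MB at a time
    total += len(chunk)  # process each chunk here instead of buffering everything
print('Read {} bytes without holding the whole object in memory'.format(total))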

answered Oct 17 '22 by INVOKE Cloud