
AWS Lambda: How to extract a tgz file in an S3 bucket and put it in another S3 bucket

I have an S3 bucket named "Source". Many '.tgz' files are pushed into that bucket in real time. I wrote Java code that extracts a '.tgz' file and pushes the contents into a "Destination" bucket, and I deployed that code as a Lambda function. In the Java code I get the '.tgz' file as an InputStream. How can I extract it in Lambda? I'm not able to create a file in Lambda; it throws "FileNotFound (Permission denied)" in Java.

AmazonS3 s3Client = new AmazonS3Client();
S3Object s3Object = s3Client.getObject(new GetObjectRequest(srcBucket, srcKey));
InputStream objectData = s3Object.getObjectContent();
File file = new File(s3Object.getKey());
OutputStream writer = new BufferedOutputStream(new FileOutputStream(file)); // <--- throws FileNotFoundException (Permission denied) here
Avis asked Dec 19 '22 20:12



1 Answer

Since one of the responses was in Python, I'm providing an alternative solution in that language.

The problem with the solution that uses the /tmp file system is that AWS allows only 512 MB of storage there by default. To untar or unzip larger files, it is better to use the BytesIO class from the io package and process the file contents purely in memory. AWS allows a Lambda to be assigned several gigabytes of RAM (the limit was 3 GB when I tested this), which raises the maximum file size significantly. I successfully tested untarring a 1 GB S3 file.
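
The in-memory idea described above can be tried locally, without any AWS access. The following is only a minimal sketch: make_sample_tgz and read_members are hypothetical helper names introduced for illustration, and the sample archive is built in memory just to keep the example self-contained.

```python
import tarfile
from io import BytesIO


def make_sample_tgz(files):
    """Build a gzip-compressed tar archive in memory from {name: bytes}.

    Hypothetical helper, used only to produce test input.
    """
    buffer = BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, BytesIO(data))
    return buffer.getvalue()


def read_members(archive_bytes):
    """Return {name: bytes} for every regular file, never touching disk."""
    contents = {}
    # The default mode detects gzip compression automatically.
    with tarfile.open(fileobj=BytesIO(archive_bytes)) as tar:
        for member in tar:
            if member.isfile():
                contents[member.name] = tar.extractfile(member).read()
    return contents


if __name__ == "__main__":
    sample = make_sample_tgz({"a.txt": b"hello", "dir/b.txt": b"world"})
    print(read_members(sample))
```

Note that this keeps both the archive and every extracted file in memory at the same time, so the Lambda's memory setting has to cover both.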

In my case, untarring ~2000 files from a 1 GB tar file into another S3 bucket took 140 seconds. It can be further optimized by using multiple threads to upload the untarred files to the target S3 bucket.
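
The multi-threaded upload mentioned above might look roughly like the sketch below. It is not the code that produced the timing; upload_all and its upload_one parameter are hypothetical names, and the worker count of 8 is an arbitrary assumption. The upload step is passed in as a callable so the example runs without AWS credentials.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_all(files, upload_one, max_workers=8):
    """Call upload_one(name, data) for every item of {name: bytes} in parallel.

    Returns the names that were uploaded. An exception raised by any
    upload is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every upload first so the pool can run them concurrently.
        futures = {pool.submit(upload_one, name, data): name
                   for name, data in files.items()}
        uploaded = []
        for future, name in futures.items():
            future.result()  # waits for this upload; re-raises on failure
            uploaded.append(name)
    return uploaded
```

In a Lambda, upload_one could be something like lambda name, data: s3_client.put_object(Bucket=target_bucket, Key=name, Body=data), where target_bucket is a placeholder for the destination bucket name. All extracted files are held in memory until their uploads finish, so this trades memory for speed.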

The example code below presents a single-threaded solution:

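import tarfile
from io import BytesIO
from urllib.parse import unquote_plus

import boto3

s3_client = boto3.client('s3')


def untar_s3_file(event, context):
    # The S3 event notification carries the bucket name and the URL-encoded object key.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    # Read the whole archive into memory (it must fit in the Lambda's RAM).
    input_tar_file = s3_client.get_object(Bucket=bucket, Key=key)
    input_tar_content = input_tar_file['Body'].read()

    # tarfile detects gzip compression automatically, so a .tgz works as-is.
    with tarfile.open(fileobj=BytesIO(input_tar_content)) as tar:
        for tar_resource in tar:
            if tar_resource.isfile():
                inner_file_bytes = tar.extractfile(tar_resource).read()
                # As written, this uploads into the source bucket; pass a different Bucket to target another one.
                s3_client.upload_fileobj(BytesIO(inner_file_bytes), Bucket=bucket, Key=tar_resource.name)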
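
A possible variation on the handler above, for archives that are too large to hold in memory, is to read the tar file in streaming mode so that only one member is in memory at a time. This is a sketch under assumptions, not something tested against the 1 GB file mentioned earlier: iter_tar_stream is a hypothetical name, and the archive is assumed to be gzip-compressed.

```python
import tarfile
from io import BytesIO


def iter_tar_stream(stream, mode="r|gz"):
    """Yield (name, data) for each regular file, reading the archive sequentially.

    `stream` only needs a read() method; it is never rewound.
    """
    with tarfile.open(fileobj=stream, mode=mode) as tar:
        for member in tar:
            if member.isfile():
                # In stream mode a member must be read before moving to the next.
                yield member.name, tar.extractfile(member).read()


if __name__ == "__main__":
    # Build a tiny .tgz in memory purely to exercise the function.
    raw = BytesIO()
    with tarfile.open(fileobj=raw, mode="w:gz") as out:
        info = tarfile.TarInfo(name="hello.txt")
        info.size = 5
        out.addfile(info, BytesIO(b"hello"))
    raw.seek(0)
    for name, data in iter_tar_stream(raw):
        print(name, len(data))
```

In the Lambda, the stream would be the 'Body' returned by s3_client.get_object, passed in directly instead of calling read() on it first. Peak memory then depends on the largest single file in the archive rather than on the archive as a whole.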
Łukasz Wachowicz answered Dec 21 '22 10:12
