
How to extract files from a zip archive in S3

I have a zip archive uploaded to S3 at a certain location (say /foo/bar.zip). I would like to extract the contents of bar.zip and place them under /foo without downloading and re-uploading the extracted files. How can I do this, so that S3 is treated pretty much like a file system?

Rpj asked Feb 03 '15 04:02

People also ask

Can we unzip a zip file in S3?

If you head to the Properties tab of your S3 bucket, you can set up an Event Notification for all object “create” events (or just PutObject events). As the destination, you can select the Lambda function where you will write your code to unzip and gzip files.
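
As a rough sketch of how that wiring might look with boto3 (the bucket name, function ARN, and statement ID below are placeholders, not values from the question):

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

bucket = "my-bucket"  # placeholder bucket name
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:unzip"  # placeholder ARN

# S3 must be allowed to invoke the function before the notification is added
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Fire the function for every PutObject of a .zip key under the foo/ prefix
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": function_arn,
                "Events": ["s3:ObjectCreated:Put"],
                "Filter": {
                    "Key": {
                        "FilterRules": [
                            {"Name": "prefix", "Value": "foo/"},
                            {"Name": "suffix", "Value": ".zip"},
                        ]
                    }
                },
            }
        ]
    },
)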

How do I upload a zip file to Amazon S3?

To upload folders and files to an S3 bucket: sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/. In the Buckets list, choose the name of the bucket that you want to upload your folders or files to, then choose Upload.
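
Programmatically, a minimal boto3 sketch of the same upload (the bucket name and local path are placeholders; foo/bar.zip is the key from the question):

import boto3

s3 = boto3.client("s3")
# Upload a local archive to the key used in the question
s3.upload_file("/path/to/bar.zip", "my-bucket", "foo/bar.zip")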


2 Answers

S3 isn't really designed to allow this; normally you would have to download the file, process it and upload the extracted files.

However, there may be a few options:

  1. You could mount the S3 bucket as a local filesystem using s3fs and FUSE (see the s3fs-fuse project on GitHub). This still requires the files to be downloaded and uploaded, but it hides these operations behind a filesystem interface.

  2. If your main concern is to avoid downloading data out of AWS to your local machine, then of course you could download the data onto a remote EC2 instance and do the work there, with or without s3fs. This keeps the data within Amazon data centers.

  3. You may be able to perform remote operations on the files, without downloading them onto your local machine, using AWS Lambda.

You would need to create, package and upload a small program written in node.js to access, decompress and upload the files. This processing will take place on AWS infrastructure behind the scenes, so you won't need to download any files to your own machine. See the FAQs.

Finally, you need to find a way to trigger this code - typically, in Lambda, this would be triggered automatically by upload of the zip file to S3. If the file is already there, you may need to trigger it manually, via the invoke-async command provided by the AWS API. See the AWS Lambda walkthroughs and API docs.
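
One way to do that manual trigger from Python is an asynchronous invoke call, the current equivalent of invoke-async. A minimal sketch with boto3 (the function name and payload here are hypothetical, not part of the original answer):

import json
import boto3

lambda_client = boto3.client("lambda")

# InvocationType="Event" queues the call asynchronously, like the old invoke-async
lambda_client.invoke(
    FunctionName="unzip-to-s3",  # hypothetical function name
    InvocationType="Event",
    Payload=json.dumps({"bucket": "my-bucket", "key": "foo/bar.zip"}).encode("utf-8"),
)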

However, this is quite an elaborate way of avoiding downloads, and probably only worth it if you need to process large numbers of zip files! Note also that (as of Oct 2018) Lambda functions are limited to a maximum duration of 15 minutes (the default timeout is 3 seconds), so they may run out of time if your files are extremely large - but since scratch space in /tmp is limited to 500MB, your file size is also limited.

DNA answered Oct 12 '22 23:10


If keeping the data in AWS is the goal, you can use AWS Lambda to:

  1. Connect to S3 (I connect the Lambda function via a trigger from S3)
  2. Copy the data from S3
  3. Open the archive and decompress it (No need to write to disk)
  4. Do something with the data

If the function is initiated via a trigger, Lambda will suggest that you place the contents in a separate S3 location to avoid looping by accident. To open the archive, process it, and then return the contents you can do something like the following.

import csv
import json
import urllib.parse
import boto3
from zipfile import ZipFile
import io

s3 = boto3.client("s3")

def extract_zip(input_zip):
    # Buffer the StreamingBody in memory, open it as a zip archive, and
    # return {member name: member bytes} for every file in the archive
    contents = input_zip.read()
    archive = ZipFile(io.BytesIO(contents))
    return {name: archive.read(name) for name in archive.namelist()}
    
def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    # Get the object from the event and show its content type
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(
        event["Records"][0]["s3"]["object"]["key"], encoding="utf-8"
    )
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        # This example assumes the file to process shares the archive's name
        file_name = key.split(".")[0] + ".csv"
        print(f"Attempting to open {key} and read {file_name}")
        print("CONTENT TYPE: " + response["ContentType"])
        data = []
        contents = extract_zip(response["Body"])
        for k, v in contents.items():
            print(v)
            reader = csv.reader(io.StringIO(v.decode('utf-8')), delimiter=',')
            for row in reader:
                data.append(row)
        return {
            "statusCode": 200,
            "body": data
        }

    except Exception as e:
        print(e)
        print(
            "Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.".format(
                key, bucket
            )
        )
        raise e

The code above accesses the file contents through response["Body"], where response is the result of the get_object call made for the object named in the S3 trigger event. The response body is an instance of StreamingBody, a file-like object with a few convenience methods. Use the read() method, passing an amt argument if you are processing large files or files of unknown size.

Working on an archive in memory requires a few extra steps. You need to wrap the raw bytes in a BytesIO object and open that with the standard library's ZipFile (see the zipfile documentation). Once the data is in a ZipFile, you can call read() on each member. What to do from there depends on your specific use case: if the archive contains more than one file, you will need logic for handling each one. My example assumes you have one or a few small CSV files to process and returns a dictionary with the file name as the key and the value set to the file contents.
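
If the archive is large, a sketch of buffering the body in chunks via read(amt) instead of a single read() call might look like this (the chunk size is arbitrary, and the archive still has to fit in memory once buffered):

import io
from zipfile import ZipFile

def buffer_body(body, chunk_size=1024 * 1024):
    # Read the StreamingBody one chunk at a time into an in-memory buffer
    buf = io.BytesIO()
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        buf.write(chunk)
    buf.seek(0)
    return buf

# e.g. archive = ZipFile(buffer_body(response["Body"]))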

I have included the next step of reading the CSV files and returning the data and a status code 200 in the response. Keep in mind, your needs may be different. This example wraps the data in a StringIO object and uses a CSV reader to handle the data. Once the result is passed via the response, the Lambda function can hand off the processing to another AWS process.
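
Since the original question asks to place the extracted files back under /foo in the bucket rather than return them, one possible variation (not part of the original answer) is to write each extracted member back to S3 from inside the handler:

# Inside lambda_handler, after calling extract_zip()
for name, file_bytes in contents.items():
    # Write each archive member under the foo/ prefix mentioned in the question
    s3.put_object(Bucket=bucket, Key=f"foo/{name}", Body=file_bytes)

As noted above, if the trigger watches the same prefix, write to a different prefix or bucket so the function does not re-trigger on its own output.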

Nathan answered Oct 13 '22 01:10