I have AWS Config sending snapshots of my AWS environment to an S3 bucket every 12 hours. The snapshots are JSON files stored in .json.gz format that describe the entire environment. On object creation in the bucket, a Lambda function is triggered to read that file. My plan is to read the JSON in the function, parse the data to create reports describing certain elements of the environment, and push those reports to another S3 bucket.
My current code is:
data = s3.get_object(Bucket=bucket, Key=key)
text = data['Body'].read().decode('utf-8')
json_data = json.loads(text)
The error I am currently getting is: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
My guess is that this error means certain bytes in data['Body'] are not valid UTF-8, presumably because the file is still compressed. Clearly I can't decode it as standard UTF-8, so I would like to unzip the .gz file first. Is there a way to do this? I have already looked into the zipfile module, but I couldn't find anything that covers my use case. Thanks.
You're correct: the body can't be decoded as text because it is still gzip-compressed. Byte 0x8b in position 1 is the second byte of the gzip magic number (0x1f 0x8b), which is exactly what the UTF-8 decoder is tripping over. You'll want something like:
import gzip
import io
import json
from urllib.parse import unquote_plus

import boto3

# create the client once, outside the handler, so it is reused across invocations
s3client = boto3.client('s3')

def handler_name(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # object keys in S3 event notifications arrive URL-encoded
        key = unquote_plus(record['s3']['object']['key'])
        response = s3client.get_object(Bucket=bucket, Key=key)
        content = response['Body'].read()
        # decompress the gzipped bytes in memory, then parse the JSON
        with gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb') as fh:
            snapshot = json.load(fh)
You can then work with the parsed snapshot dictionary however you need.
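As a side note, since Python 3.2 you can skip the file-object wrapper entirely and use gzip.decompress, which decompresses the raw bytes in one call (and json.loads accepts bytes on the Python 3 Lambda runtimes):

import gzip
import json

# equivalent to the GzipFile approach above
snapshot = json.loads(gzip.decompress(content))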
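To round out the pipeline you described, you can then build a report from the parsed snapshot and push it to your destination bucket with put_object. Here is a minimal sketch; build_report and REPORT_BUCKET are hypothetical placeholders, and it assumes the snapshot follows the usual AWS Config layout with a top-level configurationItems list:

import json

REPORT_BUCKET = 'my-report-bucket'  # hypothetical destination bucket name

def build_report(snapshot):
    # hypothetical example report: count resources by type
    counts = {}
    for item in snapshot.get('configurationItems', []):
        resource_type = item.get('resourceType', 'unknown')
        counts[resource_type] = counts.get(resource_type, 0) + 1
    return counts

# inside the handler, after parsing the snapshot:
report = build_report(snapshot)
s3client.put_object(
    Bucket=REPORT_BUCKET,
    Key=key.replace('.json.gz', '-report.json'),
    Body=json.dumps(report).encode('utf-8'),
    ContentType='application/json',
)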