
Reading contents of a gzip file from AWS S3 in Python

I am trying to read some logs from a Hadoop process that I run in AWS. The logs are stored in an S3 folder and have the following path.

bucketname = name
key = y/z/stderr.gz

Here y is the cluster id and z is a folder name. Both of these act as folders (objects) in AWS, so the full path looks like x/y/z/stderr.gz.

Now I want to unzip this .gz file and read its contents. I don't want to download the file to my system; I want to save its contents in a Python variable.

This is what I have tried till now.

bucket_name = "name"
key = "y/z/stderr.gz"
obj = s3.Object(bucket_name, key)
n = obj.get()['Body'].read()

This is giving me a format which is not readable. I also tried

n = obj.get()['Body'].read().decode('utf-8') 

which gives the error 'utf8' codec can't decode byte 0x8b in position 1: invalid start byte.
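A minimal sketch of why the decode fails, assuming the object really is gzip data: every gzip stream begins with the magic bytes 0x1f 0x8b, and 0x8b is not a valid UTF-8 start byte, which matches the "position 1" in the message above. The `raw` bytes here are built locally as a stand-in for what `obj.get()['Body'].read()` returns, so the example runs without AWS access.

```python
import gzip

# Stand-in for the bytes returned by obj.get()['Body'].read();
# built locally so the sketch runs without AWS access.
raw = gzip.compress(b"sample stderr line\n")

# Every gzip stream starts with the two magic bytes 0x1f 0x8b.
is_gzip = raw[:2] == b"\x1f\x8b"

# Decoding the compressed bytes directly as UTF-8 fails on the second byte.
try:
    raw.decode("utf-8")
    decode_failed = False
    failing_position = None
except UnicodeDecodeError as exc:
    decode_failed = True
    failing_position = exc.start  # index of the offending byte

print(is_gzip, decode_failed, failing_position)
```

The bytes have to be decompressed first and only then decoded as text.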

I have also tried

gzip = StringIO(obj)
gzipfile = gzip.GzipFile(fileobj=gzip)
content = gzipfile.read()

This returns the error IOError: Not a gzipped file.

Not sure how to decode this .gz file.

Edit: Found a solution. I needed to pass n (the bytes read from the body) instead of obj, and use BytesIO instead of StringIO.

gzip = BytesIO(n) 
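Put together, the fix might look like the sketch below. The helper name `read_gzip_bytes` is hypothetical, and the buffer is not assigned to a variable called `gzip`, since that name would shadow the `gzip` module. The sample `n` is compressed locally as a stand-in for the bytes read from S3.

```python
import gzip
from io import BytesIO


def read_gzip_bytes(n, encoding="utf-8"):
    """Decompress gzip bytes held in memory and return them as text."""
    # Wrap the raw bytes in a file-like object so GzipFile can read them.
    with gzip.GzipFile(fileobj=BytesIO(n)) as gzipfile:
        return gzipfile.read().decode(encoding)


# Stand-in for n = obj.get()['Body'].read(); built locally for the example.
n = gzip.compress(b"line one\nline two\n")
content = read_gzip_bytes(n)
print(content)
```

For bytes that are already fully in memory, `gzip.decompress(n)` from the standard library should give the same result without the BytesIO wrapper.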
Kshitij Marwah asked Dec 15 '16 09:12



1 Answer

This is old, but you no longer need the intermediate BytesIO object (at least with boto3==1.9.223 and Python 3.7).

import boto3
import gzip

s3 = boto3.resource("s3")
obj = s3.Object("YOUR_BUCKET_NAME", "path/to/your_key.gz")
with gzip.GzipFile(fileobj=obj.get()["Body"]) as gzipfile:
    content = gzipfile.read()
print(content)
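Note that `content` above is a bytes object, so it still needs `.decode("utf-8")` to become text. For large logs, a possible variation is to read the stream line by line instead of loading everything at once. This is only a sketch: the helper name `iter_gzip_lines` is hypothetical, and an in-memory `io.BytesIO` stands in for the S3 body so the example runs without AWS access. It assumes the real body can be passed the same way, as in the snippet above.

```python
import gzip
import io


def iter_gzip_lines(body, encoding="utf-8"):
    """Yield decoded lines from a gzip-compressed, file-like body."""
    # GzipFile decompresses as it reads; TextIOWrapper decodes bytes to str.
    with io.TextIOWrapper(gzip.GzipFile(fileobj=body), encoding=encoding) as text:
        for line in text:
            yield line.rstrip("\n")


# Stand-in for obj.get()["Body"]; any object with a read() method is assumed to work.
fake_body = io.BytesIO(gzip.compress(b"first\nsecond\n"))
lines = list(iter_gzip_lines(fake_body))
print(lines)
```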
Kirk answered Oct 02 '22 17:10
