
Write pandas dataframe as compressed CSV directly to Amazon s3 bucket?

I currently have a script that reads the existing version of a CSV saved to S3, combines it with the new rows from the pandas DataFrame, and then writes the result directly back to S3.

    try:
        # Pull down the previous CSV contents (if the object exists) and decode to text
        csv_prev_content = str(s3_resource.Object('bucket-name', ticker_csv_file_name).get()['Body'].read(), 'utf8')
    except s3_resource.meta.client.exceptions.NoSuchKey:
        # No previous file yet, so start from an empty string
        csv_prev_content = ''

    # Append the new rows; header=False avoids repeating the header mid-file
    csv_output = csv_prev_content + curr_df.to_csv(path_or_buf=None, header=False)
    s3_resource.Object('bucket-name', ticker_csv_file_name).put(Body=csv_output)

Is there a way to do this with a gzip-compressed CSV? I want to read an existing .gz-compressed CSV on S3 if there is one, concatenate it with the contents of the DataFrame, and then overwrite the .gz with the new combined compressed CSV directly in S3, without making a local copy.
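For the read side, I assume something like this would work, going through an in-memory buffer (the bucket and key names here are placeholders):

    import gzip
    from io import BytesIO

    import boto3
    import pandas as pd

    s3_resource = boto3.resource('s3')

    try:
        # Download the compressed object into memory and gunzip it on the fly
        gz_bytes = s3_resource.Object('bucket-name', 'ticker.csv.gz').get()['Body'].read()
        with gzip.GzipFile(fileobj=BytesIO(gz_bytes)) as gz:
            prev_df = pd.read_csv(gz)
    except s3_resource.meta.client.exceptions.NoSuchKey:
        prev_df = pd.DataFrame()  # no previous file yet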

asked May 02 '17 by rosstripi


2 Answers

Here's a solution in Python 3.5.2 using Pandas 0.20.1.

The source DataFrame can be read from S3, a local CSV, or anywhere else.

    import boto3
    import gzip
    import pandas as pd
    from io import BytesIO, TextIOWrapper

    df = pd.read_csv('s3://ramey/test.csv')

    # Compress the CSV into an in-memory buffer instead of a local file
    gz_buffer = BytesIO()
    with gzip.GzipFile(mode='w', fileobj=gz_buffer) as gz_file:
        # TextIOWrapper encodes the text that to_csv writes as UTF-8 bytes
        df.to_csv(TextIOWrapper(gz_file, 'utf8'), index=False)

    # Upload the compressed bytes straight to S3
    s3_resource = boto3.resource('s3')
    s3_object = s3_resource.Object('ramey', 'new-file.csv.gz')
    s3_object.put(Body=gz_buffer.getvalue())
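To verify the upload, the file can be read straight back; this assumes s3fs is installed, which the read_csv call above already requires:

    # Round-trip check: pandas gunzips transparently given the compression hint
    df_check = pd.read_csv('s3://ramey/new-file.csv.gz', compression='gzip')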
answered Sep 22 '22 by ramhiser


There is a more elegant solution using smart-open (https://pypi.org/project/smart-open/):

    import pandas as pd
    from smart_open import open

    # smart_open infers gzip compression from the .gz extension and streams to S3;
    # the with-block makes sure the upload is flushed and completed
    with open('s3://bucket/prefix/filename.csv.gz', 'w') as fout:
        df.to_csv(fout, index=False)
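The same API also covers the full read-concatenate-overwrite cycle from the question. A sketch, assuming curr_df holds the new rows and the bucket/key are placeholders; the existence check goes through boto3's head_object:

    import boto3
    import pandas as pd
    from smart_open import open

    bucket, key = 'bucket', 'prefix/filename.csv.gz'  # placeholder names
    url = 's3://' + bucket + '/' + key

    s3 = boto3.client('s3')
    try:
        s3.head_object(Bucket=bucket, Key=key)  # does a previous version exist?
        with open(url, 'r') as fin:  # smart_open gunzips transparently on read
            combined = pd.concat([pd.read_csv(fin), curr_df], ignore_index=True)
    except s3.exceptions.ClientError:
        combined = curr_df  # first write, nothing to append to

    with open(url, 'w') as fout:  # and gzips transparently on write
        combined.to_csv(fout, index=False)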
answered Sep 24 '22 by Alexander Lobkovsky Meitiv