Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Save pandas data frame as csv on to gcloud storage bucket

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys



APP_NAME = "DataFrameToCSV"

spark = SparkSession\
    .builder\
    .appName(APP_NAME)\
    .config("spark.sql.crossJoin.enabled","true")\
    .getOrCreate()

group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]

dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"]

#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0]
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0]

dataFrameArr = np.column_stack((group_ids,dates,event))

df = pd.DataFrame(dataFrameArr,columns = ["group_ids","dates","event"])

The above python code is to be run on a spark cluster on gcloud dataproc. I would like to save the pandas dataframe as csv file in gcloud storage bucket at gs://mybucket/csv_data/

How do I do this?

like image 841
Srikant Chari Avatar asked Jan 04 '23 15:01

Srikant Chari


2 Answers

You can also use this solution with Dask. You can convert your DataFrame to Dask DataFrame, which can be written to csv on Cloud Storage

import dask.dataframe as dd
import pandas
df # your Pandas DataFrame
ddf = dd.from_pandas(df,npartitions=1, sort=True)
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False,  
                               storage_options={'token': gcs.session.credentials}) 

storage_options argument is optional

like image 121
Porada Kev Avatar answered Jan 06 '23 05:01

Porada Kev


So, I figured out how to do this. Continuing on from the above code, here is the solution:

sc = SparkContext.getOrCreate()  

from pyspark.sql import SQLContext
sqlCtx = SQLContext(sc)
sparkDf = sqlCtx.createDataFrame(df)    
sparkDf.coalesce(1).write.option("header","true").csv('gs://mybucket/csv_data')
like image 28
Srikant Chari Avatar answered Jan 06 '23 06:01

Srikant Chari