from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
import gc
import pandas as pd
import datetime
import numpy as np
import sys
APP_NAME = "DataFrameToCSV"
spark = SparkSession\
.builder\
.appName(APP_NAME)\
.config("spark.sql.crossJoin.enabled","true")\
.getOrCreate()
# Sample data: a group id, a date, and a binary event flag per row
group_ids = [1,1,1,1,1,1,1,2,2,2,2,2,2,2]
dates = ["2016-04-01","2016-04-01","2016-04-01","2016-04-20","2016-04-20","2016-04-28","2016-04-28","2016-04-05","2016-04-05","2016-04-05","2016-04-05","2016-04-20","2016-04-20","2016-04-29"]
#event = [0,1,0,0,0,0,1,1,0,0,0,0,1,0]
event = [0,1,1,0,1,0,1,0,0,1,0,0,0,0]
# np.column_stack coerces everything to a common dtype, so all columns end up as strings
dataFrameArr = np.column_stack((group_ids, dates, event))
df = pd.DataFrame(dataFrameArr, columns=["group_ids", "dates", "event"])
The above Python code is to be run on a Spark cluster on Google Cloud Dataproc. I would like to save the pandas DataFrame as a CSV file in a Google Cloud Storage bucket at gs://mybucket/csv_data/
How do I do this?
You can also solve this with Dask: convert your pandas DataFrame to a Dask DataFrame, which can be written directly to CSV on Cloud Storage.
import dask.dataframe as dd

# df is the pandas DataFrame built above
ddf = dd.from_pandas(df, npartitions=1, sort=True)

# Writes one CSV file per partition to the bucket
ddf.to_csv('gs://YOUR_BUCKET/ddf-*.csv', index=False, sep=',', header=False)
The storage_options argument is optional; it is only needed when gcsfs cannot pick up credentials from the environment (on Dataproc the cluster's default service account is usually used automatically).
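If you do need to pass credentials explicitly, a minimal sketch is shown below, assuming gcsfs is installed on the cluster; the service-account key path is a placeholder you would replace with your own:

# Sketch: explicit GCS credentials via storage_options (the key-file path is hypothetical)
ddf.to_csv(
    'gs://YOUR_BUCKET/ddf-*.csv',
    index=False,
    header=False,
    storage_options={'token': '/path/to/service-account.json'},
)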
So, I figured out how to do this. Continuing from the code above, here is the solution:
# The SparkSession created above can be used directly to convert the pandas DataFrame
sparkDf = spark.createDataFrame(df)
# coalesce(1) writes a single part file under gs://mybucket/csv_data/
sparkDf.coalesce(1).write.option("header", "true").csv('gs://mybucket/csv_data')
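Note that Spark writes a directory, not a single named file: coalesce(1) produces one part-*.csv file inside gs://mybucket/csv_data/. As a quick check, you can read the result back with the same session; this sketch assumes the bucket path and header option used above:

# Sketch: read the written CSV back from the bucket to verify the output
check = spark.read.option("header", "true").csv('gs://mybucket/csv_data')
check.show()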