PySpark: partition data by a column and write Parquet

I need to write Parquet files to separate S3 keys based on the values in a column. The city column has thousands of distinct values. Iterating with a for loop, filtering the DataFrame by each value, and then writing Parquet is very slow. Is there a way to partition the DataFrame by the city column and write the Parquet files in one pass?

What I am currently doing:

# Current approach: one filter and one write per city, which scans the DataFrame repeatedly
for city in cities:
    print(city)
    spark_df.filter(spark_df.city == city) \
        .write.mode('overwrite') \
        .parquet(f'reporting/date={date_string}/city={city}')
Asked Oct 19 '25 by sjishan

1 Answer

The partitionBy function solves the issue:

spark_df.write.partitionBy('date', 'city').parquet('reporting')
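For context, here is a minimal sketch of how this could replace the loop from the question. It assumes the date is added as a literal column named date using the question's date_string variable; that column name and the withColumn/lit step are illustrative, not part of the original answer.

from pyspark.sql import functions as F

# Add the date as a constant column so Spark can use it as a partition key
# (date_string is the variable from the question's original loop).
spark_df_with_date = spark_df.withColumn('date', F.lit(date_string))

# Spark writes one subdirectory per (date, city) combination, e.g.
# reporting/date=2025-10-19/city=London/part-*.parquet
spark_df_with_date.write \
    .mode('overwrite') \
    .partitionBy('date', 'city') \
    .parquet('reporting')

Note that mode('overwrite') with partitionBy replaces the entire reporting path by default; if only the affected partitions should be overwritten, Spark 2.3+ supports setting spark.sql.sources.partitionOverwriteMode to dynamic.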
Answered Oct 21 '25 by sjishan

