I need to write parquet files to separate S3 keys based on the values in a column. The city column has thousands of distinct values. Iterating with a for loop, filtering the DataFrame by each value, and writing parquet each time is very slow. Is there a way to partition the DataFrame by the city column and write the parquet files in a single pass?
What I am currently doing -
for city in cities:
    print(city)
    spark_df.filter(spark_df.city == city).write.mode('overwrite').parquet(f'reporting/date={date_string}/city={city}')
The partitionBy function solves the issue:
spark_df.write.partitionBy('date', 'city').parquet('reporting')
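This runs as a single write job and produces one subdirectory per (date, city) combination under reporting/. Note that partitionBy expects 'date' and 'city' to be columns of the DataFrame; in your loop the date came from a Python variable, so you may need to add it as a column first. A minimal sketch, assuming spark_df, date_string, and the 'reporting' path from the question:

from pyspark.sql import functions as F

# Add the date as a literal column so it can be used as a partition key
# (assumption: date_string is a plain string such as '2024-01-01').
partitioned_df = spark_df.withColumn('date', F.lit(date_string))

# One write job; Spark splits the output into one folder per partition value,
# e.g. reporting/date=.../city=.../part-*.parquet
(partitioned_df
    .write
    .mode('overwrite')
    .partitionBy('date', 'city')
    .parquet('reporting'))

One caveat: with mode('overwrite'), Spark by default replaces the entire 'reporting' directory. If you only want to overwrite the partitions being written, set spark.sql.sources.partitionOverwriteMode to dynamic (Spark 2.3+).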