I'm trying to write a very large PySpark DataFrame, following the advice I see in https://databricks.com/blog/2018/07/31/processing-petabytes-of-data-in-seconds-with-databricks-delta.html
However, that blog post shows its advice in Scala, which I don't know how to translate to PySpark. I see Scala code like this:
    spark.read.table(connRandom)
      .write.format("delta").saveAsTable(connZorder)

    sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)")
but how can I do the equivalent of the second statement, say to z-order cluster on a specific column "my_col", in PySpark?
The second statement is a SQL command issued from Scala. You can do the same in Python with spark.sql("OPTIMIZE tableName ZORDER BY (my_col)").
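For example, a minimal end-to-end sketch of the PySpark equivalent of the Scala snippet might look like this (the table names my_source_table and my_zorder_table are placeholders, and this assumes you are running on Databricks or another environment where the Delta OPTIMIZE command is available):

    # Read the source table and rewrite it as a Delta table
    spark.read.table("my_source_table") \
        .write.format("delta") \
        .saveAsTable("my_zorder_table")

    # OPTIMIZE ... ZORDER BY is a SQL command, so issue it through spark.sql()
    spark.sql("OPTIMIZE my_zorder_table ZORDER BY (my_col)")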
Also take a look at the Delta Lake documentation; it has a full notebook example for PySpark.
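If you are on a recent open-source Delta Lake release (2.0+, the delta-spark package), there is also a Python API for this, so you don't have to go through SQL at all. A sketch, using the same placeholder names as above:

    from delta.tables import DeltaTable

    # Z-order the table on my_col via the Python API (Delta Lake 2.0+)
    DeltaTable.forName(spark, "my_zorder_table") \
        .optimize() \
        .executeZOrderBy("my_col")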