I am exploring Spark's behavior when joining a table to itself. I am using Databricks.
My dummy scenario is:
Read an external table as dataframe A (the underlying files are in Delta format)
Define dataframe B as dataframe A with only certain columns selected
Join dataframes A and B on column1 and column2
(Yes, it doesn't make much sense, I'm just experimenting to understand Spark's underlying mechanics)
from pyspark.sql.functions import col, concat, lit, lower

a = spark.read.table("table") \
    .select("column1", "column2", "column3", "column4", "column5") \
    .withColumn("columnA", lower(concat(col("column4"), lit("_"), col("column5"))))

b = a.select("column1", "column2", "columnA")

c = a.join(b, how="left", on=["column1", "column2"])
My first attempt was to run the code as-is (attempt 1). I then tried to repartition and cache (attempt 2):
a = spark.read.table("table") \
    .select("column1", "column2", "column3", "column4", "column5") \
    .withColumn("columnA", lower(concat(col("column4"), lit("_"), col("column5")))) \
    .repartition(col("column1"), col("column2")) \
    .cache()
Finally, I repartitioned, sorted within partitions, and cached (attempt 3):
a = spark.read.table("table") \
    .select("column1", "column2", "column3", "column4", "column5") \
    .withColumn("columnA", lower(concat(col("column4"), lit("_"), col("column5")))) \
    .repartition(col("column1"), col("column2")) \
    .sortWithinPartitions(col("column1"), col("column2")) \
    .cache()
The DAGs generated for each attempt are attached.
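For reference, the same information can be read from the query plan without the UI screenshots; this is just the standard DataFrame API applied to the c dataframe from attempt 1.

c.explain(True)           # prints the parsed, analyzed, optimized and physical plans
c.explain("formatted")    # Spark 3.x: a more readable, node-per-line physical plan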
My questions are:
1. Why, in attempt 1, does the table appear to be cached even though caching was not explicitly requested?
2. Why is an InMemoryTableScan node always followed by another node of the same type?
3. Why, in attempt 3, does caching appear to take place in two stages?
4. Why, in attempt 3, does WholeStageCodegen follow one (and only one) InMemoryTableScan?
By caching you create something like a checkpoint inside your Spark application: if any task fails further down the execution, Spark can rebuild the lost RDD partitions from the cache instead of recomputing them from the original source.
Remember to call unpersist() once the cached data is no longer needed. If the caching layer fills up, Spark starts evicting data from memory using an LRU (least recently used) strategy, so it is good practice to unpersist explicitly and stay in control of what gets evicted.
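A minimal sketch of that cache/unpersist lifecycle (the DataFrame below is synthetic, not from your table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(1_000_000).withColumnRenamed("id", "column1")

df.cache()                           # lazily marks the data for caching
df.count()                           # the first action materializes the cache
df.groupBy().sum("column1").show()   # reuses the cached partitions

df.unpersist()                       # release the memory yourself instead of waiting for LRU eviction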
If you have to perform an operation that requires a shuffle before the join, such as aggregateByKey or reduceByKey, you can avoid a second shuffle for the join itself by passing a hash partitioner with the same number of partitions as an explicit argument to that first operation, as sketched below.
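A minimal sketch of that idea at the RDD level (the RDDs, keys and partition count below are made up for illustration):

from pyspark.sql import SparkSession
from pyspark.rdd import portable_hash

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
num_parts = 8

events = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
lookup = sc.parallelize([("a", "x"), ("b", "y")]).partitionBy(num_parts, portable_hash)

# Passing the partition count (and hash partitioner) to reduceByKey means its output
# is already hash-partitioned, so the join below reuses that partitioning instead of
# shuffling this side a second time.
totals = events.reduceByKey(lambda x, y: x + y, numPartitions=num_parts, partitionFunc=portable_hash)

joined = totals.join(lookup, numPartitions=num_parts)
print(joined.collect())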
What you are observing in these three plans is a mixture of Databricks Runtime behaviour and plain Spark behaviour.
First of all, when running Databricks Runtime 3.3+, caching is automatically enabled for all Parquet files.
The corresponding config for that is:
spark.databricks.io.cache.enabled true
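You can check or toggle it from a notebook with the standard conf API; nothing here is Databricks-specific beyond the key itself, and the get falls back to a default in case the key is not set on your cluster:

spark.conf.get("spark.databricks.io.cache.enabled", "false")   # inspect the current setting
spark.conf.set("spark.databricks.io.cache.enabled", "true")    # enable the IO cache for this session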
For your second question, InMemoryTableScan appears twice because, when the join was called, Spark tried to compute Dataset A and Dataset B in parallel. Assuming different executors were assigned those tasks, both had to scan the table from the (Databricks) cache.
For the third one, InMemoryTableScan does not by itself mean caching is happening at that point. It just means that whatever plan Catalyst produced involves scanning the cached table multiple times.
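If you want to confirm what Spark itself considers cached, independent of how the plan nodes are named, you can ask it directly; a small sketch, reusing the a dataframe and the table name from the question:

print(a.storageLevel)                    # StorageLevel with all flags False unless a was explicitly cached
print(spark.catalog.isCached("table"))   # True only if the table was cached via CACHE TABLE / spark.catalog.cacheTable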
PS: I can't visualize point 4 :)