I am new to Spark and trying to understand how to deal with skewed data. I have created two tables, employee and department; the employee table has skewed data for one of the departments.
One solution is to broadcast the department table, and that works fine. But I want to understand how I could use the salting technique in the code below to improve performance.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = (
    SparkSession.builder.appName("skewTestSpark")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

# employee is skewed towards one department
df1 = spark.sql("select * from spark.employee")
df2 = spark.sql("select id as dept_id, name as dept_name from spark.department")

# plain shuffle join on the skewed key
res = df1.join(df2, df1.department == df2.dept_id)
res.write.parquet("hdfs://<host>:<port>/user/result/employee")
Distribution for the above code:

It's unlikely that an employee table - even with skew - will cause a Spark bottleneck; in that sense the example is flawed. Think of genuinely large JOINs, not something that fits into the broadcast-join category.
Salting: with "salting" on a SQL join, grouping, or similar operation, the key is modified (typically by appending a random suffix) so that the data is redistributed evenly, and the processing time for any given partition is roughly the same.
An excellent example for a JOIN is here: https://dzone.com/articles/why-your-spark-apps-are-slow-or-failing-part-ii-da
Another good read I would recommend is here: https://godatadriven.com/blog/b-efficient-large-spark-optimisation/
I could explain it all here, but the first link explains it well enough. Some experimentation is needed to get a good key distribution.