I am new to Spark and trying to understand how to deal with skewed data. I have created two tables, employee and department; the employee table has skewed data for one of the departments.
One solution is to broadcast the department table, and that works fine. But I want to understand how I could use the salting technique in the code below to improve performance.
from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = (
    SparkSession.builder.appName("skewTestSpark")
    .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
    .enableHiveSupport()
    .getOrCreate()
)

# employee is skewed towards one department
df1 = spark.sql("select * from spark.employee")
df2 = spark.sql("select id as dept_id, name as dept_name from spark.department")

# plain shuffle join on the skewed key
res = df1.join(df2, df1.department == df2.dept_id)
res.write.parquet("hdfs://<host>:<port>/user/result/employee")
Distribution for the above code:

It's unlikely that an employee table - even with skew - will cause a Spark bottleneck; in that sense the example is flawed. Think of genuinely large JOINs, not something that fits into the broadcast-join category.
Salting: with "salting" on a SQL join, grouping, or similar operation, the key is modified (typically by appending a random suffix) so that the data is redistributed evenly, and the processing time for any given partition is roughly the same.
An excellent example for a JOIN is here: https://dzone.com/articles/why-your-spark-apps-are-slow-or-failing-part-ii-da
Another good read I would recommend is here: https://godatadriven.com/blog/b-efficient-large-spark-optimisation/
I could explain it all here, but the first link explains it well enough. Some experimentation is needed to get a good key distribution.