Spark: How does salting work in dealing with skewed data?

I have skewed data in a table that is joined with another, smaller table. I understand how salting works for joins: a random number from a fixed range is appended to the keys of the big, skewed table, and the rows of the small, non-skewed table are duplicated once for each value in that range. The match still happens because, for any salted key in the skewed table, exactly one of the duplicated rows in the small table carries the same salt (e.g. key 1 in the big table becomes one of 1_0 … 1_(n-1), while the small table's row for key 1 is repeated with every salt from 0 to n-1). I also read that salting is helpful when performing a groupBy. My question is: when random numbers are appended to the key, doesn't that break the group? If it does, then the meaning of the group-by operation has changed.

Asked Sep 26 '19 by Bishamon Ten



2 Answers

My question is when random numbers are appended to the key doesn't it break the group?

Well, it does. To mitigate this, you can run the group-by operation twice: first with the salted key, then remove the salt and group again. The second grouping works on partially aggregated data, which significantly reduces the skew impact.

E.g.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// n is the number of salt buckets; groupByFields and aggFields stand in
// for your grouping columns and aggregate expressions.
df.withColumn("salt", (rand * n).cast(IntegerType))
  .groupBy("salt", groupByFields: _*)                   // pass 1: salted group-by
  .agg(aggFields)                                       // partial aggregates per bucket
  .groupBy(groupByFields.head, groupByFields.tail: _*)  // pass 2: drop the salt
  .agg(aggFields)                                       // merge the partial aggregates
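Note that the two-pass approach is only exact for aggregations that can be recomputed from partial results (sum, min, max; a count merges as a sum of partial counts), while an average must be carried through as a separate sum and count. As a minimal self-contained sketch of a salted count (the data, column names, and bucket count n here are invented for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

val spark = SparkSession.builder.appName("salted-groupby").master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical skewed input: most rows share the key "a".
val df = Seq("a", "a", "a", "a", "a", "a", "b", "c").toDF("key")

val n = 4 // number of salt buckets; tune to the degree of skew

val counts = df
  .withColumn("salt", (rand * n).cast(IntegerType))
  .groupBy("salt", "key")
  .agg(count("*").as("partial_count"))    // pass 1: partial count per (salt, key)
  .groupBy("key")
  .agg(sum("partial_count").as("count"))  // pass 2: drop the salt, merge partials

counts.show() // "a" gets count 6, exactly what an unsalted groupBy would return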
Answered Oct 21 '22 by Gelerion

A worked example of salting on the join side: a skewed fact table is joined with a small dimension table by salting the fact keys and exploding the dimension rows over the full salt range.

// Skewed fact table: key 1 appears in 8 of the 9 rows.
var df1 = Seq((1,"a"),(2,"b"),(1,"c"),(1,"x"),(1,"y"),(1,"g"),(1,"k"),(1,"u"),(1,"n")).toDF("ID","NAME")

df1.createOrReplaceTempView("fact")

// Small dimension table with no skew.
var df2 = Seq((1,10),(2,30),(3,40)).toDF("ID","SALARY")

df2.createOrReplaceTempView("dim")

// Salt the fact side: append a random number in [0, 18] to every key, so the
// hot key 1 is spread across up to 19 distinct join keys.
val salted_df1 = spark.sql("""select concat(ID, '_', FLOOR(RAND(123456)*19)) as salted_key, NAME from fact""")

salted_df1.createOrReplaceTempView("salted_fact")

// Explode the dimension side: duplicate every row once per possible salt value
// (0 to 18, matching FLOOR(RAND*19)), so each salted fact key finds its match.
val exploded_dim_df = spark.sql("""select ID, SALARY, explode(array(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)) as salted_key from dim""")

// Equivalent, generating the salt range instead of spelling it out:
// val exploded_dim_df = spark.sql("""select ID, SALARY, explode(sequence(0, 18)) as salted_key from dim""")

exploded_dim_df.createOrReplaceTempView("salted_dim")

// Join on the salted key, then strip the salt to recover the original ID.
val result_df = spark.sql("""select split(fact.salted_key, '_')[0] as ID, dim.SALARY
            from salted_fact fact
            LEFT JOIN salted_dim dim
            ON fact.salted_key = concat(dim.ID, '_', dim.salted_key)""")
display(result_df)
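With this data, result_df comes back with eight (1, 10) rows and one (2, 30) row — the same answer an unsalted join would give, but with the hot key spread over many salted join keys. One detail to watch: split(fact.salted_key, '_')[0] returns ID as a string, so cast it back if downstream code expects the original integer type.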
Answered Oct 21 '22 by Shiva