I am joining two big datasets using Spark RDDs. One dataset is heavily skewed, so a few of the executor tasks take a long time to finish. How can I solve this?


Skewed dataset join in Spark?

1 Answer

Pretty good article on how it can be done: https://datarus.wordpress.com/2015/05/04/fighting-the-skew-in-spark/

Short version:

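- Add a random element (a salt) to each key of the large RDD and use the combination as the new join key.
- Replicate each entry of the small RDD once per possible salt value using explode/flatMap, and build the same kind of new join key, so every salted key in the large RDD still finds its match.
- Join the RDDs on the new join key, which is now distributed more evenly because of the random salt.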
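
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the linked article: the helper names (salt_large, replicate_small, unsalt, salted_join) and the NUM_SALTS value are hypothetical, and it assumes both inputs are pair RDDs of (key, value) tuples. It only relies on the standard RDD methods map, flatMap and join.

```python
import random

# Hypothetical fan-out: how many salted variants each key is split into.
# Tune it to the degree of skew; it is not a Spark setting.
NUM_SALTS = 8


def salt_large(record, num_salts=NUM_SALTS, rng=random):
    """Large side: (key, value) -> ((key, salt), value) with a random salt."""
    key, value = record
    return ((key, rng.randrange(num_salts)), value)


def replicate_small(record, num_salts=NUM_SALTS):
    """Small side: (key, value) -> one copy per possible salt value."""
    key, value = record
    return [((key, salt), value) for salt in range(num_salts)]


def unsalt(record):
    """Joined record: ((key, salt), (left, right)) -> (key, (left, right))."""
    (key, _salt), pair = record
    return (key, pair)


def salted_join(large_rdd, small_rdd, num_salts=NUM_SALTS):
    """Join two pair RDDs on salted keys, then drop the salt again."""
    salted_large = large_rdd.map(lambda r: salt_large(r, num_salts))
    salted_small = small_rdd.flatMap(lambda r: replicate_small(r, num_salts))
    return salted_large.join(salted_small).map(unsalt)
```

With two existing pair RDDs, salted_join(large_rdd, small_rdd) should return the same pairs as large_rdd.join(small_rdd), but the rows of a hot key are spread over up to NUM_SALTS join keys instead of one. The cost is that the replicated side grows by a factor of NUM_SALTS, so the fan-out is a trade-off rather than a free win.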

145

answered Oct 14 '22 04:10

LiMuBei


Tags:

join

apache-spark

Raj Kumar
