Why is my BroadcastHashJoin slower than ShuffledHashJoin in Spark?

Tags: apache-spark, hadoop, hive

I am running a join through a JavaHiveContext in Spark.

The big table is 1.76 GB and has 100 million records.

The second table is 273 MB and has 10 million records.

I get a JavaSchemaRDD and I call count() on it:

String query="select attribute7,count(*) from ft,dt where ft.chiavedt=dt.chiavedt group by attribute7";

JavaSchemaRDD rdd=sqlContext.sql(query);

System.out.println("count="+rdd.count());

If I force a BroadcastHashJoin (SET spark.sql.autoBroadcastJoinThreshold=290000000) and use 5 executors on 5 nodes with 8 cores and 20 GB of memory, the query runs in 100 seconds. If I don't force the broadcast, it runs in 30 seconds.

N.B. The tables are stored as Parquet files.


asked Dec 07 '15 16:12

Fabio

1 Answer

Most likely the source of the problem is the cost of broadcasting. To keep things simple, let's assume you have 1800 MB in the larger RDD and 300 MB in the smaller one. Assuming 5 executors and no previous partitioning, a fifth of all the data should already be on the correct machine. That leaves ~1700 MB to shuffle in the case of a standard join.

For a broadcast join, the smaller RDD has to be transferred to all nodes, which means around 1500 MB of data to transfer. If you add the required communication with the driver, you have to move a comparable amount of data in a much more expensive way: broadcast data has to be collected first, and only after that can it be forwarded to all the workers.
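The back-of-the-envelope numbers above can be reproduced with a small sketch. It is only an illustration of the arithmetic, assuming rows are spread uniformly across executors; the class and method names are made up for this example and are not part of Spark, and the extra cost of collecting the small table on the driver is deliberately left out.

```java
// Rough estimate of data moved over the network for each join strategy.
// Assumption: data is uniformly distributed and not pre-partitioned.
class JoinTransferEstimate {

    // Shuffled join: 1/executors of all rows are already on the right
    // machine, so the remaining (executors - 1)/executors must be moved.
    static double shuffledJoinMb(double largeMb, double smallMb, int executors) {
        return (largeMb + smallMb) * (executors - 1) / executors;
    }

    // Broadcast join: a full copy of the small table goes to every executor.
    // The initial collect to the driver is not counted here.
    static double broadcastJoinMb(double smallMb, int executors) {
        return smallMb * executors;
    }

    public static void main(String[] args) {
        double largeMb = 1800;   // simplified size of the big table
        double smallMb = 300;    // simplified size of the small table
        int executors = 5;

        System.out.println("shuffle:   " + shuffledJoinMb(largeMb, smallMb, executors) + " MB");
        System.out.println("broadcast: " + broadcastJoinMb(smallMb, executors) + " MB");
    }
}
```

With these inputs the shuffle moves 1680 MB and the broadcast moves 1500 MB, so the two volumes are comparable and the broadcast gains little before its driver round trip is even considered.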

answered Oct 28 '22 04:10

zero323

