Trying to understand how Hive partitions relate to Spark partitions, culminating in a question about joins.
I have 2 external Hive tables, both backed by S3 buckets and partitioned by date, so in each bucket there are keys with the name format date=<yyyy-MM-dd>/<filename>.
Question 1:
If I read this data into Spark:
val table1 = spark.table("table1").as[Table1Row]
val table2 = spark.table("table2").as[Table2Row]
then how many partitions are the resultant datasets going to have respectively? Partitions equal to the number of objects in S3?
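(For reference, I know I can inspect the resulting counts with something like the snippet below; my question is about what determines them.)
println(table1.rdd.getNumPartitions)
println(table2.rdd.getNumPartitions)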
Question 2:
Suppose the two row types have the following schema:
Table1Row(date: Date, id: String, ...)
Table2Row(date: Date, id: String, ...)
and that I want to join table1 and table2 on the fields date and id:
table1.joinWith(table2,
table1("date") === table2("date") &&
table1("id") === table2("id")
)
Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join? And if so how?
Question 3:
Suppose now that I am using RDDs instead:
val rdd1 = table1.rdd
val rdd2 = table2.rdd
AFAIK, the syntax for the join using the RDD API would look something like:
rdd1.map(row1 => ((row1.date, row1.id), row1))
  .join(rdd2.map(row2 => ((row2.date, row2.id), row2)))
Again, is Spark going to be able to utilize the fact that the partition key in the Hive tables is being used in the join?
Note right away that Spark partitions ≠ Hive partitions. Both are chunks of data, but Spark partitions data so it can be processed in parallel in memory, whereas a Hive partition is a division of the data in storage, on disk, in the persisted layout.
Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data (a logical division of the data) stored on a node in the cluster. Partitions are the basic units of parallelism in Apache Spark; RDDs are collections of partitions.
Apache Spark supports two types of partitioning: hash partitioning and range partitioning.
By default, Spark/PySpark creates a number of partitions equal to the number of CPU cores in the machine. The data of each partition resides on a single machine, and Spark/PySpark creates a task for each partition. Shuffle operations move data from one partition to other partitions.
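As a rough illustration of the two strategies on the Dataset API (a sketch only: the partition count of 200 is arbitrary, and repartitionByRange requires Spark 2.3.0+):
// Hash partitioning: rows with the same (date, id) are hashed into the same Spark partition.
val hashed = table1.repartition(200, table1("date"), table1("id"))

// Range partitioning: rows are split into contiguous, sorted ranges of date.
val ranged = table1.repartitionByRange(200, table1("date"))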
then how many partitions are the resultant datasets going to have respectively? Partitions equal to the number of objects in S3?
Impossible to answer given the information you've provided. In recent versions the number of partitions depends primarily on spark.sql.files.maxPartitionBytes, although other factors can play some role as well.
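As a rough illustration (a sketch only: the 32 MB value is arbitrary, and the exact split planning also involves other settings), lowering this threshold increases the number of input partitions on the next read:
// Lower the maximum bytes packed into a single input partition (default is 128 MB).
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

// Re-reading the table now plans more, smaller input partitions.
println(spark.table("table1").rdd.getNumPartitions)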
Is Spark going to be able to utilize the fact that one of the fields being joined on is the partition key in the Hive tables to optimize the join?
Not as of today (Spark 2.3.0); however, Spark can utilize bucketing (DISTRIBUTE BY) to optimize joins. See How to define partitioning of DataFrame?. This might change in the future, once the Data Source API v2 stabilizes.
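A minimal sketch of the bucketing approach, assuming you are free to rewrite both tables as bucketed metastore tables (the bucket count of 64 and the *_bucketed table names are made up for illustration):
// Rewrite both tables bucketed (and sorted) by the join keys.
spark.table("table1").write
  .bucketBy(64, "date", "id")
  .sortBy("date", "id")
  .saveAsTable("table1_bucketed")

spark.table("table2").write
  .bucketBy(64, "date", "id")
  .sortBy("date", "id")
  .saveAsTable("table2_bucketed")

// A join on the bucketing columns can then skip the shuffle (exchange) on both sides.
val joined = spark.table("table1_bucketed")
  .join(spark.table("table2_bucketed"), Seq("date", "id"))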
Suppose now that I am using RDDs instead (...) Again, is Spark going to be able to utilise the fact that the partition key in the Hive tables is being used in the join?
Not at all. Even if the data is bucketed, RDD transformations and functional Dataset transformations are black boxes, so no such optimization can be, or is, applied here.
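The most you can do on the RDD side is co-partition both sides yourself, which lets the join reuse the partitioner instead of shuffling again, but this is manual work and knows nothing about the Hive partition layout (a sketch only: the partition count of 200 is arbitrary):
import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(200)

// Key both RDDs by (date, id) and pre-partition them with the same partitioner.
val keyed1 = rdd1.map(r => ((r.date, r.id), r)).partitionBy(partitioner)
val keyed2 = rdd2.map(r => ((r.date, r.id), r)).partitionBy(partitioner)

// Both sides share the partitioner, so the join itself does not trigger another shuffle.
val joined = keyed1.join(keyed2)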