
What are ThreadPoolExecutor jobs in the web UI's Spark Jobs?


I'm using Spark SQL 1.6.1 and am performing a few joins.

Looking at the Spark UI I see that there are some jobs with the description "run at ThreadPoolExecutor.java:1142".

Example of some of these jobs

I was wondering why some Spark jobs get that description?

Gideon asked Nov 28 '16




1 Answer

After some investigation I found out that the "run at ThreadPoolExecutor.java:1142" Spark jobs are related to queries with join operators that match the definition of BroadcastHashJoin, where one side of the join is broadcast to the executors.

The BroadcastHashJoin operator uses a thread pool for this asynchronous broadcasting (see this and this).

scala> spark.version
res16: String = 2.1.0-SNAPSHOT

scala> val left = spark.range(1)
left: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val right = spark.range(1)
right: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> left.join(right, Seq("id")).show
+---+
| id|
+---+
|  0|
+---+
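The description comes from Spark's call-site detection: the broadcast side is computed by a task submitted to a thread pool, so the bottom of that task's stack is ThreadPoolExecutor's worker loop rather than your own code. A minimal plain-JDK sketch of that mechanism (no Spark involved; the class and method names here are made up for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class AsyncBroadcastSketch {

    // Returns true if the submitted task saw a ThreadPoolExecutor frame
    // in its own stack trace -- the same frame Spark's call-site logic
    // picks up, yielding "run at ThreadPoolExecutor.java:1142".
    static boolean taskRunsInsideThreadPoolExecutor() throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        try {
            // Stand-in for the asynchronous computation of the broadcast side.
            Future<Boolean> sawExecutorFrame = pool.submit(() -> {
                for (StackTraceElement frame : Thread.currentThread().getStackTrace()) {
                    if ("ThreadPoolExecutor.java".equals(frame.getFileName())) {
                        return true; // worker loop frame, not user code
                    }
                }
                return false;
            });
            return sawExecutorFrame.get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("task ran inside ThreadPoolExecutor: "
                + taskRunsInsideThreadPoolExecutor());
    }
}
```

The exact line number (1142) is just where the worker loop sits in that JDK release's ThreadPoolExecutor.java, which is why it can differ between Java versions.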

When you switch to the SQL tab you should see the Completed Queries section and their jobs (on the right).

SQL tab in web UI with Completed Queries

In my case the Spark jobs described as "run at ThreadPoolExecutor.java:1142" were ids 12 and 16.

Jobs tab in web UI with "run at ThreadPoolExecutor.java:1142" jobs

They both correspond to join queries.

If you wonder, "It makes sense that one of my joins is causing this job to appear, but as far as I know join is a shuffle transformation and not an action, so why is the job described with ThreadPoolExecutor and not with my action (as is the case with my other jobs)?", then my answer is usually along these lines:

Spark SQL is an extension of Spark with its own abstractions (Datasets, to name just the one that quickly springs to mind) that have their own operators for execution. One "simple" SQL operation can run one or more Spark jobs. It is at the discretion of Spark SQL's execution engine how many Spark jobs to run or submit (though they do use RDDs under the covers) -- you don't have to know such low-level details, as they are... well... too low-level, given how high-level you operate by using Spark SQL's SQL or Query DSL.

Jacek Laskowski answered Oct 05 '22