I'm using Spark SQL 1.6.1 and am performing a few joins.
Looking at the Spark UI, I see that there are some jobs with the description "run at ThreadPoolExecutor.java:1142".
I was wondering why some Spark jobs get that description.
After some investigation I found out that the "run at ThreadPoolExecutor.java:1142" Spark jobs are related to queries with join operators that fit the definition of BroadcastHashJoin, where one join side is broadcast to the executors for the join. That BroadcastHashJoin operator uses a ThreadPool for this asynchronous broadcasting (see this and this).
scala> spark.version
res16: String = 2.1.0-SNAPSHOT

scala> val left = spark.range(1)
left: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> val right = spark.range(1)
right: org.apache.spark.sql.Dataset[Long] = [id: bigint]

scala> left.join(right, Seq("id")).show
+---+
| id|
+---+
|  0|
+---+
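If you want to double-check that such a query is indeed planned as a broadcast join, printing the physical plan is a quick way to do it. A minimal sketch, reusing the left and right Datasets from the transcript above (exact operator names in the output can vary slightly between Spark versions):

// Print the physical plan to the console; with a build side small enough
// to broadcast, it should contain BroadcastHashJoin and BroadcastExchange.
left.join(right, Seq("id")).explain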
When you switch to the SQL tab you should see the Completed Queries section and their Jobs (on the right).
In my case the Spark jobs described as "run at ThreadPoolExecutor.java:1142" were ids 12 and 16. They both correspond to join queries.
If you wonder "that makes sense that one of my joins is causing this job to appear but as far as I know join is a shuffle transformation and not an action, so why is the job described with the ThreadPoolExecutor and not with my action (as is the case with the rest of my jobs)?", then my answer is usually along the lines:
Spark SQL is an extension of Spark with its own abstractions (Datasets, to name just the one that quickly springs to mind) that have their own operators for execution. One "simple" SQL operation can run one or more Spark jobs. It's at the discretion of Spark SQL's execution engine how many Spark jobs to run or submit (but they do use RDDs under the covers) -- you don't have to know such low-level details as it's...well...too low-level...given you are so high-level by using Spark SQL's SQL or Query DSL.
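A related experiment that makes the connection visible: if you disable automatic broadcast joins, the join is planned as a SortMergeJoin instead of a BroadcastHashJoin, and the "run at ThreadPoolExecutor.java:1142" jobs should no longer show up for it. A minimal sketch, assuming the Spark 2.x SparkSession API from the transcript above (in Spark 1.6 you would set the same key via sqlContext.setConf):

// Setting the broadcast threshold to -1 disables automatic broadcast joins.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

// Re-running the join should now show a SortMergeJoin in the physical plan
// (verify with explain) and no asynchronous-broadcast jobs in the UI.
left.join(right, Seq("id")).explain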