
How to know which stage of a job is currently running in Apache Spark?

Consider I have a job as follow in Spark;

CSV File ==> Filter By A Column ==> Taking Sample ==> Save As JSON

Now my requirement is: how do I know which step (fetching the file, filtering, or sampling) of the job is currently executing, programmatically (preferably using the Java API)? Is there any way to do this?

I can track jobs, stages and tasks using the SparkListener class, e.g. by tracking a stage ID. But how do I know which stage ID corresponds to which step in the job chain?

What I want is to send a notification to the user when, say, Filter By A Column has completed. For that I wrote a class that extends SparkListener, but I cannot find where to get the name of the currently executing transformation. Is it possible to track this at all?

import org.apache.spark.scheduler.SparkListener;
import org.apache.spark.scheduler.SparkListenerJobStart;
import org.apache.spark.scheduler.SparkListenerStageSubmitted;
import org.apache.spark.scheduler.SparkListenerTaskStart;

public class ProgressListener extends SparkListener {

  @Override
  public void onJobStart(SparkListenerJobStart jobStart)
  {
      // fires when a job starts; exposes the job ID and its stage infos
  }

  @Override
  public void onStageSubmitted(SparkListenerStageSubmitted stageSubmitted)
  {
      //System.out.println("Stage Name : " + stageSubmitted.stageInfo().name()); // gives only the action's call site, not the transformation name
  }

  @Override
  public void onTaskStart(SparkListenerTaskStart taskStart)
  {
      // there is no method like taskStart.name()
  }
}
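
For reference, the listener can be registered roughly like this (a minimal sketch; the app name, master URL and package name are placeholders):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf()
        .setAppName("progress-demo")   // placeholder app name
        .setMaster("local[*]");        // placeholder master
// Alternative: conf.set("spark.extraListeners", "my.pkg.ProgressListener");

JavaSparkContext sc = new JavaSparkContext(conf);
sc.sc().addSparkListener(new ProgressListener()); // register on the underlying SparkContext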
asked Feb 14 '17 by KOUSIK MANDAL




1 Answer

You cannot know exactly when, e.g., the filter operation starts or finishes.

That's because you have transformations (filter, map, ...) and actions (count, foreach, ...). Spark puts as many operations as possible into one stage, and that stage is then executed in parallel on the different partitions of your input. And here comes the problem.
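
To make that concrete, here is the questioner's chain as a hedged Dataset sketch (column name and paths are placeholders): the filter and sample are lazy transformations, and only the final write is an action that actually triggers a job.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().appName("lazy-demo").master("local[*]").getOrCreate();

Dataset<Row> csv      = spark.read().option("header", "true").csv("input.csv"); // read the CSV
Dataset<Row> filtered = csv.filter(col("someColumn").equalTo("someValue"));     // transformation (lazy)
Dataset<Row> sampled  = filtered.sample(false, 0.1);                            // transformation (lazy)
sampled.write().json("output/");                                                // action: the job starts here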

Assume you have several workers and the following program

LOAD ==> MAP ==> FILTER ==> GROUP BY + Aggregation

This program will probably have two stages: the first stage loads the file and applies the map and filter. The output is then shuffled to create the groups. In the second stage, the aggregation is performed.
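
A hedged sketch of what such a program could look like in the Java RDD API (paths and the exact aggregation are placeholders); reduceByKey stands in for the GROUP BY + aggregation and introduces the shuffle boundary between the two stages.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("stages-demo").setMaster("local[*]")); // placeholders

JavaPairRDD<String, Integer> counts = sc.textFile("input.txt")  // LOAD
        .map(String::trim)                                      // MAP    -- pipelined into stage 1
        .filter(line -> !line.isEmpty())                        // FILTER -- pipelined into stage 1
        .mapToPair(line -> new Tuple2<>(line, 1))
        .reduceByKey(Integer::sum);                             // GROUP BY + aggregation:
                                                                // shuffle boundary, runs in stage 2
counts.saveAsTextFile("output");                                // action: triggers both stages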

Now the problem is that you have several workers, and each processes a portion of your input data in parallel. That is, every executor in your cluster receives a copy of your program (the current stage) and executes it on its assigned partitions.

You see, you will have multiple instances of your map and filter operators executing in parallel, but not necessarily at the same time. In an extreme case, worker 1 will finish stage 1 before worker 20 has started at all (and therefore finish its filter operation before worker 20).

For RDDs, Spark uses the iterator model inside a stage. For Datasets in recent Spark versions, however, a single loop is generated over the partition and the transformations are executed inside it (whole-stage code generation). This means that in this case Spark itself does not really know when a transformation operator has finished for a single task!
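
As a hedged illustration (not something a listener can hook into), explain() prints the physical plan, where the fused operators appear under a single WholeStageCodegen node; paths and column names are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

SparkSession spark = SparkSession.builder().appName("plan-demo").master("local[*]").getOrCreate();

Dataset<Row> df = spark.read().option("header", "true").csv("input.csv"); // placeholder path

// filter and sample end up inside one generated loop per partition,
// so there is no per-operator boundary for a listener to observe.
df.filter(col("someColumn").equalTo("someValue"))
  .sample(false, 0.1)
  .explain(); // prints the physical plan, including the WholeStageCodegen node(s)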

Long story short:

  1. You are not able to know when an operation inside a stage finishes.
  2. Even if you could, there are multiple instances that would finish at different times.

That said, I have had the same problem myself:

In our Piglet project (please allow some advertisement ;-) ) we generate Spark code from Pig Latin scripts, and we wanted to profile those scripts. I ended up inserting a mapPartitions operator between all user operators that sends the partition ID and the current time to a server, which then evaluates the messages. However, this solution also has its limitations... and I'm not completely satisfied yet.
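
A hedged sketch of that idea, with mapPartitionsWithIndex as the marker operator (the predicate, the paths and the println standing in for the call to the monitoring server are placeholders; this is not the actual Piglet code):

import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;

JavaRDD<String> filtered = sc.textFile("input.csv")              // sc: a JavaSparkContext
        .filter(line -> line.contains("someValue"));             // placeholder predicate

// Marker operator inserted after the filter. Draining the upstream iterator
// forces the filter to run for this partition before the report is made
// (at the cost of buffering the whole partition in memory).
JavaRDD<String> marked = filtered.mapPartitionsWithIndex((partitionId, rows) -> {
    List<String> buffered = new ArrayList<>();
    rows.forEachRemaining(buffered::add);
    // In place of println, send (partitionId, timestamp) to the monitoring server.
    System.out.println("filter finished for partition " + partitionId
            + " at " + System.currentTimeMillis());
    return buffered.iterator();
}, true);

marked.saveAsTextFile("output");  // the action that triggers the job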

However, unless you are able to modify the programs, I'm afraid you cannot achieve what you want.

answered Oct 17 '22 by hage