I have a spark job running on YARN
and it appears to just hang and not be doing any computation.
Here's what YARN says when I run yarn application -status <APPLICATION ID>:
Application Report :
Application-Id : applicationID
Application-Name : test app
Application-Type : SPARK
User : ec2-user
Queue : default
Start-Time : 1491005660004
Finish-Time : 0
Progress : 10%
State : RUNNING
Final-State : UNDEFINED
Tracking-URL : http://<ip>:4040
RPC Port : 0
AM Host : <host ip>
Aggregate Resource Allocation : 36343926 MB-seconds, 9818 vcore-seconds
Log Aggregation Status : NOT_START
Diagnostics :
And when I check yarn application -list, it also says that it is RUNNING. But I'm not sure I trust that: when I go to the Spark web UI, I see only one stage for the entire few hours I've been running it.
Also, when I click on the "Stages" tab, I see nothing running.
How do I ensure that my application is actually running and that YARN is not lying to me?
I would actually prefer for this to throw an error rather than keep me waiting to see if the job is actually running. How do I do that?
You can view the status of the Spark application created for a notebook in the status widget on the notebook panel. The widget also displays links to the Spark UI, the driver logs, and the kernel log. Additionally, you can view the progress of the Spark job while the code runs.
Sometimes Spark runs slowly because too many concurrent tasks are competing for resources. The capacity for high concurrency is normally a beneficial feature, since it provides Spark-native fine-grained sharing that maximizes resource utilization while cutting down query latencies, but past a certain point the scheduling and shuffle overhead of very many small tasks can make a job crawl.
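If you suspect the job is drowning in tiny concurrent tasks, one sketch is to cap the parallelism when you build the session. The property names below are standard Spark settings, but the values are only placeholders you would tune for your own cluster, and on YARN you can equally pass the same properties with --conf on spark-submit:

    import org.apache.spark.sql.SparkSession

    // Sketch with placeholder values: limit concurrent tasks per executor and
    // the number of partitions, so the scheduler is not flooded with tiny tasks.
    val spark = SparkSession.builder()
      .appName("test app")
      .config("spark.executor.cores", "4")             // concurrent tasks per executor
      .config("spark.default.parallelism", "200")      // default RDD partition count
      .config("spark.sql.shuffle.partitions", "200")   // partitions produced by shuffles in Spark SQL
      .getOrCreate()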
A job comprises several stages. When Spark encounters a function that requires a shuffle, it creates a new stage. Wide transformations like reduceByKey(), join(), etc. trigger a shuffle and therefore result in a new stage. Spark will also create a stage when you read a dataset.
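As a minimal sketch (the input path and the word-count logic are just an assumed example), reading a file gives you one stage and the reduceByKey shuffle forces a second one:

    import org.apache.spark.sql.SparkSession

    // Minimal sketch: the textFile read is its own stage; reduceByKey needs a
    // shuffle, so Spark cuts the job into a second stage at that boundary.
    val spark = SparkSession.builder().appName("stage-demo").getOrCreate()
    val sc = spark.sparkContext
    val lines = sc.textFile("hdfs:///tmp/input.txt")   // assumed path; read stage
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                              // shuffle boundary -> new stage
    counts.count()                                     // action that actually runs both stages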
On the Spark application UI, if you click on the stage link (e.g. "parquet at Nativexxxx"), it will show you the details for the running stage.
On that screen there is a column called "Input Size / Records". If your job is progressing, the number shown in that column keeps changing; it basically shows the number of records your executors have read so far.
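If you would rather check progress from the driver than keep refreshing the UI, one sketch (assuming "spark" is your active SparkSession, as it is in spark-shell) is to poll the status tracker, which exposes the same per-stage task counts the UI shows:

    // Sketch: poll the status tracker from the driver; if the completed-task
    // count never moves, the stage really is stuck, whatever YARN's state says.
    val tracker = spark.sparkContext.statusTracker
    tracker.getActiveStageIds().foreach { stageId =>
      tracker.getStageInfo(stageId).foreach { info =>
        println(s"stage $stageId (${info.name()}): " +
          s"${info.numCompletedTasks()} of ${info.numTasks()} tasks done, " +
          s"${info.numActiveTasks()} active, ${info.numFailedTasks()} failed")
      }
    }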
If you go to the Spark UI and open the "Executors" tab, you will see the list of executors your job is running on. Next to each executor ID and address there is a "Logs" column with "stdout" and "stderr" links. Click on stdout and you can see the logs that were written on that container while your job is running.
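As a related sketch (again assuming "spark" is your active SparkSession), the same status tracker can list the executors programmatically, which helps when the UI on port 4040 is hard to reach from outside the cluster:

    // Sketch: print each executor's host, port and running-task count, as an
    // alternative to reading the "Executors" tab in the Spark UI.
    val tracker = spark.sparkContext.statusTracker
    tracker.getExecutorInfos.foreach { exec =>
      println(s"executor ${exec.host()}:${exec.port()} is running ${exec.numRunningTasks()} task(s)")
    }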