I have a simple Spark Structured Streaming app that reads from Kafka and writes to HDFS. Today the app has mysteriously stopped working, with no changes or modifications whatsoever (it had been working flawlessly for weeks).
So far, everything I have observed looks normal, yet nothing is being written to HDFS anymore. Code snippet:
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

// Read from Kafka, starting at the latest offsets.
val inputData = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap_servers)
  .option("subscribe", "topic-name-here")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()

// Write Parquet files to HDFS, triggering a micro-batch every 60 seconds.
inputData.toDF()
  .repartition(10)
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "hdfs://...")
  .option("path", "hdfs://...")
  .outputMode(OutputMode.Append())
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()
Any ideas why the UI shows no jobs/tasks?
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. Structured Streaming, in contrast, is built on the Spark SQL API for stream processing. Under the hood, Structured Streaming queries are optimized by the Catalyst optimizer and ultimately translated into RDDs for execution.
Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
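As a minimal sketch of that batch/stream parity (the session, schema, path, and userId column here are hypothetical, purely for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parity-demo").getOrCreate()
import spark.implicits._

// Batch: read static JSON files once.
val batchDf = spark.read.json("hdfs://namenode:8020/data/events")

// Streaming: same source format; a schema must be supplied up front.
val streamDf = spark.readStream.schema(batchDf.schema).json("hdfs://namenode:8020/data/events")

// The aggregation is expressed identically in both cases;
// the engine simply runs it incrementally for the stream.
val batchCounts  = batchDf.groupBy($"userId").count()
val streamCounts = streamDf.groupBy($"userId").count()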
Now that the Direct API of Spark Streaming (we are currently on version 2.3.2) is deprecated, and since we recently added the Confluent Platform (which ships with Kafka 2.2.0) to our project, we plan to migrate these applications.
Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
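For comparison, a minimal DStream sketch (the socket source and word count are illustrative only); each batch interval yields one RDD in the stream:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-demo")
val ssc = new StreamingContext(conf, Seconds(60)) // micro-batch interval

// Each 60-second interval of socket data becomes one RDD in the DStream.
val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()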
For anyone facing the same issue, I found the culprit:
Somehow, the data within _spark_metadata in the HDFS directory where I was saving the output had become corrupted.
The solution was to erase that directory and restart the application, which re-created it. After that, data started flowing again.
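If you would rather script the cleanup than run hdfs dfs -rm -r by hand, a sketch using the Hadoop FileSystem API could look like this (the output path is hypothetical; point it at whatever you passed to the sink's "path" option, and be aware that removing the sink metadata makes Spark forget which files it had committed):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical sink location; substitute the actual "path" option value.
val metadataDir = new Path("hdfs://namenode:8020/data/output/_spark_metadata")

val fs = FileSystem.get(metadataDir.toUri, new Configuration())
// Recursively delete the corrupted sink metadata;
// the restarted query re-creates the directory on its first commit.
if (fs.exists(metadataDir)) {
  fs.delete(metadataDir, true)
}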