Spark Structured Streaming app has no jobs and no stages

I have a simple Spark Structured Streaming app that reads from Kafka and writes to HDFS. Today the app has mysteriously stopped working, with no changes or modifications whatsoever (it had been working flawlessly for weeks).

So far, I have observed the following:

  • App has no active, failed or completed tasks
  • App UI shows no jobs and no stages
  • QueryProgress indicates 0 input rows every trigger
  • QueryProgress indicates offsets from Kafka were read and committed correctly (which means data is actually there)
  • Data is indeed available in the topic (writing to console shows the data)

Despite all of that, nothing is being written to HDFS anymore. Code snippet:

import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val inputData = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrap_servers)
  .option("subscribe", "topic-name-here")
  .option("startingOffsets", "latest")
  .option("failOnDataLoss", "false")
  .load()

inputData.toDF()
  .repartition(10)
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "hdfs://...")
  .option("path", "hdfs://...")
  .outputMode(OutputMode.Append())
  .trigger(Trigger.ProcessingTime("60 seconds"))
  .start()

Any ideas why the UI shows no jobs/tasks?

No jobs for the application

No tasks and basically no activity

Query Progress

Asked Apr 04 '18 by Ander Murillo Zohn

People also ask

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.

How does Spark structured streaming work?

Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
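As a sketch of that batch-like API: the word count below uses the same DataFrame operations you would write against static data, and the engine re-evaluates them incrementally. This is a minimal illustration, not the question's code; the socket source on localhost:9999, the local master, and all names are assumptions.

```scala
// Minimal Structured Streaming sketch, assuming Spark is on the classpath.
// The socket source and host/port are illustrative placeholders.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val spark = SparkSession.builder()
  .appName("structured-streaming-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Read lines from a socket exactly as you would read a static DataFrame.
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Ordinary batch-style transformations; the engine maintains the running
// counts incrementally as new lines arrive.
val counts = lines.as[String]
  .flatMap(_.split("\\s+"))
  .groupBy("value")
  .count()

counts.writeStream
  .outputMode(OutputMode.Complete())  // emit the full updated result table
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
  .awaitTermination()
```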

Is Spark streaming deprecated?

Now that the Direct API of Spark Streaming (we currently have version 2.3.2) is deprecated, and we recently added the Confluent platform (which comes with Kafka 2.2.0) to our project, we plan to migrate these applications.

What are streaming jobs in Spark?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
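The DStream model described above can be sketched as follows; this is an illustrative example (the socket source, host/port, and 5-second batch interval are assumptions), assuming Spark Streaming is on the classpath.

```scala
// Minimal DStream sketch: live input is divided into 5-second batches,
// and each batch is processed as an RDD by the Spark engine.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5)) // batch interval: 5 seconds

// A DStream: a continuous stream of data represented as a series of RDDs.
val lines = ssc.socketTextStream("localhost", 9999)

// Per-batch word count; results are produced as a stream of batches.
val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
counts.print() // prints one result batch per input batch

ssc.start()
ssc.awaitTermination()
```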


1 Answer

For anyone facing the same issue, I found the culprit:

Somehow the data within the _spark_metadata directory, inside the HDFS path where I was saving the data, got corrupted.

The solution was to delete that directory and restart the application, which re-created it. After that, data started flowing again.
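For reference, that cleanup can be sketched with HDFS shell commands. The path below is a placeholder for the "path" option of the writeStream sink, not a value from the question; stop the streaming query first, and consider backing the directory up before deleting it.

```shell
# Stop the streaming application first, then inspect the sink metadata.
# hdfs:///<output-path> is a placeholder for the writeStream "path" option.
hdfs dfs -ls hdfs:///<output-path>/_spark_metadata

# Optional: keep a copy of the (possibly corrupted) metadata for inspection.
hdfs dfs -cp hdfs:///<output-path>/_spark_metadata hdfs:///tmp/_spark_metadata.bak

# Remove it; restarting the query re-creates the directory.
hdfs dfs -rm -r hdfs:///<output-path>/_spark_metadata
```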

Answered Oct 31 '22 by Ander Murillo Zohn