Continuous trigger not found in Structured Streaming

Runtime: Spark 2.3.0, Scala 2.11 (Databricks 4.1 ML beta)

import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import scala.concurrent.duration._

// Kafka settings and df definition go here

val query = df.writeStream.format("parquet")
  .option("path", ...)
  .option("checkpointLocation", ...)
  .trigger(continuous(30000))
  .outputMode(OutputMode.Append)
  .start

This throws the compile error: not found: value continuous

Other attempts that did not work:

.trigger(continuous = "30 seconds") // as per the Databricks blog
// throws the same error as above

.trigger(Trigger.Continuous("1 second")) // as per the Spark docs
// throws java.lang.IllegalStateException: Unknown type of trigger: ContinuousTrigger(1000)

References:

(Databricks Blog) https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html

(Spark guide) http://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing

(Scaladoc) https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.streaming.package

asked Jun 20 '18 by maverik


People also ask

What is trigger in structured streaming?

In streaming systems, a special event called a trigger kicks off the processing of new data. Spark Structured Streaming offers several trigger types. Default: executes a micro-batch as soon as the previous one finishes. Fixed-interval micro-batches: executes micro-batches at a user-specified interval.
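A minimal sketch of both trigger types, assuming a streaming DataFrame df has already been defined; the console sink and the 30-second interval are illustrative choices, not requirements:

import org.apache.spark.sql.streaming.Trigger

// Default trigger: each micro-batch starts as soon as the previous one finishes.
val defaultQuery = df.writeStream
  .format("console")
  .start()

// Fixed-interval trigger: a micro-batch is kicked off every 30 seconds.
val intervalQuery = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()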

What is the role of trigger in Spark structured streaming?

Apache Spark provides the .trigger(once=True) option to process all new data from the source directory as a single micro-batch. This trigger-once pattern ignores all settings that control streaming input size, which can lead to massive spill or out-of-memory errors.
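The snippet above is the Python form; a minimal Scala sketch of the same pattern uses Trigger.Once(). The output and checkpoint paths here are hypothetical placeholders:

import org.apache.spark.sql.streaming.Trigger

// Trigger.Once(): process everything currently available as one micro-batch, then stop.
val onceQuery = df.writeStream
  .format("parquet")
  .option("path", "/tmp/output")                    // hypothetical output path
  .option("checkpointLocation", "/tmp/checkpoint")  // hypothetical checkpoint path
  .trigger(Trigger.Once())
  .start()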

What is new in structured streaming in Spark?

Apache Spark Structured Streaming is a near-real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data.
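To illustrate the "same as batch" point, here is a sketch contrasting a batch read with its streaming equivalent; the /data/events path and the eventSchema value are assumptions made for the example:

// Batch: a static DataFrame over files that already exist.
val batchCounts = spark.read.schema(eventSchema).json("/data/events")
  .groupBy("type")
  .count()

// Streaming: the same transformation, but over an unbounded stream of files.
// (File sources require an explicit schema when streaming.)
val streamCounts = spark.readStream.schema(eventSchema).json("/data/events")
  .groupBy("type")
  .count()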

What is difference between DStream and structured streaming?

Internally, a DStream is a sequence of RDDs: Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing.
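A sketch of the two APIs side by side, using a local socket source on port 9999 purely for illustration:

import org.apache.spark.streaming.{Seconds, StreamingContext}

// DStream API: data arrives as a sequence of RDDs on a fixed 10-second batch interval.
val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
val dstreamLines = ssc.socketTextStream("localhost", 9999)

// Structured Streaming: the same source exposed as an unbounded DataFrame.
val streamingLines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()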


1 Answer

Spark 2.3.0 does not support the parquet sink under continuous processing; you have to use a sink based on Kafka, console, or memory.

To quote the Databricks blog post on continuous processing mode in Structured Streaming:

You can set the optional Continuous Trigger in queries that satisfy the following conditions: Read from supported sources like Kafka and write to supported sinks like Kafka, memory, console.
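So Trigger.Continuous is accepted once both ends of the query are supported. A minimal sketch using a Kafka sink, assuming df comes from a Kafka source and exposes the value column the Kafka sink expects; the broker address, topic, and checkpoint path are placeholders:

import org.apache.spark.sql.streaming.Trigger

// Kafka source to Kafka sink: both are supported under continuous processing
// in Spark 2.3.0. Note "30 seconds" here is the checkpoint interval,
// not a batch interval.
val query = df.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")  // placeholder broker
  .option("topic", "output")                        // placeholder topic
  .option("checkpointLocation", "/tmp/checkpoint")  // placeholder path
  .trigger(Trigger.Continuous("30 seconds"))
  .outputMode("append")
  .start()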

answered Oct 05 '22 by Javier Luraschi