Runtime: Spark 2.3.0, Scala 2.11 (Databricks 4.1 ML beta)
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._
// Kafka settings and df definition go here
val query = df.writeStream.format("parquet")
.option("path", ...)
.option("checkpointLocation",...)
.trigger(continuous(30000))
.outputMode(OutputMode.Append)
.start
Throws the error: not found: value continuous
Other attempts that did not work:
.trigger(continuous = "30 seconds") // as per the Databricks blog
// throws the same error as above
.trigger(Trigger.Continuous("1 second")) // as per the Spark docs
// throws java.lang.IllegalStateException: Unknown type of trigger: ContinuousTrigger(1000)
References:
(Databricks Blog) https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html
(Spark guide) http://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
(Scaladoc) https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.sql.streaming.package
In streaming systems, a special event is needed to kick off processing; that event is called a trigger. Spark Structured Streaming offers a few trigger types. Default: a new micro-batch executes as soon as the previous one finishes. Fixed-interval micro-batches: micro-batches execute at the interval you specify.
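A minimal sketch contrasting the two, assuming df is the streaming DataFrame defined earlier and writing to the console sink only for illustration:

import org.apache.spark.sql.streaming.Trigger

// Default: no trigger specified; each micro-batch starts as soon as the previous one finishes
val defaultQuery = df.writeStream
  .format("console")
  .start()

// Fixed-interval micro-batches: kick off a micro-batch every 30 seconds
val fixedIntervalQuery = df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("30 seconds"))
  .start()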
Apache Spark also provides the Trigger.Once() option (trigger(once=True) in PySpark) to process all new data from the source as a single micro-batch. This trigger-once pattern ignores all settings that control streaming input size, which can lead to massive spill or out-of-memory errors.
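A sketch of the trigger-once pattern in Scala; the output path and checkpoint location below are placeholders, not values from the question:

import org.apache.spark.sql.streaming.Trigger

// Process all data currently available in the source as one micro-batch, then stop.
// Rate-limit settings such as maxFilesPerTrigger are ignored in this mode.
val onceQuery = df.writeStream
  .format("parquet")
  .option("path", "/tmp/once-output")           // placeholder path
  .option("checkpointLocation", "/tmp/once-cp") // placeholder checkpoint
  .trigger(Trigger.Once())
  .outputMode("append")
  .start()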
Apache Spark Structured Streaming is a near-real-time processing engine that offers end-to-end fault tolerance with exactly-once processing guarantees using familiar Spark APIs. Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data.
Internally, a DStream is a sequence of RDDs: Spark receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing.
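To illustrate that batch/streaming parity, a small sketch; the paths and column name here are hypothetical:

// Batch: a static DataFrame read
val staticDf  = spark.read.format("json").load("/data/events")
val staticAgg = staticDf.groupBy("userId").count()

// Streaming: the same transformation expressed on a streaming DataFrame
// (file-based streaming sources require an explicit schema)
val streamDf  = spark.readStream.format("json").schema(staticDf.schema).load("/data/events")
val streamAgg = streamDf.groupBy("userId").count()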
Spark 2.3.0 does not support a Parquet sink in continuous processing mode; you would have to use a Kafka, console, or memory sink instead, which is why the Trigger.Continuous attempt above fails with the IllegalStateException.
To quote the Databricks blog post on continuous processing mode in Structured Streaming (linked above):
You can set the optional Continuous Trigger in queries that satisfy the following conditions: Read from supported sources like Kafka and write to supported sinks like Kafka, memory, console.
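As a sketch of a query that does satisfy those conditions in Spark 2.3.0, using the built-in rate source and the console sink (the checkpoint path is a placeholder):

import org.apache.spark.sql.streaming.Trigger

val continuousQuery = spark.readStream
  .format("rate")          // test source supported by continuous processing
  .load()
  .writeStream
  .format("console")       // console is a supported sink; parquet is not
  .option("checkpointLocation", "/tmp/continuous-cp")  // placeholder path
  .trigger(Trigger.Continuous("30 seconds"))           // checkpoint interval of 30 seconds
  .outputMode("append")
  .start()

Note that in Spark 2.3.0, continuous processing only supports map-like operations (projections and selections such as select, map, where, filter), not aggregations.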