We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.
Going forward, we also want to write the processed data to Kafka. Is it possible to do both from the same streaming query? (Spark version 2.1.1)
In the logs I see the streaming query progress output, and I have a sample durations JSON from it. Can someone please clarify the difference between addBatch and getBatch?
Also, is triggerExecution the time taken to both process the fetched data and write it to the sink?
"durationMs" : {
"addBatch" : 2263426,
"getBatch" : 12,
"getOffset" : 273,
"queryPlanning" : 13,
"triggerExecution" : 2264288,
"walCommit" : 552
},
Duplicate rows can be removed from a Spark SQL DataFrame using the distinct() and dropDuplicates() functions: distinct() removes rows that have the same values in all columns, whereas dropDuplicates() removes rows that have the same values in a selected subset of columns.
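A plain-Python sketch of the two semantics (this is not Spark code; in PySpark the equivalent calls are DataFrame.distinct() and DataFrame.dropDuplicates(subset)). The sample rows and helper names below are made up for illustration:

```python
# Rows as tuples: (name, dept, salary).
rows = [
    ("james", "sales", 3000),
    ("james", "sales", 3000),   # exact duplicate -> removed by distinct()
    ("james", "sales", 4100),   # differs in salary -> kept by distinct()
]

def distinct(rows):
    """Remove rows identical in ALL columns, keeping the first occurrence."""
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def drop_duplicates(rows, cols):
    """Remove rows identical in the SELECTED columns (given by index)."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[i] for i in cols)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

print(len(distinct(rows)))                 # 2: only the exact duplicate is dropped
print(len(drop_duplicates(rows, [0, 1])))  # 1: rows sharing name+dept collapse
```

The difference matters when rows agree on the key columns but differ elsewhere: distinct() keeps both, dropDuplicates() keeps only the first.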
Sink is the extension of the BaseStreamingSink contract for streaming sinks that can add batches to an output. Sink is part of the Data Source API V1 and is used in micro-batch stream processing only.
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are translated into RDDs for execution under the hood, with Structured Streaming additionally optimized by the Catalyst optimizer.
Exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
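A toy Python sketch of why those two properties combine to give exactly-once output: a replayable source can re-deliver a batch after a failure, and an idempotent sink makes the re-delivery harmless by tracking batch IDs. The IdempotentSink class and add_batch method here are hypothetical illustrations, not Spark APIs:

```python
class IdempotentSink:
    """Toy sink that commits each (batch_id, rows) pair at most once."""

    def __init__(self):
        self.committed = {}          # batch_id -> rows already written

    def add_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return                   # replayed batch: skip, no duplicate output
        self.committed[batch_id] = list(rows)

sink = IdempotentSink()
sink.add_batch(0, ["a", "b"])
sink.add_batch(0, ["a", "b"])        # source replays batch 0 after a failure
total = sum(len(r) for r in sink.committed.values())
print(total)                         # 2, not 4: output written exactly once
```

If either property is missing the guarantee breaks: a non-replayable source loses the batch entirely, and a non-idempotent sink writes it twice.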
Yes. In Spark 2.1.1, you can use writeStream.foreach to write your data to Kafka. There is an example in this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
Alternatively, you can upgrade to Spark 2.2.0, which adds a Kafka sink with official support for writing to Kafka.
getBatch measures how long it takes to create a DataFrame from the source; this is usually pretty fast. addBatch measures how long it takes to run that DataFrame in the sink. triggerExecution measures how long one trigger execution takes, and is usually almost the same as getOffset + getBatch + addBatch.
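Applying that to the durations JSON in the question, a quick Python check confirms the three phases account for nearly all of triggerExecution, with the small remainder spent on query planning, the WAL commit, and bookkeeping:

```python
import json

# The durations JSON from the question, all values in milliseconds.
progress = json.loads("""{
    "addBatch": 2263426,
    "getBatch": 12,
    "getOffset": 273,
    "queryPlanning": 13,
    "triggerExecution": 2264288,
    "walCommit": 552
}""")

core = progress["getOffset"] + progress["getBatch"] + progress["addBatch"]
print(core)                                 # 2263711 ms spent fetching + sinking
print(progress["triggerExecution"] - core)  # 577 ms: planning, WAL commit, overhead
```

Here addBatch dominates at roughly 37 minutes, so the time is going into processing the data and writing it to the sink, not into reading from Kafka.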