 

Spark Structured streaming: multiple sinks

  1. We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.

    We also want to write the processed data to Kafka going forward. Is it possible to do that from the same streaming query? (Spark version 2.1.1)

  2. In the logs, I see the streaming query progress output, and I have a sample durations JSON from the log (shown below). Can someone please clarify the difference between addBatch and getBatch?

  3. triggerExecution - is it the time taken to both process the fetched data and write it to the sink?

    "durationMs" : {
        "addBatch" : 2263426,
        "getBatch" : 12,
        "getOffset" : 273,
       "queryPlanning" : 13,
        "triggerExecution" : 2264288,
        "walCommit" : 552
    },
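
For context, this durationMs block comes from the StreamingQueryProgress that Spark reports after each trigger; the same information can also be read programmatically via query.lastProgress instead of from the logs. A minimal Scala sketch, assuming `query` is the already-started StreamingQuery above (the helper name is illustrative):

    import org.apache.spark.sql.streaming.StreamingQuery

    // `query` is assumed to be an already-started StreamingQuery,
    // e.g. the Kafka-to-S3 query described above.
    def printLastProgress(query: StreamingQuery): Unit = {
      val progress = query.lastProgress      // latest StreamingQueryProgress; null before the first trigger completes
      if (progress != null) {
        println(progress.json)               // full progress report, including the "durationMs" block shown above
        val durations = progress.durationMs  // java.util.Map[String, java.lang.Long]
        println(s"addBatch took ${durations.get("addBatch")} ms")
        println(s"triggerExecution took ${durations.get("triggerExecution")} ms")
      }
    }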
    
asked Aug 11 '17 by user2221654

People also ask

How does Spark handle duplicates in streaming?

Duplicate rows can be removed from a Spark SQL DataFrame using the distinct() and dropDuplicates() functions: distinct() removes rows that have the same values in all columns, whereas dropDuplicates() removes rows that have the same values in a chosen subset of columns.
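
A minimal Scala sketch of that difference, shown on a batch DataFrame for brevity (the data and column names are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dedup-example").getOrCreate()
    import spark.implicits._

    // Illustrative data: the two "alice" rows differ only in the score column.
    val df = Seq(("alice", 1), ("alice", 2), ("bob", 1), ("bob", 1)).toDF("name", "score")

    df.distinct().show()              // drops rows identical in ALL columns  -> both "alice" rows kept, one "bob" row
    df.dropDuplicates("name").show()  // drops rows with the same "name" only -> one "alice" row, one "bob" row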

What is sink in Spark streaming?

Sink is an extension of the BaseStreamingSink contract for streaming sinks that can add batches to an output. Sink is part of the Data Source API V1 and is used in Micro-Batch Stream Processing only.

What is the difference between Spark streaming and structured streaming?

Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. Under the hood, the APIs are optimized by Spark's Catalyst optimizer and translated into RDDs for execution.

Which property must a Spark structured streaming sink possess to ensure end to end exactly once semantics?

Exactly-once semantics are only possible if the source is replayable and the sink is idempotent.


1 Answer

  1. Yes.

    In Spark 2.1.1, you can use writeStream.foreach to write your data into Kafka. There is an example in this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html

    Or you can use Spark 2.2.0, which adds a built-in Kafka sink that officially supports writing to Kafka (a minimal sketch is shown after this answer).

  2. getBatch measures how long it takes to create a DataFrame from the source. This is usually pretty fast. addBatch measures how long it takes to run the DataFrame in the sink.

  3. triggerExecution measures how long it takes to run a whole trigger, and is usually almost the same as getOffset + getBatch + addBatch.
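
To illustrate point 1 with the Spark 2.2.0+ built-in Kafka sink, here is a minimal sketch; the broker addresses, topic name, and checkpoint path are placeholders, and `processed` is assumed to be the already-transformed streaming DataFrame from the question (the S3 output can remain a separate, independent writeStream query):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{struct, to_json}

    // `processed` is assumed to be the transformed streaming DataFrame that is already written to S3.
    def writeToKafka(processed: DataFrame) = {
      processed
        // The Kafka sink expects a string/binary "value" column (and optionally "key"),
        // so serialize all columns into a single JSON string.
        .select(to_json(struct(processed.columns.map(processed(_)): _*)).alias("value"))
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")                    // placeholder brokers
        .option("topic", "processed-events")                                  // placeholder topic
        .option("checkpointLocation", "s3://bucket/checkpoints/kafka-sink")   // placeholder; each query needs its own checkpoint
        .start()
    }

On Spark 2.1.1, the same effect can be achieved with writeStream.foreach and a custom ForeachWriter that wraps a KafkaProducer, as shown in the blog post linked above.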

answered Sep 21 '22 by zsxwing