We are consuming from Kafka using Structured Streaming and writing the processed dataset to S3.
Going forward, we also want to write the processed data to Kafka. Is it possible to do both from the same streaming query? (Spark version 2.1.1)
In the logs I see the streaming query progress output, and I have a sample durations JSON from it. Can someone please clarify the difference between addBatch and getBatch?
Also, is triggerExecution the time taken to both process the fetched data and write it to the sink?
"durationMs" : {
"addBatch" : 2263426,
"getBatch" : 12,
"getOffset" : 273,
"queryPlanning" : 13,
"triggerExecution" : 2264288,
"walCommit" : 552
},
Duplicate rows can be removed from a Spark SQL DataFrame using the distinct() and dropDuplicates() functions: distinct() removes rows that have the same values in all columns, whereas dropDuplicates() removes rows that have the same values in a selected subset of columns.
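A plain-Python sketch of the two semantics (this is not Spark code; in PySpark the equivalent calls are DataFrame.distinct() and DataFrame.dropDuplicates(subset)). The sample rows and helper names below are made up for illustration:

```python
# Rows as tuples: (name, dept, salary).
rows = [
    ("james", "sales", 3000),
    ("james", "sales", 3000),   # exact duplicate -> removed by distinct()
    ("james", "sales", 4100),   # differs in salary -> kept by distinct()
]

def distinct(rows):
    """Remove rows identical in ALL columns, keeping the first occurrence."""
    seen, out = set(), []
    for r in rows:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

def drop_duplicates(rows, cols):
    """Remove rows identical in the SELECTED columns (given by index)."""
    seen, out = set(), []
    for r in rows:
        key = tuple(r[i] for i in cols)
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

print(len(distinct(rows)))                 # 2: only the exact duplicate is dropped
print(len(drop_duplicates(rows, [0, 1])))  # 1: rows sharing name+dept collapse
```

The difference matters when rows agree on the key columns but differ elsewhere: distinct() keeps both, dropDuplicates() keeps only the first.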
Sink is the extension of the BaseStreamingSink contract for streaming sinks that can add batches to an output. Sink is part of the Data Source API V1 and is used in micro-batch stream processing only.
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine. In contrast, Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are translated into RDDs for execution under the hood, with Structured Streaming additionally optimized by the Catalyst optimizer.
Exactly-once semantics are only possible if the source is replayable and the sink is idempotent.
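A toy Python sketch of why those two properties combine to give exactly-once output: a replayable source can re-deliver a batch after a failure, and an idempotent sink makes the re-delivery harmless by tracking batch IDs. The IdempotentSink class and add_batch method here are hypothetical illustrations, not Spark APIs:

```python
class IdempotentSink:
    """Toy sink that commits each (batch_id, rows) pair at most once."""

    def __init__(self):
        self.committed = {}          # batch_id -> rows already written

    def add_batch(self, batch_id, rows):
        if batch_id in self.committed:
            return                   # replayed batch: skip, no duplicate output
        self.committed[batch_id] = list(rows)

sink = IdempotentSink()
sink.add_batch(0, ["a", "b"])
sink.add_batch(0, ["a", "b"])        # source replays batch 0 after a failure
total = sum(len(r) for r in sink.committed.values())
print(total)                         # 2, not 4: output written exactly once
```

If either property is missing the guarantee breaks: a non-replayable source loses the batch entirely, and a non-idempotent sink writes it twice.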
Yes. In Spark 2.1.1, you can use writeStream.foreach to write your data to Kafka. There is an example in this blog: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
Alternatively, you can upgrade to Spark 2.2.0, which adds a Kafka sink with official support for writing to Kafka.
getBatch measures how long it takes to create a DataFrame from the source; this is usually pretty fast. addBatch measures how long it takes to run that DataFrame in the sink. triggerExecution measures how long one trigger execution takes, and is usually almost the same as getOffset + getBatch + addBatch.
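Applying that to the durations JSON in the question, a quick Python check confirms the three phases account for nearly all of triggerExecution, with the small remainder spent on query planning, the WAL commit, and bookkeeping:

```python
import json

# The durations JSON from the question, all values in milliseconds.
progress = json.loads("""{
    "addBatch": 2263426,
    "getBatch": 12,
    "getOffset": 273,
    "queryPlanning": 13,
    "triggerExecution": 2264288,
    "walCommit": 552
}""")

core = progress["getOffset"] + progress["getBatch"] + progress["addBatch"]
print(core)                                 # 2263711 ms spent fetching + sinking
print(progress["triggerExecution"] - core)  # 577 ms: planning, WAL commit, overhead
```

Here addBatch dominates at roughly 37 minutes, so the time is going into processing the data and writing it to the sink, not into reading from Kafka.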