I'm writing a test application that consumes messages from Kafka topics and then pushes the data into S3 and into RDBMS tables (the flow is similar to the one presented here: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html). So I read data from Kafka and have something like:
Dataset<Row> df = spark
  .readStream()
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1,topic2,topic3")
  .option("startingOffsets", "earliest")
  .load()
  .select(from_json(col("value").cast("string"), schema, jsonOptions).alias("parsed_value"));
(please notice that I'm reading from more than one Kafka topic). Next I define the required datasets:
Dataset<Row> allMessages = df.select(.....)
Dataset<Row> messagesOfType1 = df.select() //some unique conditions applied on JSON elements
Dataset<Row> messagesOfType2 = df.select() //some other unique conditions
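To make that concrete, the per-type selections are roughly filters on the parsed JSON, like the sketch below; the field name msgType and its values are just placeholders for my real conditions:

Dataset<Row> messagesOfType1 = df
  .select(col("parsed_value.*"))
  .filter(col("msgType").equalTo("type1")); // hypothetical condition on a JSON element

Dataset<Row> messagesOfType2 = df
  .select(col("parsed_value.*"))
  .filter(col("msgType").equalTo("type2")); // another hypothetical condition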
and now for each Dataset I create a query to start processing:
StreamingQuery s3Query = allMessages
  .writeStream()
  .format("parquet")
  .option("startingOffsets", "latest")
  .option("path", "s3_location")
  .start();
StreamingQuery firstQuery = messagesOfType1
  .writeStream()
  .foreach(new CustomForEachWriterType1()) // class that extends ForeachWriter[T] and saves data into an external RDBMS table
  .start();
StreamingQuery secondQuery = messagesOfType2
  .writeStream()
  .foreach(new CustomForEachWriterType2()) // class that extends ForeachWriter[T] and saves data into an external RDBMS table (maybe even a different database than before)
  .start();
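The writer classes are not central to the question, but for illustration this is roughly their shape; the JDBC URL, table name and column mapping below are just placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;

public class CustomForEachWriterType1 extends ForeachWriter<Row> {
  private Connection connection;
  private PreparedStatement insert;

  @Override
  public boolean open(long partitionId, long version) {
    try {
      // placeholder JDBC endpoint and target table
      connection = DriverManager.getConnection("jdbc:postgresql://db-host:5432/testdb", "user", "secret");
      insert = connection.prepareStatement("INSERT INTO messages_type1 (payload) VALUES (?)");
      return true;
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void process(Row row) {
    try {
      insert.setString(1, row.mkString(",")); // map the row's columns to the table schema as needed
      insert.executeUpdate();
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }

  @Override
  public void close(Throwable errorOrNull) {
    try {
      if (insert != null) insert.close();
      if (connection != null) connection.close();
    } catch (SQLException e) {
      // ignore failures while closing
    }
  }
}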
Now I'm wondering:
Will those queries be executed in parallel (or one after another in FIFO order, so that I should assign those queries to separate scheduler pools)?
Structured Streaming lets you express computation on streaming data in the same way you express a batch computation on static data. The Structured Streaming engine performs the computation incrementally and continuously updates the result as streaming data arrives.
Note that, if you want to receive multiple streams of data in parallel in your streaming application, the older DStream-based API lets you create multiple input DStreams (discussed further in the Performance Tuning section). This will create multiple receivers which will simultaneously receive multiple data streams.
Spark Structured Streaming is a stream processing engine built on Spark SQL that processes data incrementally and updates the final result as more streaming data arrives. It borrowed many ideas from the other structured APIs in Spark (DataFrame and Dataset) and offers query optimizations similar to Spark SQL.
Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.
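In practice that checkpointing is configured per query; for example, the Parquet/file sink will not start without a checkpoint location, set either via the checkpointLocation option or the spark.sql.streaming.checkpointLocation default. A minimal sketch for the S3 query from the question (the checkpoint path is a placeholder):

StreamingQuery s3Query = allMessages
  .writeStream()
  .format("parquet")
  .option("checkpointLocation", "s3_checkpoint_location") // placeholder path; required for the file sink
  .option("path", "s3_location")
  .start();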
Will those queries be executed in parallel
Yes. These queries are going to be executed in parallel (on every trigger which, since you did not specify one, defaults to running them as fast as possible).
Internally, when you execute start on a DataStreamWriter, you create a StreamExecution that, in turn, immediately creates the so-called daemon microBatchThread (quoted from the Spark source code below):
val microBatchThread =
  new StreamExecutionThread(s"stream execution thread for $prettyIdString") {
    override def run(): Unit = {
      // To fix call site like "run at <unknown>:0", we bridge the call site from the caller
      // thread to this micro batch thread
      sparkSession.sparkContext.setCallSite(callSite)
      runBatches()
    }
  }
So every query runs in its own thread with the name:
stream execution thread for [prettyIdString]
You can check the separate threads using jstack or jconsole.
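Since each start() call returns immediately and the queries keep running on their own threads, the driver only needs to stay alive. One way to do that (a sketch, assuming the three queries above have already been started) is to block on the shared StreamingQueryManager:

try {
  // blocks the driver until any of the running queries terminates or fails
  spark.streams().awaitAnyTermination();
} catch (org.apache.spark.sql.streaming.StreamingQueryException e) {
  // a query failed; inspect or log the cause here
}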