How to start multiple streaming queries in a single Spark application?

I have built a few Spark Structured Streaming queries to run on EMR. They are long-running ETL-type queries that need to run at all times. When I submit a job to the YARN cluster on EMR, I can submit only a single Spark application, so that application has to contain multiple streaming queries.

I am confused about how to build and start multiple streaming queries programmatically within the same spark-submit.

For example, I have this code:

case class SparkJobs(prop: Properties) extends Serializable {
  def run() = {
    Type1SparkJobBuilder(prop).build().awaitTermination()
    Type1SparkJobBuilder(prop).build().awaitTermination()
  }
}

I fire this in my main class with SparkJobs(new Properties()).run()

When I look at the Spark history server, only the first streaming job (Type1SparkJob) is running.

What is the recommended way to start multiple streaming queries programmatically within the same spark-submit? I could not find proper documentation on this either.

asked Oct 11 '18 14:10

Naveen Cotha

1 Answer

Since you're calling awaitTermination on the first query, the call blocks until that query terminates, so the second query is never started while the first one is running. Start both queries first, and only then block with StreamingQueryManager.awaitAnyTermination.

val query1 = df.writeStream.start()
val query2 = df.writeStream.start()

spark.streams.awaitAnyTermination()
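
To make the snippet above concrete, here is a minimal self-contained sketch. It assumes the built-in "rate" source and "console" sink as stand-ins for the real ETL inputs and outputs; the object name, app name, and query names are hypothetical and not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

object MultiQueryApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("multi-query-sketch") // hypothetical app name
      .getOrCreate()

    // Assumption: the built-in "rate" source (timestamp, value) stands in for the real input.
    val df = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // start() returns a StreamingQuery handle right away; nothing blocks here.
    val query1 = df.writeStream
      .queryName("query1")
      .format("console")
      .start()

    // The second query is started before any blocking call is made.
    val query2 = df.selectExpr("value * 2 AS doubled")
      .writeStream
      .queryName("query2")
      .format("console")
      .start()

    // Blocks until any one of the active queries terminates,
    // and rethrows the failure if that query stopped with an exception.
    spark.streams.awaitAnyTermination()
  }
}
```

Note that awaitAnyTermination returns as soon as one query stops, so whether the application should then exit or restart that query is a design decision left to the caller.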

In addition to the above, Spark by default schedules the jobs inside an application in FIFO order. This means the first query's jobs get priority on all available resources in the cluster while they are executing. Since you're trying to run multiple queries concurrently, you should switch to the FAIR scheduler by setting spark.scheduler.mode to FAIR.

If some queries should get more resources than others, you can also tune the individual scheduler pools.
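
As a rough sketch of the scheduler advice, the example below enables FAIR mode on the session and assigns each query to its own pool through the spark.scheduler.pool local property. The pool names "pool1" and "pool2", the object name, and the rate/console source and sink are assumptions for illustration only. Pool weights and minimum shares would be defined separately in a fair scheduler allocation file (referenced via spark.scheduler.allocation.file), which is not shown here; pools without such a definition get default settings.

```scala
import org.apache.spark.sql.SparkSession

object FairPoolsApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fair-pools-sketch") // hypothetical app name
      .config("spark.scheduler.mode", "FAIR") // switch from the default FIFO mode
      .getOrCreate()

    // Assumption: the built-in "rate" source stands in for the real input.
    val df = spark.readStream.format("rate").load()

    // The pool is a thread-local property; a query started from this
    // thread after the call runs its jobs in that pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
    val query1 = df.writeStream.queryName("query1").format("console").start()

    // Switch the property before starting the next query.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
    val query2 = df.writeStream.queryName("query2").format("console").start()

    spark.streams.awaitAnyTermination()
  }
}
```

The same mode can also be passed at submit time with --conf spark.scheduler.mode=FAIR instead of being hard-coded in the builder.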

answered Sep 23 '22 13:09

Silvio

Tags:

apache-spark

spark-structured-streaming
