I'm a newbie in the Spark world and struggling with some concepts.
How does parallelism happen when using Spark Structured Streaming sourcing from Kafka?
Let's consider the following code snippet:
SparkSession spark = SparkSession
    .builder()
    .appName("myApp")
    .getOrCreate();

Dataset<VideoEventData> ds = spark
    .readStream()
    .format("kafka")
    ...

gDataset = ds.groupByKey(...);

pDataset = gDataset.mapGroupsWithState(
    ...
    /* process each key's values:
       loop over the values;
       if a value is valid, save the key/value result to HDFS
       ... */
);

StreamingQuery query = pDataset.writeStream()
    .outputMode("update")
    .format("console")
    .start();

// block until the streaming query terminates
query.awaitTermination();
I've read that parallelism is related to the number of data partitions, and that the number of partitions for a Dataset is based on the spark.sql.shuffle.partitions parameter.
For every batch (pull from Kafka), will the pulled items be divided among the number of spark.sql.shuffle.partitions? For example, with spark.sql.shuffle.partitions=5 and Batch1=100 rows, will we end up with 5 partitions of 20 rows each?
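For reference, a minimal sketch of how that parameter can be set (5 is just the example figure above):

SparkSession spark = SparkSession
    .builder()
    .appName("myApp")
    .config("spark.sql.shuffle.partitions", "5") // partition count used after shuffles such as groupByKey
    .getOrCreate();

// it can also be changed at runtime, before the streaming query is started
spark.conf().set("spark.sql.shuffle.partitions", "5");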
Considering the code snippet provided, do we still leverage Spark's parallelism given the groupByKey followed by the mapGroups/mapGroupsWithState functions?
UPDATE:
Inside gDataset.mapGroupsWithState is where I process each key's values and store the result in HDFS, so the output sink is used only to print some stats to the console.
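Roughly, that stateful step looks like the following sketch (simplified: the String key, the Long count used as state, and the validity check stand in for my real types and logic):

import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.streaming.GroupState;

// count the valid events per key and keep that count as state;
// the real job persists the per-key result to HDFS inside the loop
MapGroupsWithStateFunction<String, VideoEventData, Long, String> processGroup =
    (key, values, state) -> {
        long count = state.exists() ? state.get() : 0L;
        while (values.hasNext()) {
            VideoEventData value = values.next();
            // if the value is valid, save the key/value result to HDFS here
            count++;
        }
        state.update(count);
        return key + ": " + count; // only these stats go to the console sink
    };

Dataset<String> pDataset = gDataset.mapGroupsWithState(
    processGroup, Encoders.LONG(), Encoders.STRING());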
Spark Structured Streaming allows near real-time computation over streaming data using the Spark SQL engine, producing aggregates or other output according to the defined logic. The streaming data can be read from a file, a socket, or sources such as Kafka.
Spark Streaming receives real-time data and divides it into smaller batches for the execution engine, whereas Structured Streaming is built on the Spark SQL API for data stream processing. In the end, both APIs are optimized by the Spark Catalyst optimizer and translated into RDDs for execution under the hood.
A Discretized Stream, or DStream, is the basic abstraction provided by Spark Streaming. It represents a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream.
For every batch (pull from Kafka), will the pulled items be divided among the number of spark.sql.shuffle.partitions?
They will be divided once they reach groupByKey, which is a shuffle boundary. When you first retrieve the data, the number of partitions will be equal to the number of Kafka partitions.
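To illustrate (a sketch only; the broker address and topic name are made up):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> raw = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "video-events")
    .load();
// before any shuffle, this stream has as many Spark partitions
// as the "video-events" topic has Kafka partitions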
Considering the code snippet provided, do we still leverage Spark's parallelism given the groupByKey followed by the mapGroups/mapGroupsWithState functions?
Generally yes, but it also depends on how you set up your Kafka topic. Although not visible to you from the code, Spark will internally split each stage's data into smaller tasks and distribute them among the available executors in the cluster. If your Kafka topic has only 1 partition, that means that prior to groupByKey your internal stream will contain a single partition, which won't be parallelized but executed on a single executor. As long as your Kafka partition count is greater than 1, your processing will be parallel. After the shuffle boundary, Spark will re-partition the data to contain the number of partitions specified by spark.sql.shuffle.partitions.
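If increasing the Kafka partition count is not an option, one possible sketch (assuming Spark 2.4+, where the Kafka source supports the minPartitions option; broker and topic names are made up) is to ask Spark for more input slices than there are Kafka partitions:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> input = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "video-events")
    .option("minPartitions", "5") // the read stage uses at least 5 partitions, even for a 1-partition topic
    .load();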