Unbounded table in Spark Structured Streaming

I'm starting to learn Spark and am having a difficult time understanding the rationale behind Structured Streaming. Structured Streaming treats all arriving data as an unbounded input table, in which every new item in the data stream is treated as a new row. I have the following piece of code to read files as they arrive in the csvFolder directory.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

// Every column is read as a string.
val csvSchema = new StructType().add("street", "string").add("city", "string")
  .add("zip", "string").add("state", "string").add("beds", "string")
  .add("baths", "string").add("sq__ft", "string").add("type", "string")
  .add("sale_date", "string").add("price", "string").add("latitude", "string")
  .add("longitude", "string")

// Treat files arriving in ./csvFolder/ as a stream.
val streamingDF = spark.readStream.schema(csvSchema).csv("./csvFolder/")

// Print each micro-batch to the console.
val query = streamingDF.writeStream
  .format("console")
  .start()

What happens if I dump a 1 GB file into the folder? As per the specs, the streaming job is triggered every few milliseconds. If Spark encounters such a huge file on the next trigger, won't it run out of memory while trying to load the file? Or does it automatically batch it? If so, is this batching parameter configurable?

asked May 20 '17 23:05

Shubham Mittal

1 Answer

See the example

The key idea is to treat any data stream as an unbounded table: new records added to the stream are like rows being appended to the table. This allows us to treat both batch and streaming data as tables. Since tables and DataFrames/Datasets are semantically synonymous, the same batch-like DataFrame/Dataset queries can be applied to both batch and streaming data.
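As a minimal sketch of that last point, the snippet below applies one query to both a batch read and a streaming read of the same folder. The helper name `countByCity` and the choice of a per-city count are assumptions made for illustration, not something from the question; the sketch also assumes `./csvFolder/` already exists and that a master is supplied (for example via spark-submit or spark-shell).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

// Same columns as in the question, all read as strings.
val columns = Seq("street", "city", "zip", "state", "beds", "baths",
  "sq__ft", "type", "sale_date", "price", "latitude", "longitude")
val csvSchema = columns.foldLeft(new StructType())((schema, name) => schema.add(name, "string"))

// Hypothetical helper: one query definition, usable on either kind of DataFrame.
def countByCity(df: DataFrame): DataFrame = df.groupBy("city").count()

// Batch: the files currently in the folder form a bounded table.
val batchCounts = countByCity(spark.read.schema(csvSchema).csv("./csvFolder/"))
batchCounts.show()

// Streaming: the same folder is treated as an unbounded table that grows with each new file.
val streamCounts = countByCity(spark.readStream.schema(csvSchema).csv("./csvFolder/"))

// A streaming aggregation without a watermark needs "complete" or "update" output mode.
val query = streamCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```

In a standalone application you would typically follow this with `query.awaitTermination()` so the driver stays alive while the stream runs.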

In the Structured Streaming model, this is how the execution of this query is performed. (image not available)

Question: If Spark encounters such a huge file in the next instant, won't it run out of memory while trying to load the file? Or does it automatically batch it? If so, is this batching parameter configurable?

Answer: There is no risk of OOM here, because the underlying RDD (DataFrame/Dataset) is evaluated lazily and the file is read partition by partition rather than loaded into memory as a whole. Of course, you should repartition before processing so that the data is spread evenly across partitions and executors.
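On the configurability part of the question, here is a hedged sketch rather than a definitive recipe. The file source accepts a `maxFilesPerTrigger` option that caps how many new files go into one micro-batch, and `Trigger.ProcessingTime` sets the micro-batch interval (without it, the next micro-batch starts as soon as the previous one finishes). The values `1` and `"10 seconds"` are assumptions chosen for illustration. Note that `maxFilesPerTrigger` limits the number of files, not the size of a single file; how one large file is split into read partitions is governed by `spark.sql.files.maxPartitionBytes`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder.appName("SimpleApp").getOrCreate()

// Same columns as in the question, all read as strings.
val columns = Seq("street", "city", "zip", "state", "beds", "baths",
  "sq__ft", "type", "sale_date", "price", "latitude", "longitude")
val csvSchema = columns.foldLeft(new StructType())((schema, name) => schema.add(name, "string"))

val streamingDF = spark.readStream
  .schema(csvSchema)
  .option("maxFilesPerTrigger", 1L) // illustrative value: at most one new file per micro-batch
  .csv("./csvFolder/")

val query = streamingDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds")) // illustrative interval between micro-batches
  .start()
```

As with the original snippet, a standalone application would normally end with `query.awaitTermination()`.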

answered Oct 21 '22 20:10

Ram Ghadiyaram

Tags: scala, apache-spark, spark-structured-streaming
