
Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are building a real-time stream processing system with Spark Streaming that applies a very large number (millions) of analytic models to RDDs from many different types of incoming metric data streams (more than 100,000). These streams are either original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which Spark cluster node will process which specific RDDs from the different streams, we would need to make ALL of these models available at each Spark compute node. This would create huge overhead at each Spark node. We are considering using in-memory data grids to provide these models at the Spark compute nodes. Is this the right approach?

Or

Should we avoid using Spark Streaming altogether and just use an in-memory data grid like Redis (with pub/sub) to solve this problem? In that case we would stream data to the specific Redis nodes that hold the corresponding models. Of course, we would then have to implement all the binning/windowing ourselves.
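For concreteness, here is a rough sketch (Scala with Jedis) of the pub/sub routing we have in mind, where each worker subscribes only to the channels whose models it holds locally. The channel naming, the Model type, and the model assignment are placeholders, not working code from our system:

```scala
import redis.clients.jedis.{Jedis, JedisPubSub}

object RedisRoutingSketch {
  // Hypothetical model: a single coefficient per stream, assigned to this worker.
  case class Model(coefficient: Double) { def score(x: Double): Double = coefficient * x }

  def main(args: Array[String]): Unit = {
    // The models this particular worker node is responsible for (placeholder assignment).
    val localModels = Map("stream:42" -> Model(1.5), "stream:43" -> Model(0.8))

    val subscriber = new JedisPubSub {
      override def onMessage(channel: String, message: String): Unit = {
        // Any binning/windowing would have to be implemented by hand here.
        val score = localModels(channel).score(message.toDouble)
        println(s"$channel -> $score")
      }
    }

    // Blocks and dispatches messages for this worker's channels only, so records
    // are only ever routed to the node that already holds their model.
    new Jedis("redis-host", 6379).subscribe(subscriber, localModels.keys.toSeq: _*)
  }
}
```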

Please suggest.

asked Jun 16 '14 by Tribhuwan Negi


People also ask

What method does Spark use to perform streaming operations?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
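For illustration, a minimal Spark Streaming sketch (Scala) of this batching model, assuming a 5-second batch interval and a plain text socket source on localhost:9999 (both arbitrary choices made for this example):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // Incoming data is grouped into 5-second batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each 5-second batch of lines arriving on the socket becomes one RDD of the DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```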

What is RDD in Spark streaming?

The data arrives as a stream in batches; this continuous stream of data is called a DStream. Every batch of the DStream contains a collection of elements that can be processed in parallel; this collection is called an RDD.
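Continuing the previous sketch, each micro-batch can be accessed as an ordinary RDD via foreachRDD (assuming `lines` is the DStream[String] built above):

```scala
// Each micro-batch of the DStream surfaces as one RDD; any RDD operation
// (map, filter, save, collect, ...) can be applied to it here.
lines.foreachRDD { (rdd, batchTime) =>
  println(s"Batch at $batchTime contains ${rdd.count()} records")
}
```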

Which of the following component can be used to perform streaming data analysis in Spark?

Spark Streaming, which uses Spark Core's fast scheduling capability to perform streaming analytics.

What are some of the ways of processing streaming data in Apache spark?

Spark Streaming comes with several API methods that are useful for processing data streams. There are RDD-like operations such as map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, and join. It also provides additional APIs to process the streaming data based on window and stateful operations.
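As a hedged illustration of the window and stateful operations mentioned above, assuming hypothetical "<metricName> <value>" text records arriving on localhost:9999 and a 10-second batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowAndStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-state-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/spark-checkpoint") // required by updateStateByKey

    // Hypothetical input: text lines of the form "<metricName> <value>".
    val metrics = ssc.socketTextStream("localhost", 9999)
      .map(_.split(" "))
      .filter(_.length == 2)
      .map(parts => (parts(0), parts(1).toDouble))

    // Window operation: count per metric over the last 60 seconds, recomputed every 10 seconds.
    val windowedCounts = metrics
      .map { case (name, _) => (name, 1L) }
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    windowedCounts.print()

    // Stateful operation: running total per metric across all batches.
    val runningTotals = metrics.updateStateByKey[Double] { (values, state) =>
      Some(state.getOrElse(0.0) + values.sum)
    }
    runningTotals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```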


1 Answer

It sounds to me like you need a combination of a stream processing engine and a distributed data store. I would design the system like this:

  1. The distributed datastore (Redis, Cassandra, etc.) holds the data you want to access from all the nodes.
  2. Receive the data streams through a data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process them in a stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
  3. In the functions that are used to process the stream records, the necessary data will have to be pulled from the data store and possibly cached locally as appropriate (see the sketch after this list).
  4. You may also have to update the data store from Spark Streaming as the application needs it, in which case you will also have to worry about versioning of the data that you pull in step 3.
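Here is a hedged sketch (Scala) of steps 2 and 3, assuming metric records arrive on a Kafka topic named "metrics" as "streamId,value" strings and serialized models live in Redis under keys of the form "model:<streamId>"; the Model case class, the deserialization, and all host names are placeholders. Each executor fetches a model from Redis only the first time it sees that stream and memoizes it in a node-local cache, so no single node has to hold all of the models:

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import redis.clients.jedis.Jedis

// Hypothetical model: a single coefficient stored as text in Redis under "model:<streamId>".
case class Model(coefficient: Double) {
  def score(x: Double): Double = coefficient * x
}

// Node-local cache: this object is instantiated once per executor JVM, so each
// node only holds the models for the streams it has actually processed.
object ModelCache {
  val models = new ConcurrentHashMap[String, Model]()
}

object ModelScoringSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("model-scoring-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Step 2: ingest through Kafka (receiver-based 0.8 API; hosts and topic are placeholders).
    val records = KafkaUtils
      .createStream(ssc, "zk-host:2181", "scoring-group", Map("metrics" -> 4))
      .map(_._2) // keep only the message payload, e.g. "streamId,value"

    // Step 3: inside the processing function, pull only the needed models from
    // the data store (Redis here) and memoize them locally on the executor.
    val scored = records.mapPartitions { partition =>
      val jedis = new Jedis("redis-host", 6379)
      val results = partition.flatMap { line =>
        line.split(",", 2) match {
          case Array(streamId, value) =>
            var model = ModelCache.models.get(streamId)
            if (model == null) {
              // Hypothetical deserialization: the model is stored as a plain
              // coefficient string; a real system would use a binary format.
              model = Model(jedis.get(s"model:$streamId").toDouble)
              ModelCache.models.put(streamId, model)
            }
            Some((streamId, model.score(value.toDouble)))
          case _ => None // skip malformed records
        }
      }.toList // force evaluation before closing the Redis connection
      jedis.close()
      results.iterator
    }

    scored.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the per-partition connection and the node-local cache, the number of Redis round-trips is proportional to the number of distinct streams a node actually processes, not to the total number of models.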

Hopefully that made sense. It's hard to give any more implementation specifics without knowing the exact computation model. Hope this helps!

answered Oct 02 '22 by Tathagata Das