
Spark Streaming with large number of streams and models used for analytical processing of RDDs

We are building a real-time stream processing system with Spark Streaming that applies a very large number (millions) of analytic models to RDDs from many different types of incoming metric data streams (more than 100,000). These streams are either original or transformed streams. Each RDD has to go through an analytical model for processing. Since we do not know which Spark cluster node will process which specific RDDs from the different streams, we would need to make ALL of these models available at each Spark compute node. This would create huge overhead at each Spark node. We are considering using in-memory data grids to provide these models at the Spark compute nodes. Is this the right approach?

Or

Should we avoid using Spark Streaming altogether and just use an in-memory data grid like Redis (with pub/sub) to solve this problem? In that case we would stream data to the specific Redis nodes that hold the corresponding models. Of course, we would then have to implement all the binning/windowing ourselves.
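For concreteness, here is a rough sketch (Scala with Jedis) of the pub/sub routing we have in mind, where each worker subscribes only to the channels whose models it holds locally. The channel naming, the Model type, and the model assignment are placeholders, not working code from our system:

```scala
import redis.clients.jedis.{Jedis, JedisPubSub}

object RedisRoutingSketch {
  // Hypothetical model: a single coefficient per stream, assigned to this worker.
  case class Model(coefficient: Double) { def score(x: Double): Double = coefficient * x }

  def main(args: Array[String]): Unit = {
    // The models this particular worker node is responsible for (placeholder assignment).
    val localModels = Map("stream:42" -> Model(1.5), "stream:43" -> Model(0.8))

    val subscriber = new JedisPubSub {
      override def onMessage(channel: String, message: String): Unit = {
        // Any binning/windowing would have to be implemented by hand here.
        val score = localModels(channel).score(message.toDouble)
        println(s"$channel -> $score")
      }
    }

    // Blocks and dispatches messages for this worker's channels only, so records
    // are only ever routed to the node that already holds their model.
    new Jedis("redis-host", 6379).subscribe(subscriber, localModels.keys.toSeq: _*)
  }
}
```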

Please suggest.

asked Jun 16 '14 by Tribhuwan Negi


People also ask

What method does Spark use to perform streaming operations?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
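For illustration, a minimal Spark Streaming sketch (Scala) of this batching model, assuming a 5-second batch interval and a plain text socket source on localhost:9999 (both arbitrary choices made for this example):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    // Incoming data is grouped into 5-second batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Each 5-second batch of lines arriving on the socket becomes one RDD of the DStream.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1L)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```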

What is RDD in Spark streaming?

The data arrives as a stream in batches; this continuous stream of data is called a DStream. Every batch of the DStream contains a collection of elements that can be processed in parallel; this collection is called an RDD.
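Continuing the previous sketch, each micro-batch can be accessed as an ordinary RDD via foreachRDD (assuming `lines` is the DStream[String] built above):

```scala
// Each micro-batch of the DStream surfaces as one RDD; any RDD operation
// (map, filter, save, collect, ...) can be applied to it here.
lines.foreachRDD { (rdd, batchTime) =>
  println(s"Batch at $batchTime contains ${rdd.count()} records")
}
```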

Which of the following component can be used to perform streaming data analysis in Spark?

Spark Streaming, which uses Spark Core's fast scheduling capability to perform streaming analytics.

What are some of the ways of processing streaming data in Apache spark?

Spark Streaming comes with several API methods that are useful for processing data streams. There are RDD-like operations such as map, flatMap, filter, count, reduce, groupByKey, reduceByKey, sortByKey, and join. It also provides additional APIs to process the streaming data based on window and stateful operations.
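As a hedged illustration of the window and stateful operations mentioned above, assuming hypothetical "<metricName> <value>" text records arriving on localhost:9999 and a 10-second batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowAndStateSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("window-state-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("/tmp/spark-checkpoint") // required by updateStateByKey

    // Hypothetical input: text lines of the form "<metricName> <value>".
    val metrics = ssc.socketTextStream("localhost", 9999)
      .map(_.split(" "))
      .filter(_.length == 2)
      .map(parts => (parts(0), parts(1).toDouble))

    // Window operation: count per metric over the last 60 seconds, recomputed every 10 seconds.
    val windowedCounts = metrics
      .map { case (name, _) => (name, 1L) }
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
    windowedCounts.print()

    // Stateful operation: running total per metric across all batches.
    val runningTotals = metrics.updateStateByKey[Double] { (values, state) =>
      Some(state.getOrElse(0.0) + values.sum)
    }
    runningTotals.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```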


1 Answer

It sounds to me like you need a combination of a stream processing engine and a distributed data store. I would design the system like this:

  1. The distributed datastore (Redis, Cassandra, etc.) holds the data you want to access from all the nodes.
  2. Receive the data streams through a data ingestion system (Kafka, Flume, ZeroMQ, etc.) and process them in a stream processing system (Spark Streaming [preferably ;)], Storm, etc.).
  3. In the functions that are used to process the stream records, the necessary data will have to be pulled from the data store and possibly cached locally as appropriate (see the sketch after this list).
  4. You may also have to update the data store from Spark Streaming as the application needs it, in which case you will also have to worry about versioning of the data that you pull in step 3.
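Here is a hedged sketch (Scala) of steps 2 and 3, assuming metric records arrive on a Kafka topic named "metrics" as "streamId,value" strings and serialized models live in Redis under keys of the form "model:<streamId>"; the Model case class, the deserialization, and all host names are placeholders. Each executor fetches a model from Redis only the first time it sees that stream and memoizes it in a node-local cache, so no single node has to hold all of the models:

```scala
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import redis.clients.jedis.Jedis

// Hypothetical model: a single coefficient stored as text in Redis under "model:<streamId>".
case class Model(coefficient: Double) {
  def score(x: Double): Double = coefficient * x
}

// Node-local cache: this object is instantiated once per executor JVM, so each
// node only holds the models for the streams it has actually processed.
object ModelCache {
  val models = new ConcurrentHashMap[String, Model]()
}

object ModelScoringSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("model-scoring-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Step 2: ingest through Kafka (receiver-based 0.8 API; hosts and topic are placeholders).
    val records = KafkaUtils
      .createStream(ssc, "zk-host:2181", "scoring-group", Map("metrics" -> 4))
      .map(_._2) // keep only the message payload, e.g. "streamId,value"

    // Step 3: inside the processing function, pull only the needed models from
    // the data store (Redis here) and memoize them locally on the executor.
    val scored = records.mapPartitions { partition =>
      val jedis = new Jedis("redis-host", 6379)
      val results = partition.flatMap { line =>
        line.split(",", 2) match {
          case Array(streamId, value) =>
            var model = ModelCache.models.get(streamId)
            if (model == null) {
              // Hypothetical deserialization: the model is stored as a plain
              // coefficient string; a real system would use a binary format.
              model = Model(jedis.get(s"model:$streamId").toDouble)
              ModelCache.models.put(streamId, model)
            }
            Some((streamId, model.score(value.toDouble)))
          case _ => None // skip malformed records
        }
      }.toList // force evaluation before closing the Redis connection
      jedis.close()
      results.iterator
    }

    scored.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

With the per-partition connection and the node-local cache, the number of Redis round-trips is proportional to the number of distinct streams a node actually processes, not to the total number of models.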

Hopefully that made sense. It's hard to give any more implementation specifics without knowing the exact computation model. Hope this helps!

answered Oct 02 '22 by Tathagata Das