We are developing a Spark framework in which we move historical data into RDDs.
Basically, an RDD is an immutable, read-only dataset on which we perform operations. Based on that, we have moved the historical data into an RDD and we run computations like filtering, mapping, etc. on such RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of an RDD. I create another RDD based on the request scope and save a reference to that RDD in a ScopeCollection.
So far I have been able to think of the approaches below:
Approach 1: broadcast the change
Approach 2: create an RDD for the updates (a rough sketch of what I mean follows this list)
Approach 3:
I had thought of creating a streaming RDD where I keep updating the same RDD and re-computing. But as far as I understand, Spark Streaming takes streams from sources such as Flume or Kafka, whereas in my case the values are generated in the application itself based on user interaction. Hence I cannot see any integration point for streaming RDDs in my context.
Any suggestion on which approach is better, or any other approach suitable for this scenario?
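For Approach 2, this is roughly what I have in mind (a minimal sketch; the ids, values and local master are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object Approach2Sketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("approach2-sketch").setMaster("local[*]"))

    // Hypothetical historical data and an updated subset, both keyed by an id
    val historical = sc.parallelize(Seq(("a", 10.0), ("b", 20.0), ("c", 30.0)))
    val updates    = sc.parallelize(Seq(("b", 25.0)))

    // Drop the keys being replaced, union in the updates, then recompute downstream
    val merged     = historical.subtractByKey(updates).union(updates)
    val recomputed = merged.filter { case (_, value) => value > 15.0 }

    recomputed.collect().foreach(println)
    sc.stop()
  }
}
```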
TIA!
RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, like transformations, on an existing RDD. An RDD in Spark can be cached and reused for future transformations, which is a huge benefit for users.
RDDs are immutable in nature, i.e. we cannot change an RDD; we need to transform it by applying transformation(s). There are various transformations and actions that can be applied to an RDD.
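A minimal sketch of that idea (names and the local master are illustrative only): each transformation returns a new RDD, the original is never modified, and cache() lets the derived RDD be reused across actions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ImmutableRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("immutable-rdd-sketch").setMaster("local[*]"))

    val original = sc.parallelize(1 to 10)         // never modified in place
    val doubled  = original.map(_ * 2)             // a transformation returns a NEW RDD
    val filtered = doubled.filter(_ > 10).cache()  // cache the derived RDD for reuse

    println(filtered.count())                      // first action computes and caches
    println(filtered.sum())                        // second action reuses the cached data

    sc.stop()
  }
}
```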
Terrible answer, given that data can only be "appended" by creating a new RDD. That's like saying you can update an immutable list because you can make a second immutable list with more data.
The use case presented here is a good match for Spark Streaming. The two other options raise the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and to preserve that data in RDD form. Kafka and Flume are only two of the possible stream sources.
You could use socket communication with SocketInputDStream, read files in a directory with FileInputDStream, or even use a shared queue with QueueInputDStream. If none of those options fits your application, you could write your own InputDStream.
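Since your updates are generated inside the application itself, the queue-backed stream may be the closest fit. A minimal sketch, assuming updates are (id, value) pairs pushed into a mutable queue (the names, values and local master are illustrative):

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("queue-stream-sketch").setMaster("local[2]"), Seconds(5))

    // The application pushes each user-generated update into this queue as a small RDD;
    // queueStream turns the queue into a DStream without Kafka or Flume.
    val updateQueue = new mutable.Queue[RDD[(String, Double)]]()
    val updates     = ssc.queueStream(updateQueue)

    updates.print()
    ssc.start()

    // Somewhere in the application, whenever a user changes a value:
    updateQueue += ssc.sparkContext.parallelize(Seq(("b", 25.0)))

    ssc.awaitTermination()
  }
}
```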
In this use case, using Spark Streaming, you will read your base RDD and use the incoming DStream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation can help you build an in-memory state addressed by keys. See the documentation for further information.
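A rough sketch of how those two pieces could fit together, assuming updates arrive as "id,value" lines on a socket and a local checkpoint directory (the host, port, path and key/value types are assumptions, not part of your setup):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingUpdateSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("streaming-update-sketch").setMaster("local[2]"), Seconds(5))
    ssc.checkpoint("/tmp/spark-checkpoint")   // updateStateByKey needs a checkpoint dir

    // Base RDD of historical values, keyed by id (illustrative data)
    val historical = ssc.sparkContext.parallelize(Seq(("a", 10.0), ("b", 20.0)))

    // Updates arriving as "id,value" lines (any InputDStream would do here)
    val updates = ssc.socketTextStream("localhost", 9999).map { line =>
      val Array(id, value) = line.split(",")
      (id, value.toDouble)
    }

    // transform: combine each batch of updates with the base RDD
    val joined = updates.transform(batch => batch.leftOuterJoin(historical))

    // updateStateByKey: keep the latest value seen per key across batches
    val latest = updates.updateStateByKey[Double] { (newValues, current) =>
      newValues.lastOption.orElse(current)
    }

    joined.print()
    latest.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```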
Without more details on the application, it is hard to go down to the code level on what is possible with Spark Streaming. I'd suggest you explore this path and ask new questions for any specific topics.