How to update an RDD?


We are developing a Spark framework in which we move historical data into RDDs.

Basically, an RDD is an immutable, read-only dataset on which we perform operations. We have moved the historical data into RDDs and we run computations such as filtering and mapping on them.

Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.

The HistoricalData is in the form of an RDD. I create another RDD based on the request scope and save the reference of that RDD in a ScopeCollection.

So far I have been able to think of the approaches below.

Approach 1: broadcast the change (sketched after this list):

  1. For each change request, my server fetches the scope-specific RDD and spawns a job
  2. In the job, apply a map phase on that RDD -

    2.a. For each node in the RDD, do a lookup on the broadcast and create a new value that is now updated, thereby creating a new RDD
    2.b. Do all the computations again on the new RDD from step 2.a, such as multiplication, reduction, etc.
    2.c. Save this RDD's reference back in my ScopeCollection
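A minimal sketch of this broadcast approach, assuming a hypothetical key/value layout of (String, Double) pairs; the `historicalData` RDD, the `updates` map and the computed total are illustrative placeholders, not the actual application data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object BroadcastUpdateSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-update").setMaster("local[*]"))

    // Hypothetical historical data: (key, value) pairs.
    val historicalData: RDD[(String, Double)] =
      sc.parallelize(Seq(("a", 1.0), ("b", 2.0), ("c", 3.0)))

    // Updates produced by one change request, small enough to broadcast.
    val updates: Map[String, Double] = Map("b" -> 20.0)
    val updatesBc = sc.broadcast(updates)

    // Map phase: each element looks up the broadcast and takes the updated value if present,
    // producing a brand new RDD (the original is untouched).
    val updatedRdd: RDD[(String, Double)] = historicalData.map { case (k, v) =>
      (k, updatesBc.value.getOrElse(k, v))
    }

    // Re-run the downstream computations on the new RDD and keep its reference.
    val total = updatedRdd.map(_._2).reduce(_ + _)
    println(s"recomputed total = $total")  // 1.0 + 20.0 + 3.0 = 24.0

    sc.stop()
  }
}
```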

Approach 2: create an RDD for the updates (sketched after this list):

  1. For each change request, my server fetches the scope-specific RDD and spawns a job
  2. On that RDD, do a join with a new RDD containing the changes
  3. Do all the computations again on the joined RDD from step 2, such as multiplication, reduction, etc.
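A minimal sketch of this join approach, reusing the same hypothetical (String, Double) layout; `applyUpdatesViaJoin` is an illustrative helper name, not part of the actual application:

```scala
import org.apache.spark.rdd.RDD

// Assumes an existing SparkContext `sc` and the key/value layout from the broadcast sketch above.
def applyUpdatesViaJoin(
    scopeRdd: RDD[(String, Double)],
    updatesRdd: RDD[(String, Double)]): RDD[(String, Double)] = {
  // leftOuterJoin keeps every historical key; where an update exists, it wins.
  scopeRdd.leftOuterJoin(updatesRdd).mapValues {
    case (oldValue, maybeNew) => maybeNew.getOrElse(oldValue)
  }
}

// val updated = applyUpdatesViaJoin(scopeRdd, sc.parallelize(Seq("b" -> 20.0)))
// `updated` can then feed the same multiplication/reduction steps, and its reference
// can be stored back in the ScopeCollection in place of the old one.
```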

Approach 3:

I had thought of using Spark Streaming, where I keep updating the same RDD and re-computing. But as far as I understand, it takes streams from sources such as Flume or Kafka, whereas in my case the values are generated within the application itself based on user interaction. Hence I cannot see any integration point for streaming in my context.

Any suggestion on which approach is better, or any other approach suitable for this scenario?

TIA!

asked Dec 16 '14 by user1441849


People also ask

Can we update RDD in Spark?

RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can create new RDDs by performing coarse-grained operations, like transformations, on an existing RDD. An RDD in Spark can be cached and reused for future transformations, which is a huge benefit for users.

Can we edit the data of RDD?

RDDs are immutable in nature, i.e. we cannot change an RDD; instead we derive new ones by applying transformations. There are various transformations and actions that can be applied to an RDD.
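For example, a tiny illustration of deriving new RDDs (assuming an existing SparkContext `sc`):

```scala
// Each transformation produces a new RDD; the original is untouched.
val numbers = sc.parallelize(1 to 10)
val evens   = numbers.filter(_ % 2 == 0)  // transformation: new RDD
val doubled = evens.map(_ * 2)            // another transformation: another new RDD
doubled.collect()                         // action: triggers the actual computation
```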

Can we add data to RDD?

Not in place: data can only be appended by creating a new RDD, for example by taking the union of the existing RDD with an RDD containing the new records. It is like an immutable list: you cannot modify it, but you can build a second immutable list with more data.


1 Answer

The use case presented here is a good match for Spark Streaming. The two other options raise the question: "How do you submit a re-computation of the RDD?"

Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and preserve that data in RDD form. Kafka and Flume are only two possible stream sources.

You could use socket communication with the SocketInputDStream, read files in a directory using the FileInputDStream, or even use a shared queue with the QueueInputDStream. If none of those options fits your application, you could write your own InputDStream.
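Since the updates here are generated inside the application based on user interaction, the queue option may be the closest fit. A minimal sketch, assuming the same hypothetical (String, Double) update records as in the question's examples above:

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object QueueStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("queue-stream").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // The application pushes one RDD of updates per change request into this queue.
    val updateQueue  = mutable.Queue.empty[RDD[(String, Double)]]
    val updateStream = ssc.queueStream(updateQueue)

    updateStream.foreachRDD { updates =>
      // Apply the updates to the base RDD here (see the transform/updateStateByKey sketch below).
      println(s"received ${updates.count()} updates in this batch")
    }

    ssc.start()
    // Elsewhere in the application, driven by user interaction:
    updateQueue += ssc.sparkContext.parallelize(Seq("b" -> 20.0))
    ssc.awaitTermination()
  }
}
```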

In this use case, using Spark Streaming, you would read your base RDD and use the incoming DStream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform allows you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation can help you build an in-memory state addressed by keys. See the documentation for further information.
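A rough sketch of how those two operations could fit together, building on the queue-stream sketch above; the names and the key/value layout are assumptions, not the actual application code:

```scala
// Assumes `ssc`, the base RDD `historicalData: RDD[(String, Double)]`, and the
// input stream `updateStream: DStream[(String, Double)]` from the sketch above.
ssc.checkpoint("/tmp/spark-checkpoint")  // updateStateByKey requires a checkpoint directory

// Combine each batch of updates with the base RDD.
val merged = updateStream.transform { batch =>
  historicalData.leftOuterJoin(batch).mapValues {
    case (oldValue, maybeNew) => maybeNew.getOrElse(oldValue)
  }
}

// Maintain an evolving per-key state: the latest known value for every key.
val state = merged.updateStateByKey[Double] { (values: Seq[Double], current: Option[Double]) =>
  values.lastOption.orElse(current)
}

state.foreachRDD { rdd =>
  // Re-run the downstream computations (multiplication, reduction, ...) here.
  if (!rdd.isEmpty()) println(s"recomputed total = ${rdd.values.sum()}")
}
```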

Without more details on the application it is hard to go down to the code level on what's possible using Spark Streaming. I'd suggest you explore this path and ask new questions about any specific topics.

answered Oct 14 '22 by maasg