
Periodic Broadcast in Apache Spark Streaming

I am implementing a stream learner for text classification. There are some single-valued parameters in my implementation that need to be updated as new stream items arrive. For example, I want to change the learning rate as new predictions are made. However, I doubt there is a way to broadcast variables after the initial broadcast, so what happens if I need to broadcast a variable every time I update it? If there is a way to do this, or a workaround for what I want to accomplish in Spark Streaming, I'd be happy to hear about it.

Thanks in advance.

asked Feb 18 '15 by bfaskiplar


People also ask

Do Spark streaming programs run continuously?

Users specify a streaming computation by writing a batch computation (using Spark's DataFrame/Dataset API), and the engine automatically incrementalizes this computation (runs it continuously).
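As a minimal illustration of that model (a sketch, assuming a socket source on localhost:9999; any streaming source works the same way):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class StructuredStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("IncrementalQuery").getOrCreate();

        // Describe the computation as if it were a batch query over a table
        Dataset<Row> lines = spark.readStream()
                .format("socket")           // placeholder source
                .option("host", "localhost")
                .option("port", 9999)
                .load();

        // The engine incrementalizes this query and runs it continuously
        StreamingQuery query = lines.writeStream()
                .outputMode("append")
                .format("console")
                .start();

        query.awaitTermination();
    }
}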

Does Spark automatically broadcast?

In every stage, Spark automatically broadcasts the common data needed by tasks. Data broadcast this way is cached in serialized form and deserialized on each node before each task runs.

What method does Spark use to perform streaming operations?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
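A minimal DStream sketch of that flow (assuming a socket source and a 10-second batch interval):

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamSketch {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("DStreamSketch").setMaster("local[2]");
        // Incoming data is grouped into 10-second batches
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

        // Each batch of lines becomes one RDD inside the DStream
        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);
        lines.count().print();   // processed batch by batch by the Spark engine

        jssc.start();
        jssc.awaitTermination();
    }
}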

What is broadcast in Apache spark?

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
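A minimal sketch of that usage (the lookup-table contents are just placeholders):

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]"));

        // Ship a small lookup table to every executor once, instead of with every task
        Map<String, Integer> lookup = new HashMap<>();
        lookup.put("spam", 1);
        lookup.put("ham", 0);
        Broadcast<Map<String, Integer>> broadcastLookup = sc.broadcast(lookup);

        JavaRDD<String> labels = sc.parallelize(Arrays.asList("spam", "ham", "spam"));
        // Executors read the cached copy; the broadcast value itself is read-only
        JavaRDD<Integer> encoded = labels.map(word -> broadcastLookup.value().get(word));
        System.out.println(encoded.collect());

        sc.stop();
    }
}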


2 Answers

I got this working by creating a wrapper class over the broadcast variable. The wrapper's updateAndGet method returns the refreshed broadcast variable. I call this method inside dStream.transform, as allowed by the Spark documentation:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#transform-operation

The Transform Operation section states: "the supplied function gets called in every batch interval. This allows you to do time-varying RDD operations, that is, RDD operations, number of partitions, broadcast variables, etc. can be changed between batches."

The BroadcastWrapper class looks like this:

import java.util.Calendar;
import java.util.Date;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastWrapper {

    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();

    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        return JavaSparkContext.fromSparkContext(sc);
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // let's say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist();
            lastUpdatedAt = new Date(System.currentTimeMillis());

            // Your logic to refresh the reference data
            ReferenceData data = getRefData();

            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}

You can then call updateAndGet inside stream.transform, which allows RDD-to-RDD transformations:

objectStream.transform(rdd -> {

    Broadcast<ReferenceData> refData = BroadcastWrapper.getInstance().updateAndGet(rdd.context());

    /** Your code to manipulate the RDD; return the transformed RDD **/
});
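For the learning-rate use case from the question, the same pattern might look like the sketch below; it reuses BroadcastWrapper and ReferenceData from above, while the Prediction type, classify helper, and getLearningRate getter are hypothetical placeholders, not part of the original answer:

// A minimal sketch, assuming textStream is a JavaDStream<String> of documents
JavaDStream<Prediction> predictions = textStream.transform(rdd -> {
    // Re-broadcast (at most once per minute) before processing this batch
    Broadcast<ReferenceData> params =
            BroadcastWrapper.getInstance().updateAndGet(rdd.context());

    return rdd.map(text -> {
        double learningRate = params.value().getLearningRate(); // hypothetical getter
        return classify(text, learningRate);                    // hypothetical model call
    });
});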

Refer to my full answer in this post: https://stackoverflow.com/a/41259333/3166245

Hope it helps

answered Oct 02 '22 by Aastha


My understanding is that once a broadcast variable has been sent out, it is read-only. I believe you can update the broadcast variable on the local node, but not on remote nodes.

Maybe you need to consider doing this outside Spark. How about using a NoSQL store (Cassandra, etc.) or even Memcached? You could then update the variable from one task and periodically check the store from other tasks.
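A hedged sketch of that idea, assuming a JavaDStream<String> named textStream, a Redis instance reachable at localhost:6379, and a "learning-rate" key (all placeholders; any external store would do). Here the value is re-read once per batch on the driver and shipped to the tasks via the closure:

import org.apache.spark.streaming.api.java.JavaDStream;
import redis.clients.jedis.Jedis;

// Re-read the parameter from the external store once per batch instead of re-broadcasting it
textStream.foreachRDD(rdd -> {
    final double learningRate;
    try (Jedis jedis = new Jedis("localhost", 6379)) {           // placeholder host/port
        learningRate = Double.parseDouble(jedis.get("learning-rate")); // placeholder key
    }
    rdd.foreach(record -> {
        // ... use learningRate when scoring/updating the model for this record ...
    });
});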

answered Oct 02 '22 by Sujee Maniyam