How can I update a broadcast variable in spark streaming?

Tags: java, scala, apache-spark, broadcast, spark-streaming

I have what I believe is a fairly common use case for Spark Streaming: I have a stream of objects that I would like to filter based on some reference data.

Initially, I thought this would be very simple to achieve using a broadcast variable:

```java
public void startSparkEngine() {
    Broadcast<ReferenceData> refdataBroadcast
        = sparkContext.broadcast(getRefData());

    final JavaDStream<MyObject> filteredStream = objectStream.filter(obj -> {
        final ReferenceData refData = refdataBroadcast.getValue();
        return obj.getField().equals(refData.getField());
    });

    filteredStream.foreachRDD(rdd -> {
        rdd.foreach(obj -> {
            // Final processing of filtered objects
        });
        return null;
    });
}
```

However, my reference data will change from time to time, albeit infrequently.

I was under the impression that I could modify and re-broadcast my variable on the driver and that it would be propagated to each of the workers. However, the `Broadcast` object is not `Serializable` and needs to be `final`.

What alternatives do I have? The three solutions I can think of are:

1. Move the reference data lookup into a `foreachPartition` or `foreachRDD` so that it resides entirely on the workers. However, the reference data lives behind a REST API, so I would also need to somehow store a timer or counter to stop the remote service being accessed for every element in the stream.
2. Restart the Spark context, with a new broadcast variable, every time the reference data changes.
3. Convert the reference data to an RDD, then `join` the streams so that I am now streaming `Pair<MyObject, RefData>`, though this will ship the reference data with every object.
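
For the first option, the timer could be a small time-based cache held once per executor JVM. The following is a minimal sketch under assumptions and is not part of the original question: `TtlCache`, its `loader`, and the injected `clock` are hypothetical names, and the loader stands in for the REST call.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/**
 * Hypothetical helper: keeps one cached value and reloads it
 * once it is older than ttlMillis. The clock is injected so the
 * expiry logic can be exercised without waiting in real time.
 */
final class TtlCache<T> {
    private final Supplier<T> loader;   // stands in for the REST lookup
    private final long ttlMillis;       // maximum age of the cached value
    private final LongSupplier clock;   // e.g. System::currentTimeMillis

    private T value;
    private long loadedAt;
    private boolean loaded;

    TtlCache(Supplier<T> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, reloading it first if it has expired. */
    synchronized T get() {
        long now = clock.getAsLong();
        if (!loaded || now - loadedAt > ttlMillis) {
            value = loader.get();
            loadedAt = now;
            loaded = true;
        }
        return value;
    }
}
```

On an executor, an instance of this could live in a static field and be read inside `foreachPartition`, so the remote API would be called at most once per TTL per JVM rather than once per element. The trade-off is that each executor refreshes on its own schedule, so workers may briefly see different versions of the reference data.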


asked Oct 27 '15 15:10

Andrew Stubbs

2 Answers

Extending the answer by @Rohan Aletty, here is sample code for a BroadcastWrapper that refreshes the broadcast variable based on a TTL:

```java
import java.util.Calendar;
import java.util.Date;

import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class BroadcastWrapper {

    private Broadcast<ReferenceData> broadcastVar;
    private Date lastUpdatedAt = Calendar.getInstance().getTime();

    // Single instance, created on the driver
    private static BroadcastWrapper obj = new BroadcastWrapper();

    private BroadcastWrapper() {}

    public static BroadcastWrapper getInstance() {
        return obj;
    }

    public JavaSparkContext getSparkContext(SparkContext sc) {
        JavaSparkContext jsc = JavaSparkContext.fromSparkContext(sc);
        return jsc;
    }

    public Broadcast<ReferenceData> updateAndGet(SparkContext sparkContext) {
        Date currentDate = Calendar.getInstance().getTime();
        long diff = currentDate.getTime() - lastUpdatedAt.getTime();
        if (broadcastVar == null || diff > 60000) { // Let's say we want to refresh every 1 min = 60000 ms
            if (broadcastVar != null)
                broadcastVar.unpersist(); // drop the stale copies cached on the executors
            lastUpdatedAt = new Date(System.currentTimeMillis());

            // Your logic to refresh (getRefData() is a placeholder for it)
            ReferenceData data = getRefData();

            broadcastVar = getSparkContext(sparkContext).broadcast(data);
        }
        return broadcastVar;
    }
}
```

Your code would then look like this:

```java
public void startSparkEngine() {

    // transform hands over each batch's RDD; this lambda runs on the driver once per batch
    final JavaDStream<MyObject> filteredStream = objectStream.transform(rdd -> {
        Broadcast<ReferenceData> refdataBroadcast = BroadcastWrapper.getInstance().updateAndGet(rdd.context());

        return rdd.filter(obj -> obj.getField().equals(refdataBroadcast.getValue().getField()));
    });

    filteredStream.foreachRDD(rdd -> {
        rdd.foreach(obj -> {
            // Final processing of filtered objects
        });
        return null; // only needed where foreachRDD expects a Function returning Void (Spark 1.x)
    });
}
```

This worked for me on a multi-cluster setup as well. Hope this helps.

answered Sep 27 '22 20:09

Aastha

I recently faced this issue and thought this might be helpful for Scala users.

The Scala way of writing a BroadcastWrapper is shown in the example below.

```scala
import java.io.{ ObjectInputStream, ObjectOutputStream }
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.streaming.StreamingContext
import scala.reflect.ClassTag

/* wrapper lets us update broadcast variables within DStreams' foreachRDD
   without running into serialization issues */
case class BroadcastWrapper[T: ClassTag](
    @transient private val ssc: StreamingContext,
    @transient private val _v: T) {

  @transient private var v = ssc.sparkContext.broadcast(_v)

  def update(newValue: T, blocking: Boolean = false): Unit = {
    v.unpersist(blocking)
    v = ssc.sparkContext.broadcast(newValue)
  }

  def value: T = v.value

  private def writeObject(out: ObjectOutputStream): Unit = {
    out.writeObject(v)
  }

  private def readObject(in: ObjectInputStream): Unit = {
    v = in.readObject().asInstanceOf[Broadcast[T]]
  }
}
```

Call the `update` function every time you need a new broadcast variable.

answered Sep 27 '22 18:09

Ram Ghadiyaram
