Can I use Flink state to perform a join?

I am evaluating Apache Flink for stream processing as a replacement for/complement to Apache Spark. One of the tasks we usually solve with Spark is data enrichment.

That is, I have a stream of data from IoT sensors carrying a sensor ID, and a set of sensor metadata. I want to transform the input stream into a stream of sensor measurement + sensor metadata pairs.

In Spark I can join a DStream with an RDD:

case class SensorValue(sensorId: Long, ...)
case class SensorMetadata(sensorId: Long, ...)

val sensorInput: DStream[SensorValue] = readEventsFromKafka()
val staticMetadata: RDD[(Long, SensorMetadata)] =
  spark.read.json(...).as[SensorMetadata]
    .map { s => (s.sensorId, s) }.rdd
val joined: DStream[(SensorValue, SensorMetadata)] =
  sensorInput.map { s => (s.sensorId, s) }.transform { rdd: RDD[(Long, SensorValue)] =>
    rdd.join(staticMetadata)
       .map { case (_, (s, m)) => (s, m) } // get rid of the nested tuple
  }

Can I do the same trick with Apache Flink? I don't see a direct API for this. The only idea I have is to use a stateful transformation: I can merge the metadata and sensor events into a single stream and use Flink's state storage to store the metadata (pseudocode):

val sensorInput: DataStream[SensorValue] = readEventsFromKafka()
val staticMetadata: DataStream[SensorMetadata] = readMetadataFromJson()
val result: DataStream[(SensorValue, SensorMetadata)] =
  sensorInput.keyBy("sensorId")
    .connect(staticMetadata.keyBy("sensorId"))
    .flatMap(new RichCoFlatMapFunction[SensorValue, SensorMetadata, (SensorValue, SensorMetadata)] {
      private var md: ValueState[SensorMetadata] = _
      override def open(parameters: Configuration): Unit = {
        md = getRuntimeContext.getState(
          new ValueStateDescriptor[SensorMetadata]("metadata", classOf[SensorMetadata]))
      }
      override def flatMap1(s: SensorValue, out: Collector[(SensorValue, SensorMetadata)]): Unit =
        out.collect((s, md.value))
      override def flatMap2(m: SensorMetadata, out: Collector[(SensorValue, SensorMetadata)]): Unit =
        md.update(m)
    })

Is this a correct approach? Can I use it at larger scale, when the metadata doesn't fit on one machine?

Thanks

asked Oct 18 '16 by Yura Taras


1 Answer

Using a CoFlatMapFunction to do the join is a common approach. However, it has one significant drawback: the function is called whenever a record arrives on either input, and you cannot control which input is consumed first. So in the beginning, you will have to handle sensor events for which the metadata has not been read yet. One approach is to buffer all events of one input until the other input has been consumed (see the sketch below).

On the other hand, the CoFlatMapFunction approach has the benefit that you can update the metadata dynamically. In your code example, both inputs are keyed on the join key. That means the input is partitioned, and each task slot processes a different key set. Hence, your metadata can be larger than what a single machine can handle (if you configure the RocksDB state backend, state is persisted to disk, so you are not even bound by the size of the memory).
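As a minimal sketch of the buffering idea (an illustration, not code from this answer), reusing the SensorValue/SensorMetadata types from the question; the state names and the flush logic are assumptions:

import org.apache.flink.api.common.state.{ListState, ListStateDescriptor, ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._

class BufferingJoin
    extends RichCoFlatMapFunction[SensorValue, SensorMetadata, (SensorValue, SensorMetadata)] {
  private var md: ValueState[SensorMetadata] = _
  private var buffered: ListState[SensorValue] = _

  override def open(parameters: Configuration): Unit = {
    md = getRuntimeContext.getState(
      new ValueStateDescriptor[SensorMetadata]("metadata", classOf[SensorMetadata]))
    buffered = getRuntimeContext.getListState(
      new ListStateDescriptor[SensorValue]("buffered-events", classOf[SensorValue]))
  }

  override def flatMap1(s: SensorValue, out: Collector[(SensorValue, SensorMetadata)]): Unit =
    if (md.value != null) out.collect((s, md.value))
    else buffered.add(s) // no metadata for this key yet: hold the event back

  override def flatMap2(m: SensorMetadata, out: Collector[(SensorValue, SensorMetadata)]): Unit = {
    md.update(m)
    buffered.get.asScala.foreach(e => out.collect((e, m))) // flush events held for this key
    buffered.clear()
  }
}

Buffering trades state size for completeness: no early event is dropped, but a key whose metadata never arrives keeps its buffered events forever, so in practice you may want a timeout that evicts or emits stale buffers.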

If you require that all metadata is present when the job starts, and if the metadata is static (it does not change) and small enough to fit into a single machine, you can instead use a regular FlatMapFunction and load the metadata from a file in its open() method. In contrast to your approach, this would be a broadcast join, where each task slot holds the complete metadata in memory. Besides having all metadata available when the event data is consumed, this approach has the benefit that you do not need to shuffle the event data, because it can be joined on any machine.
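A sketch of this broadcast-style variant (again an illustration rather than code from the answer); the file path, the parseMetadataLine helper, and the choice to drop events with unknown sensor IDs are assumptions:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector
import scala.io.Source

class MetadataEnricher(metadataPath: String)
    extends RichFlatMapFunction[SensorValue, (SensorValue, SensorMetadata)] {
  private var metadata: Map[Long, SensorMetadata] = _

  override def open(parameters: Configuration): Unit = {
    // Every parallel instance loads the full (small, static) metadata set.
    val src = Source.fromFile(metadataPath)
    try {
      metadata = src.getLines().map(parseMetadataLine).map(m => m.sensorId -> m).toMap
    } finally src.close()
  }

  // Hypothetical parser: adapt to the actual metadata file format.
  private def parseMetadataLine(line: String): SensorMetadata = ???

  override def flatMap(s: SensorValue, out: Collector[(SensorValue, SensorMetadata)]): Unit =
    metadata.get(s.sensorId).foreach(m => out.collect((s, m))) // unknown sensors are dropped
}

// No keyBy is needed, so the event stream is not shuffled:
// val result = sensorInput.flatMap(new MetadataEnricher("/path/to/metadata.json"))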

answered Oct 10 '22 by Fabian Hueske