Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do Apache Flink's JoinFunction and CoGroupFunction differ?

Tags:

apache-flink

What is the difference between a JoinFunction and a CoGroupFunction in Apache Flink? How do semantics and execution differ?

like image 446
Jary zhen Avatar asked Apr 07 '16 09:04

Jary zhen


People also ask

How does KeyBy work Flink?

KeyBy is one of the mostly used transformation operator for data streams. It is used to partition the data stream based on certain properties or keys of incoming data objects in the stream. Once we apply the keyBy, all the data objects with same type of keys are grouped together.

Can Flink have multiple sinks?

It's fine to connect a source to multiple sink, the source gets executed only once and records get broadcasted to the multiple sinks.

What is a window join?

A window join adds the dimension of time into the join criteria themselves. In doing so, the window join joins the elements of two streams that share a common key and are in the same window. The semantic of window join is same to the DataStream window join.

Which of the following is the dataset transformations in Apache Flink?

A variety of transformations includes mapping, filtering, sorting, joining, grouping and aggregating. These transformations by Apache Flink are performed on distributed data.


1 Answers

Both, Join and CoGroup transformations join two inputs on key fields. The differences is how the user functions are called:

  • the Join transformation calls the JoinFunction with pairs of matching records from both inputs that have the same values for key fields. This behavior is very similar to an equality inner join.
  • the CoGroup transformation calls the CoGroupFunction with iterators over all records of both inputs that have the same values for key fields. If an input has no records for a certain key value an empty iterator is passed. The CoGroup transformation can be used, among other things, for inner and outer equality joins. It is hence more generic than the Join transformation.

Looking at the execution strategies of Join and CoGroup, Join can be executed using sort- and hash-based join strategies where as CoGroup is always executed using sort-based strategies. Hence, joins are often more efficient than cogroups and should be preferred if possible.

like image 166
Fabian Hueske Avatar answered Oct 28 '22 11:10

Fabian Hueske