Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How Apache Flink deal with skewed data?

Tags:

apache-flink

For example, I have a big stream of words and want to count each word. The problem is these words is skewed. It means that the frequency of some words would be very high, but that of most other words is low. In storm, we could use the following way to solve this issue. First do shuffle grouping on the stream, in each node count words local in a window time, at the end update counts to cumulative results. From my another question, I know that Flink only supports window on a keyed stream, otherwise the window operation will not be parallel.

My question is is there a good way to solve this kind of skewed data issue in Flink?

enter image description here

like image 653
Jun Avatar asked Jan 08 '16 16:01

Jun


1 Answers

Pre-aggregation is currently not natively supported by the DataStream API. In principle, it is possible to add a combiner-like feature for event-time windows. IMO, this would be a very valuable addition but hasn't been done yet.

However, you can implement this feature yourself. The DataStream API offers low-level operator interface which is similar to Storm Bolts. The interface is called OneInputStreamOperator. This operator type gives you full control. In fact, the built-in operators (such as Window operators) are also based on this class.

A OneInputStreamOperator can be applied like:

DataStream<Tuple2<String,Integer> inStream = ...
DataStream<String> outStream = inStream
  .transform("my op", BasicTypeInfo.STRING_TYPE_INFO, new MyOISO());
like image 166
Fabian Hueske Avatar answered Oct 09 '22 02:10

Fabian Hueske