 

Complex join with Google Dataflow

I'm a newbie trying to understand how we might rewrite a batch ETL process using Google Dataflow. I've read some of the docs and run a few examples.

I'm proposing that the new ETL process would be driven by business events (i.e. a source PCollection). Each event would trigger the ETL process for that particular business entity. The ETL process would extract datasets from source systems and then pass those results (PCollections) on to the next processing stage. The processing stages would involve various types of joins (including cartesian and non-key joins, e.g. date-banded joins).

So a couple of questions here:

(1) Is the approach that I'm proposing valid and efficient? If not, what would be better? I haven't seen any presentations on real-world complex ETL processes using Google Dataflow, only simple scenarios.

Are there any "higher-level" ETL products that are a better fit? I've been keeping an eye on Spark and Flink for a while.

Our current ETL is moderately complex, though there are only about 30 core tables (classic EDW dimensions and facts), and ~1000 transformation steps. Source data is complex (roughly 150 Oracle tables).

(2) How would the complex non-key joins be handled?

I'm attracted to Google Dataflow first and foremost because it is an API, and the parallel processing capabilities seem a very good fit (we are being asked to move from overnight batch to incremental processing).

A good worked example of Dataflow for this use case would really push adoption forward!

Thanks, Mike S

Asked Jan 27 '16 by Mike Smith


1 Answer

It sounds like Dataflow would be a good fit. We allow you to write a pipeline that takes a PCollection of business events and performs the ETL. The pipeline could either be batch (executed periodically) or streaming (executed whenever input data arrives).
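For illustration, a minimal event-driven pipeline skeleton might look like the sketch below (Apache Beam-style Java; the Pub/Sub topic name is a placeholder, and the downstream ETL stages are left as comments):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class EventDrivenEtl {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);

    // Each message on the (hypothetical) topic represents one business event
    // that should trigger the ETL for that entity.
    PCollection<String> events = p.apply("ReadEvents",
        PubsubIO.readStrings().fromTopic("projects/my-project/topics/business-events"));

    // ... parse events, extract from source systems, join, transform, write results ...

    p.run();
  }
}
```

Running the same transforms over a bounded source (e.g. files or a database export) instead of Pub/Sub gives you the periodic batch variant.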

The various joins are, for the most part, straightforward to express in Dataflow. For the cartesian product, you can use side inputs to make the contents of one PCollection available as an input to the processing of each element in another PCollection.
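For example, a rough sketch of a cartesian-style join using a side input (Apache Beam-style Java; `left` and `right` are placeholder PCollection<String>s that would come from earlier pipeline steps):

```java
// Sketch only: "right" is materialized as a side input and paired with every "left" element.
final PCollectionView<List<String>> rightView =
    right.apply("RightAsSideInput", View.asList());

PCollection<KV<String, String>> crossed = left.apply("CrossJoin",
    ParDo.of(new DoFn<String, KV<String, String>>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        // Pair the current left element with every right element.
        for (String r : c.sideInput(rightView)) {
          c.output(KV.of(c.element(), r));
        }
      }
    }).withSideInputs(rightView));
```

Note that this assumes the side-input collection is small enough to be broadcast to every worker.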

You can also look at using GroupByKey or CoGroupByKey to implement the joins. These bring multiple inputs together and allow you to access all values with the same key in one place. You can also use Combine.perKey to compute associative and commutative combinations of all the elements associated with a key (e.g., SUM, MIN, MAX, AVERAGE, etc.).
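As a rough sketch of a keyed join with CoGroupByKey (Apache Beam-style Java; the keyed inputs `ordersByCustomer`, `customersById`, and `amountsByKey` are hypothetical placeholders):

```java
// Sketch only: both inputs are PCollection<KV<String, String>> keyed by a shared customer id.
final TupleTag<String> orderTag = new TupleTag<>();
final TupleTag<String> customerTag = new TupleTag<>();

PCollection<KV<String, CoGbkResult>> grouped =
    KeyedPCollectionTuple.of(orderTag, ordersByCustomer)
        .and(customerTag, customersById)
        .apply("JoinByCustomerId", CoGroupByKey.create());

PCollection<String> joined = grouped.apply("EmitPairs",
    ParDo.of(new DoFn<KV<String, CoGbkResult>, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        CoGbkResult result = c.element().getValue();
        // All order and customer records sharing this key are available together.
        for (String order : result.getAll(orderTag)) {
          for (String customer : result.getAll(customerTag)) {
            c.output(c.element().getKey() + "," + order + "," + customer);
          }
        }
      }
    }));

// Combine.perKey-style aggregation, e.g. summing amounts per key:
PCollection<KV<String, Long>> totals =
    amountsByKey.apply("SumPerKey", Sum.longsPerKey());
```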

Date-banded joins sound like they would be a good fit for windowing, which allows you to write a pipeline that consumes windows of data (e.g., hourly windows, daily windows, 7-day windows that slide every day, etc.).
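For instance, a minimal sketch of applying windows before a keyed aggregation or join (Apache Beam-style Java; `events` is an assumed timestamped PCollection<KV<String, Long>>):

```java
// Sketch only: elements must carry event timestamps for windowing to apply.
// Fixed daily windows:
PCollection<KV<String, Long>> daily = events.apply("DailyWindows",
    Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardDays(1))));

// 7-day windows that slide every day:
PCollection<KV<String, Long>> weekly = events.apply("SlidingWeekly",
    Window.<KV<String, Long>>into(
        SlidingWindows.of(Duration.standardDays(7)).every(Duration.standardDays(1))));

// Downstream GroupByKey / CoGroupByKey / Combine.perKey then operate per key *and* per window,
// which is one way to express a date-banded join or aggregation.
PCollection<KV<String, Long>> dailyTotals = daily.apply("DailySums", Sum.longsPerKey());
```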


Edit: Mention GroupByKey and CoGroupByKey.

Answered by Ben Chambers