How do you express denormalization joins in Apache Beam that stretch over long periods of time?

For context, I've never used Beam. I'm trying to understand how to apply the Beam model to common use cases.

Suppose you have an unbounded collection of Producers and an unbounded collection of Products, where each Product has exactly one Producer (one-to-many, Producer to Product). In addition, a Product's Producer appears before, or shortly after, its Product, but a Producer may appear years before its Product.

If you want to produce an unbounded collection of Products, each joined with its Producer, what is the appropriate way to express this? A windowed join that stretches over years seems to defeat the point of windowing, but using the Producers as a side input doesn't seem to handle the case where a Producer appears very close in time to its Product.

Is there an appropriate way to mix these two concepts?

Steven Noble asked Nov 14 '17 21:11

1 Answer

Since a Producer may appear years before its Product, you can keep the Producers in external storage (e.g. Bigtable) and write a ParDo over the Product stream that looks up each Product's Producer and performs the join. To further improve performance, you can use the stateful DoFn feature to batch the lookups (check out this blog).

You can still use windowing and CoGroupByKey to perform the join in cases where the Product data is delivered before the Producer data. The window can then be small, just large enough to handle out-of-order delivery.

Jiayuan Ma answered Oct 20 '22 19:10
