How to get the cartesian product of two PCollections

Question

I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world")).

Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.

Thanks!

Ben Chambers · Accepted Answer

In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String> into a side input, and then you would have a ParDo that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String> of all the values, and you'd output the pairs (or you could in this DoFn choose to output only the pairs that line up).

This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.

Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection of values keyed by the 3-letter prefixes. You can also create the similar PCollection keyed by 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.

How to get the cartesian product of two PCollections

Tags:

google-cloud-dataflow

Youness Bennani

1 Answers

Ben Chambers

Recent Activity

Donate For Us

How to get the cartesian product of two PCollections

Tags:

google-cloud-dataflow

Youness Bennani

1 Answers

Ben Chambers

Related questions

Recent Activity

Donate For Us