Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to get the cartesian product of two PCollections

I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2) and ("hello", "world"), their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world")).

Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.

Thanks!

like image 588
Youness Bennani Avatar asked Jan 26 '16 07:01

Youness Bennani


1 Answers

In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String> into a side input, and then you would have a ParDo that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String> of all the values, and you'd output the pairs (or you could in this DoFn choose to output only the pairs that line up).

This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.

Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection of values keyed by the 3-letter prefixes. You can also create the similar PCollection keyed by 3-letter suffixes. Then you can GroupByKey (or CoGroupByKey). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.

like image 164
Ben Chambers Avatar answered Oct 05 '22 09:10

Ben Chambers