I'm very new to using Google Cloud Dataflow. I would like to get the Cartesian product of two PCollections. For example, if I have two PCollections (1, 2)
and ("hello", "world")
, their Cartesian product is ((1, "hello"), (1, "world"), (2, "hello"), (2, "world"))
.
Any ideas how I could do that? Also, since the Cartesian product could be large, I'm hoping the solution would lazily create the product and thus avoid huge memory consumption.
Thanks!
In general, computing the cartesian product will be expensive. If either (or both) of the collections fit in memory, you can use side-inputs to broadcast the data to all of the workers. So for your example, you'd turn the PCollection<String>
into a side input, and then you would have a ParDo
that took it as the main input. For each string on the main input, you could access the side-input which had an Iterable<String>
of all the values, and you'd output the pairs (or you could in this DoFn
choose to output only the pairs that line up).
This will to re-iterate over the entire set of words each time -- if it fits in memory this should be fine. If it has to re-fetch the side input data every time it could be problematic.
Another approach would be to rely on shuffling and keys. Say you wanted to find words with a 3-letter overlap. You can process the dictionary and produced a PCollection
of values keyed by the 3-letter prefixes. You can also create the similar PCollection
keyed by 3-letter suffixes. Then you can GroupByKey
(or CoGroupByKey
). After that, you have for each 3-letter key, all the words with that as a prefix and that as a suffix.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With