I have a large (1 TB) data file to import into BigQuery. Each line contains a key. While importing the data and creating my PCollection to export to BigQuery, I'd like to ensure that I am not importing duplicate records based on this key value. What would be the most efficient approach to doing this in my Java program using Dataflow?
Thanks
The GroupByKey transform in Dataflow allows arbitrary groupings, which can be leveraged to remove duplicate keys from a PCollection.
The most generic approach to this problem would be:
- read the file into a PCollection of records;
- extract the key from each record, producing a PCollection of KV<key, record> pairs;
- apply GroupByKey, producing a PCollection of KV<key, Iterable<record>>;
- for each key, pick one record from the Iterable and discard the rest;
- write the surviving records to BigQuery.
Some of these steps may be omitted if you are solving a particular special case of the generic problem.
In particular, if the entire record is considered the key, the problem can be simplified to just running a Count transform and iterating over the resulting PCollection.
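Here is a minimal sketch of that special case, assuming each record is a plain String and the whole line serves as the key (the name lines is just a placeholder):

PCollection<String> lines = ...;
// Count occurrences of each distinct line, then emit one copy of each.
PCollection<String> distinctLines = lines
    .apply(Count.<String>perElement())
    .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().getKey());
      }
    }));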
Here's an approximate code example for GroupByKey:
PCollection<KV<String, Doc>> urlDocPairs = ...;
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
    urlDocPairs.apply(GroupByKey.<String, Doc>create());
PCollection<KV<String, Doc>> results = urlToDocs.apply(
    ParDo.of(new DoFn<KV<String, Iterable<Doc>>, KV<String, Doc>>() {
      @Override
      public void processElement(ProcessContext c) {
        String url = c.element().getKey();
        Iterable<Doc> docsWithThatUrl = c.element().getValue();
        // Emit the url paired with one arbitrary Doc from the group,
        // discarding the remaining duplicates.
        c.output(KV.of(url, docsWithThatUrl.iterator().next()));
      }
    }));
The built-in RemoveDuplicates transform might also be worth a look:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/RemoveDuplicates
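A minimal sketch of using it, assuming the whole record (here a String) is the deduplication key:

PCollection<String> records = ...;
// Removes exact duplicate elements across the entire PCollection.
PCollection<String> deduplicated =
    records.apply(RemoveDuplicates.<String>create());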