Using Dataflow to Remove Duplicates

I have a large data file (1 TB) to import into BigQuery. Each line contains a key. While importing the data and creating my PCollection to export to BigQuery, I'd like to ensure that I am not importing duplicate records based on this key value. What would be the most efficient way to do this in my Java program using Dataflow?

Thanks

Alex Harvey asked Dec 25 '22 23:12


2 Answers

The GroupByKey transform in Dataflow allows arbitrary groupings, which can be leveraged to remove records with duplicate keys from a PCollection.

The most generic approach to this problem would be:

  • read from your source file, producing a PCollection of input records,
  • use a ParDo transform to separate keys and values, producing a PCollection of KV<K, V>,
  • perform a GroupByKey operation on it, producing a PCollection of KV<K, Iterable<V>>,
  • use a ParDo transform to select which of the values mapped to a given key should be written, producing a PCollection of KV<K, V>,
  • use a ParDo transform to format the data for writing,
  • finally, write the results to BigQuery or any other sink.
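
To get a feel for these steps before wiring them into a pipeline, the same sequence can be tried on an in-memory list. This is only a rough sketch in plain Java, not Dataflow SDK code: the class name, the method names and the tab-separated "<key>\t<payload>" record layout are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Local, in-memory illustration of the steps above; not Dataflow SDK code.
class DedupByKeySketch {

  // Assumed record layout: "<key>\t<payload>". A line without a tab is its own key.
  static String keyOf(String line) {
    int tab = line.indexOf('\t');
    return tab < 0 ? line : line.substring(0, tab);
  }

  static List<String> dedupByKey(List<String> lines) {
    // Separate out the key and group the records that share it.
    Map<String, List<String>> grouped = new LinkedHashMap<>();
    for (String line : lines) {
      grouped.computeIfAbsent(keyOf(line), k -> new ArrayList<>()).add(line);
    }
    // Select one record per key; here simply the first one seen.
    List<String> result = new ArrayList<>();
    for (List<String> group : grouped.values()) {
      result.add(group.get(0));
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> input = Arrays.asList("a\t1", "b\t2", "a\t3");
    // Keeps one record for key "a" and one for key "b".
    System.out.println(dedupByKey(input));
  }
}
```

Note that "first one seen" is only meaningful in this local version. In a distributed GroupByKey the order of the grouped values is not something to rely on, so a real pipeline would usually want an explicit rule for which record wins.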

Some of these steps may be omitted if you are solving a special case of the generic problem.

In particular, if the entire record is considered a key, the problem can be simplified to just running a Count transform and iterating over the resulting PCollection.
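
The idea behind the Count shortcut is that a per-record count produces one entry per distinct record, so the keys of the result are already the deduplicated data. A small plain-Java imitation of that idea follows; it does not use the Dataflow SDK, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java imitation of "count per record, then keep the keys".
class CountDedupSketch {

  // One map entry per distinct record, valued with its number of occurrences.
  static Map<String, Long> countPerRecord(List<String> records) {
    Map<String, Long> counts = new LinkedHashMap<>();
    for (String record : records) {
      counts.merge(record, 1L, Long::sum);
    }
    return counts;
  }

  // The distinct records are simply the keys of the count map.
  static List<String> distinctRecords(List<String> records) {
    return new ArrayList<>(countPerRecord(records).keySet());
  }

  public static void main(String[] args) {
    List<String> records = Arrays.asList("x", "y", "x");
    System.out.println(countPerRecord(records));
    System.out.println(distinctRecords(records));
  }
}
```

The counts themselves are a by-product here; they can be handy if you also want to report how many duplicates were dropped.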

Here's an approximate code example for GroupByKey:

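import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.GroupByKey;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;
import com.google.cloud.dataflow.sdk.values.PCollection;

PCollection<KV<String, Doc>> urlDocPairs = ...;
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
    urlDocPairs.apply(GroupByKey.<String, Doc>create());
PCollection<KV<String, Doc>> results = urlToDocs.apply(
    ParDo.of(new DoFn<KV<String, Iterable<Doc>>, KV<String, Doc>>() {
      @Override
      public void processElement(ProcessContext c) {
        String url = c.element().getKey();
        Iterable<Doc> docsWithThatUrl = c.element().getValue();
        // Emit the url with one element from the Iterable<Doc>; here, whichever comes first.
        c.output(KV.of(url, docsWithThatUrl.iterator().next()));
      }
    }));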
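
If it matters which of the duplicates survives, the selection step can apply an explicit ordering instead of taking an arbitrary element. Here is a possible helper for that, as a sketch only: `PickOne` and `pick` are hypothetical names, and the comparator stands in for whatever rule fits your data (latest timestamp, longest payload, and so on).

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;

// Hypothetical helper for choosing one representative from a group of values.
class PickOne {

  // Returns the smallest element under the given ordering, or null if there are no values.
  static <T> T pick(Iterable<T> values, Comparator<? super T> order) {
    Iterator<T> it = values.iterator();
    if (!it.hasNext()) {
      return null;
    }
    T best = it.next();
    while (it.hasNext()) {
      T candidate = it.next();
      if (order.compare(candidate, best) < 0) {
        best = candidate;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Integers stand in for the grouped values of a single key.
    System.out.println(pick(Arrays.asList(3, 1, 2), Comparator.<Integer>naturalOrder()));
  }
}
```

A helper like this could be called from inside the DoFn on the grouped values, so the result no longer depends on the order in which they happen to arrive.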
Sam McVeety answered Mar 24 '23 20:03


The SDK's RemoveDuplicates transform might also be worth a look:

https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/RemoveDuplicates

Reza Rokni answered Mar 24 '23 21:03