I have a large (1 TB) data file to import into BigQuery. Each line contains a key. While importing the data and creating my PCollection to export to BigQuery, I'd like to ensure that I am not importing duplicate records based on this key value. What would be the most efficient approach to doing this in my Java program using Dataflow?
Thanks
The GroupByKey transform in Dataflow allows arbitrary groupings, which can be leveraged to remove duplicate keys from a PCollection.
The most generic approach to this problem would be:
- read the file into a PCollection of records;
- extract the key from each record, producing a PCollection of KV<key, record> pairs;
- apply GroupByKey, producing a PCollection of KV<key, Iterable<record>>;
- for each key, pick one record from the Iterable and discard the rest;
- write the surviving records to BigQuery.
Some of these steps may be omitted if you are solving a particular special case of the generic problem.
In particular, if the entire record is considered the key, the problem can be simplified to just running a Count transform and iterating over the resulting PCollection.
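Here is a minimal sketch of that special case, assuming each record is a plain String and the whole line serves as the key (the name lines is just a placeholder):

PCollection<String> lines = ...;
// Count occurrences of each distinct line, then emit one copy of each.
PCollection<String> distinctLines = lines
    .apply(Count.<String>perElement())
    .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {
      @Override
      public void processElement(ProcessContext c) {
        c.output(c.element().getKey());
      }
    }));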
Here's an approximate code example for GroupByKey:
PCollection<KV<String, Doc>> urlDocPairs = ...;
PCollection<KV<String, Iterable<Doc>>> urlToDocs =
    urlDocPairs.apply(GroupByKey.<String, Doc>create());
PCollection<KV<String, Doc>> results = urlToDocs.apply(
    ParDo.of(new DoFn<KV<String, Iterable<Doc>>, KV<String, Doc>>() {
      @Override
      public void processElement(ProcessContext c) {
        String url = c.element().getKey();
        Iterable<Doc> docsWithThatUrl = c.element().getValue();
        // Emit the url paired with one arbitrary Doc from the group,
        // discarding the remaining duplicates.
        c.output(KV.of(url, docsWithThatUrl.iterator().next()));
      }
    }));
The built-in RemoveDuplicates transform might also be worth a look:
https://cloud.google.com/dataflow/java-sdk/JavaDoc/com/google/cloud/dataflow/sdk/transforms/RemoveDuplicates
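A minimal sketch of using it, assuming the whole record (here a String) is the deduplication key:

PCollection<String> records = ...;
// Removes exact duplicate elements across the entire PCollection.
PCollection<String> deduplicated =
    records.apply(RemoveDuplicates.<String>create());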