
Partition data coming from a CSV so I can process larger batches rather than individual lines

I am just getting started with Google Cloud Dataflow. I have written a simple flow that reads a CSV file from Cloud Storage. One of the steps involves calling a web service to enrich the results. The web service in question performs much better when sent several hundred requests in bulk.

Looking at the API, I don't see a great way to aggregate 100 elements of a PCollection into a single ParDo execution. The results would then need to be split up again to handle the last step of the flow, which is writing to a BigQuery table.

I'm not sure whether windowing is what I want. Most of the windowing examples I see are geared toward counting over a given time period.
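Roughly, the flow looks like the sketch below (bucket, table, and DoFn names are placeholders, in the Dataflow Java SDK 1.x style of the time; EnrichFn would need to emit TableRows for the BigQuery write):

Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from("gs://my-bucket/input.csv"))          // read CSV lines
 .apply(ParDo.of(new EnrichFn()))                              // currently one web-service call per element
 .apply(BigQueryIO.Write.to("my-project:my_dataset.my_table")
     .withSchema(tableSchema));                                // write enriched rows
p.run();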

asked May 11 '15 by Jeffrey Ellin


1 Answer

You can buffer elements in a local member variable of your DoFn, call your web service whenever the buffer is large enough, and flush whatever remains in finishBundle. For example:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import java.util.ArrayList;
import java.util.List;

class CallServiceFn extends DoFn<String, String> {
  // Maximum number of elements to send to the web service in one call.
  private static final int MAX_CALL_SIZE = 100;

  // Elements buffered so far in the current bundle.
  private final List<String> elements = new ArrayList<>();

  @Override
  public void processElement(ProcessContext c) {
    elements.add(c.element());
    if (elements.size() >= MAX_CALL_SIZE) {
      flush(c);
    }
  }

  @Override
  public void finishBundle(Context c) {
    // Flush whatever is still buffered when the bundle ends.
    flush(c);
  }

  // Sends the buffer to the web service in one bulk call and emits
  // each result; callServiceWithData is your bulk client method.
  private void flush(Context c) {
    if (elements.isEmpty()) {
      return;
    }
    for (String result : callServiceWithData(elements)) {
      c.output(result);
    }
    elements.clear();
  }
}
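To wire this in, apply the fn with ParDo in place of the per-element call (a sketch; `lines` is assumed to be the PCollection of CSV lines):

PCollection<String> enriched = lines.apply(ParDo.of(new CallServiceFn()));

Note that finishBundle runs at the end of every bundle, so batches at bundle boundaries may be smaller than MAX_CALL_SIZE; the web service just needs to tolerate variable batch sizes.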
answered Dec 04 '22 by danielm