
Skipping header rows - is it possible with Cloud DataFlow?

I've created a pipeline that reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (the field names).

Is there any way to programmatically set the "number of header rows to skip", like you can in BQ when loading data?

Graham Polley asked Feb 11 '15 09:02


1 Answer

This is not currently possible. It sounds like there are two potential requests here:

  • Specifying presence and skip behavior for header lines for a BigQuery import.
  • Specifying that a GCS text source should skip a header line.

Future work on this is tracked in https://issues.apache.org/jira/browse/BEAM-123.

In the meantime, you could add a simple filter step to your pipeline to skip headers. Something like this:

PCollection<X> rows = ...;
PCollection<X> nonHeaders =
   rows.apply(Filter.by(new MatchIfNonHeader()));
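The snippet above leaves MatchIfNonHeader undefined. A minimal sketch of what such a predicate might look like follows, assuming the rows are plain text lines (String rather than X) and that the exact header text is known up front. The class body and the constructor argument are hypothetical and only illustrate the idea.

```java
import java.io.Serializable;

// Hypothetical predicate: returns true for every line except the header.
// Assumption: rows are plain text lines and the header text is known in advance.
final class MatchIfNonHeader implements Serializable {
    private static final long serialVersionUID = 1L;

    // The exact header line to drop, e.g. "id,name".
    private final String headerLine;

    MatchIfNonHeader(String headerLine) {
        this.headerLine = headerLine;
    }

    // Keep the line (true) unless it is identical to the header.
    public Boolean apply(String line) {
        return !headerLine.equals(line);
    }
}
```

To pass it to Filter.by, the class would presumably also need to declare that it implements the SDK's SerializableFunction<String, Boolean>, whose apply method has this same shape. Matching on content rather than position is deliberate: a file may be split and read in parallel, so "the first line" is not a reliable notion inside a transform. The trade-off is that any data row identical to the header is dropped as well.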
Sam McVeety answered Oct 14 '22 00:10