
Reading BigQuery federated table as source in Dataflow throws an error

I have a federated source in BigQuery which is pointing to some CSV files in GCS.

When I try to read the federated BigQuery table as a source for a Dataflow pipeline, it throws the following error:

    1226 [main] ERROR com.google.cloud.dataflow.sdk.util.BigQueryTableRowIterator  - Error reading from BigQuery table Federated_test_dataflow of dataset CPT_7414_PLAYGROUND : 400 Bad Request
    {
      "code" : 400,
      "errors" : [ {
        "domain" : "global",
        "message" : "Cannot list a table of type EXTERNAL.",
        "reason" : "invalid"
      } ],
      "message" : "Cannot list a table of type EXTERNAL."
    }

Does Dataflow not support federated sources in BigQuery, or am I doing something wrong? I do know that I could read the files from GCS directly into my pipeline, but I'd prefer to work with BigQuery TableRow objects instead due to the design of the application.

    PCollection<TableRow> results = pipeline
        .apply("fed-test", BigQueryIO.Read.from("<project_id>:CPT_7414_PLAYGROUND.Federated_test_dataflow"))
        .apply(ParDo.of(new DoFn<TableRow, TableRow>() {
            @Override
            public void processElement(ProcessContext c) throws Exception {
                // Print each row read from the federated table.
                System.out.println(c.element());
            }
        }));

Graham Polley asked Jan 07 '23 09:01


2 Answers

As Michael says, BigQuery does not support reading directly from EXTERNAL (federated) tables or VIEWs: reading from either one effectively requires running a query.

To read from these tables in Dataflow, you can instead use

    BigQueryIO.Read.fromQuery("SELECT * FROM table_or_view_name")

which will issue the query, save the result to a temporary table, and then begin the read process. Of course, this will incur the cost of running the query on BigQuery, so if you wish to read from the same VIEW or EXTERNAL table repeatedly, you may want to create a regular table from it manually and read from that instead.
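As a rough sketch of how the query string for the table in the question might be put together: the helper below is hypothetical (it is not part of the Dataflow SDK), `my-project` is a placeholder project id, and the square-bracket table reference assumes the query is run as legacy SQL.

```java
// Hypothetical helper, not part of the Dataflow SDK.
class FederatedQuery {
    // Wraps a "project:dataset.table" spec in a legacy SQL SELECT * query.
    static String selectAll(String tableSpec) {
        if (tableSpec == null || tableSpec.isEmpty()) {
            throw new IllegalArgumentException("tableSpec must not be empty");
        }
        return "SELECT * FROM [" + tableSpec + "]";
    }

    public static void main(String[] args) {
        // "my-project" is a placeholder; the dataset and table names come from the question.
        String query = selectAll("my-project:CPT_7414_PLAYGROUND.Federated_test_dataflow");
        System.out.println(query);
        // In the pipeline from the question, the table read would then become:
        // pipeline.apply("fed-test", BigQueryIO.Read.fromQuery(query))
    }
}
```

The rest of the pipeline should not need to change, since the read still produces TableRow elements.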

Dan Halperin answered Jan 18 '23 23:01


The Dataflow BigQuery source was designed to read BigQuery managed tables of type "TABLE". (The type definition can be found at https://cloud.google.com/bigquery/docs/reference/v2/tables#type.) EXTERNAL and VIEW tables are not supported.

The BigQuery "federated table" feature allows BigQuery to directly query data in places like Google Cloud Storage. Dataflow can also read files from Google Cloud Storage, so you should be able to point your Dataflow computation directly at the sources you want to read.
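If the rest of the application expects TableRow objects, each CSV line read from GCS could be mapped to a row of field name and value pairs. The following is only a sketch under stated assumptions: the field names `id` and `name` are made up, a plain `Map` stands in for TableRow (which is itself a map of field names to values), and the parsing is a naive split that does not handle quoted fields or embedded commas.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the Dataflow SDK.
class CsvLineToRow {
    // Maps one CSV line to field name -> value pairs, in column order.
    // Naive parsing: assumes no quoted fields and no commas inside values.
    static Map<String, Object> toRow(String[] fieldNames, String line) {
        String[] values = line.split(",", -1); // -1 keeps trailing empty fields
        if (values.length != fieldNames.length) {
            throw new IllegalArgumentException(
                "expected " + fieldNames.length + " fields but got " + values.length);
        }
        Map<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            row.put(fieldNames[i], values[i]);
        }
        return row;
    }

    public static void main(String[] args) {
        // Made-up schema and sample line, for illustration only.
        String[] fieldNames = {"id", "name"};
        System.out.println(toRow(fieldNames, "1,alice"));
        // In a pipeline, the lines would come from a text read over the GCS files,
        // and a DoFn<String, TableRow> would build a TableRow from each line
        // in the same way this method builds a Map.
    }
}
```

All values stay strings here; converting them to the column types of the federated table's schema would be an extra step.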

Michael Sheldon answered Jan 19 '23 00:01