 

Dataflow job fails and tries to create temp_dataset on BigQuery

I'm running a simple dataflow job to read data from a table and write back to another. The job fails with the error:

Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error: Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].

I'm not trying to create any dataset though; it's basically trying to create a temp_dataset because the job fails. But I don't get any information on the real error behind the scenes. The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one. Any idea how to work around this issue?

Here's the code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."

    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)

    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)

    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
Asked Dec 11 '25 by Alex


2 Answers

When using BigQuerySource, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export job from that temporary table and reads the results from the exported files.

So creating this temp_dataset is expected behavior, and it is probably not hiding another error: the service account running the job really does need the bigquery.datasets.create permission in the project.

This is not well documented, but it can be seen in the implementation of BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
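For illustration, here is a rough sketch of those same steps done by hand with the google-cloud-bigquery client. The dataset, table, and bucket names are placeholders, and this is not the SDK's actual code:

from google.cloud import bigquery

client = bigquery.Client(project="prj")

# 1. Create a temporary dataset. This is the step that needs
#    bigquery.datasets.create and that fails in the question.
temp_dataset = client.create_dataset("my_temp_dataset")

# 2. Run the query with a table in the temp dataset as the destination.
temp_table = temp_dataset.table("query_results")
job_config = bigquery.QueryJobConfig(destination=temp_table)
client.query("SELECT ...", job_config=job_config).result()

# 3. Export the temp table to GCS; the SDK then reads the exported files.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
client.extract_table(
    temp_table, "gs://my-bucket/results-*.json",
    job_config=extract_config).result()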

Answered Dec 13 '25 by Cubez


You can specify an existing dataset for the query's temporary results, so the process doesn't need to create a temp dataset. Example with the Java SDK's BigQueryIO:

TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery")
    .withQueryTempDataset("existingDataset")
    .usingStandardSql()
    .withMethod(TypedRead.Method.DEFAULT);
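The question uses the Python SDK, and recent Beam versions expose a similar option there. A minimal sketch, assuming a Beam release whose ReadFromBigQuery accepts the temp_dataset parameter (the dataset name is a placeholder):

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions

# Reuse an existing dataset for the query's temporary results so the
# job no longer needs the bigquery.datasets.create permission.
temp_dataset = bigquery.DatasetReference(
    projectId="prj", datasetId="existing_dataset")

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
        query="SELECT ...",
        use_standard_sql=True,
        temp_dataset=temp_dataset)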
Answered Dec 13 '25 by Dakshin Rajavel


