 

Dataflow job fails and tries to create temp_dataset on BigQuery

I'm running a simple dataflow job to read data from a table and write back to another. The job fails with the error:

Workflow failed. Causes: S01:ReadFromBQ+WriteToBigQuery/WriteToBigQuery/NativeWrite failed., BigQuery creating dataset "_dataflow_temp_dataset_18172136482196219053" in project "[my project]" failed., BigQuery execution failed., Error: Message: Access Denied: Project [my project]: User does not have bigquery.datasets.create permission in project [my project].

I'm not trying to create any dataset though; it's basically trying to create a temp_dataset because the job fails. But I don't get any information on the real error behind the scenes. The reading isn't the issue; it's really the writing step that fails. I don't think it's related to permissions, but my question is more about how to get the real error rather than this one. Any idea how to work around this issue?

Here's the code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, WorkerOptions
from sys import argv

options = PipelineOptions(flags=argv)
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = "prj"
google_cloud_options.job_name = 'test'
google_cloud_options.service_account_email = "mysa"
google_cloud_options.staging_location = 'gs://'
google_cloud_options.temp_location = 'gs://'
options.view_as(StandardOptions).runner = 'DataflowRunner'
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = 'subnet'

with beam.Pipeline(options=options) as p:
    query = "SELECT ..."

    bq_source = beam.io.BigQuerySource(query=query, use_standard_sql=True)

    bq_data = p | "ReadFromBQ" >> beam.io.Read(bq_source)

    table_schema = ...
    bq_data | beam.io.WriteToBigQuery(
        project="prj",
        dataset="test",
        table="test",
        schema=table_schema,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
    )
Asked Dec 11 '25 by Alex


2 Answers

When using BigQuerySource, the SDK creates a temporary dataset and stores the output of the query in a temporary table. It then issues an export job from that temporary table and reads the results from the exported files.

So creating this temp_dataset is expected behavior, and it is probably not hiding another error: the service account running the job really does need the bigquery.datasets.create permission in the project.

This is not well documented, but it can be seen in the implementation of BigQuerySource by following the read call: BigQuerySource.reader() --> BigQueryReader() --> BigQueryReader().__iter__() --> BigQueryWrapper.run_query() --> BigQueryWrapper._start_query_job().
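For illustration, here is a rough sketch of those same steps done by hand with the google-cloud-bigquery client. The dataset, table, and bucket names are placeholders, and this is not the SDK's actual code:

from google.cloud import bigquery

client = bigquery.Client(project="prj")

# 1. Create a temporary dataset. This is the step that needs
#    bigquery.datasets.create and that fails in the question.
temp_dataset = client.create_dataset("my_temp_dataset")

# 2. Run the query with a table in the temp dataset as the destination.
temp_table = temp_dataset.table("query_results")
job_config = bigquery.QueryJobConfig(destination=temp_table)
client.query("SELECT ...", job_config=job_config).result()

# 3. Export the temp table to GCS; the SDK then reads the exported files.
extract_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON)
client.extract_table(
    temp_table, "gs://my-bucket/results-*.json",
    job_config=extract_config).result()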

Answered Dec 13 '25 by Cubez


You can specify an existing dataset for the query's temporary results, so the process doesn't need to create a temp dataset. Example with the Java SDK's BigQueryIO:

TypedRead<TableRow> read = BigQueryIO.readTableRowsWithSchema()
    .fromQuery("selectQuery")
    .withQueryTempDataset("existingDataset")
    .usingStandardSql()
    .withMethod(TypedRead.Method.DEFAULT);
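The question uses the Python SDK, and recent Beam versions expose a similar option there. A minimal sketch, assuming a Beam release whose ReadFromBigQuery accepts the temp_dataset parameter (the dataset name is a placeholder):

import apache_beam as beam
from apache_beam.io.gcp.internal.clients import bigquery
from apache_beam.options.pipeline_options import PipelineOptions

# Reuse an existing dataset for the query's temporary results so the
# job no longer needs the bigquery.datasets.create permission.
temp_dataset = bigquery.DatasetReference(
    projectId="prj", datasetId="existing_dataset")

options = PipelineOptions()
with beam.Pipeline(options=options) as p:
    bq_data = p | "ReadFromBQ" >> beam.io.ReadFromBigQuery(
        query="SELECT ...",
        use_standard_sql=True,
        temp_dataset=temp_dataset)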
Answered Dec 13 '25 by Dakshin Rajavel


