I'm testing a pipeline on a small set of data, and during one of the test runs it suddenly breaks down with this message: Not found: Dataset thijs-dev:nlthijs_ba was not found in location US
INFO:root:Using location 'EU' from table
My run script:
python pipeline/main.py --project thijs-dev --region europe-west4 --runner DataflowRunner --temp_location gs://thijs/dataflow/tmp --staging_location gs://thijs/dataflow/staging --job_name thijspipe --save_main_session --setup_file pipeline/setup.py --autoscaling_algorithm=THROUGHPUT_BASED --max_num_workers=7
My failing step:
thijs = (p | 'ReadTable thijs' >> beam.io.Read(beam.io.BigQuerySource(query=queries.load_code_table(), use_standard_sql=True)))
An example of what my query looks like:
#standardSQL
select
  original.c1,
  original.c2,
  original.c3
from `thijs.tablename` original
inner join (
  select c1, max(c2) as col2
  from `thijs.tablename`
  group by c2
) timejoin
on timejoin.c5 = original.c5 and timejoin.c2 = original.c2
My question is: what exactly is going wrong, and where is this US location coming from?
The error:
RuntimeError: apitools.base.py.exceptions.HttpNotFoundError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/thijs-dev/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Sun, 16 Feb 2020 09:40:10 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '404', 'content-length': '338', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 404, "message": "Not found: Dataset thijs-dev:`nlthijs_ba was not found in location US", "errors": [ { "message": "Not found: Dataset thijs-dev:`nlthijs_ba was not found in location US", "domain": "global", "reason": "notFound" } ], "status": "NOT_FOUND" } } > [while running 'Transform Details Thijs']
[update]
Here you can see that I forced standard SQL by using #standardSQL as the first line of my queries. But somewhere some API is forcing legacy SQL, and I don't know what or where.
RuntimeError: apitools.base.py.exceptions.HttpBadRequestError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/thijs-dev/jobs?alt=json>: response: <{'vary': 'Origin, X-Origin, Referer', 'content-type': 'application/json; charset=UTF-8', 'date': 'Sun, 16 Feb 2020 20:59:12 GMT', 'server': 'ESF', 'cache-control': 'private', 'x-xss-protection': '0', 'x-frame-options': 'SAMEORIGIN', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'status': '400', 'content-length': '354', '-content-encoding': 'gzip'}>, content <{ "error": { "code": 400, "message": "Query text specifies use_legacy_sql:false, while API options specify:true", "errors": [ { "message": "Query text specifies use_legacy_sql:false, while API options specify:true", "domain": "global", "reason": "invalid" } ], "status": "INVALID_ARGUMENT" } } > [while running 'pipeline']
Python SDK 2.16.0 & 2.19.0
When you work with Google BigQuery, it's always recommended to use the complete name of the BigQuery resource. What do I mean? Using the same example from your code:
select
  original.c1,
  original.c2,
  original.c3
from `thijs.tablename` original
inner join (
  select c1, max(c2) as col2
  from `thijs.tablename`
  group by c2
) timejoin
on timejoin.c5 = original.c5 and timejoin.c2 = original.c2
Instead of putting only `thijs.tablename`, you should use `thijs-dev.nlthijs_ba.table_name`.
That way, you make sure you are calling the right resource in BigQuery.
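To illustrate the point, here is a minimal sketch of building that fully qualified reference in Python. The `qualify` helper is hypothetical; the project and dataset names are taken from the question, and `tablename` stands in for the real table:

```python
# Hypothetical helper: build a fully qualified, backtick-quoted
# BigQuery table reference (project.dataset.table) instead of the
# bare dataset.table form, so BigQuery resolves the dataset itself
# rather than guessing a project and defaulting to the US location.

def qualify(project: str, dataset: str, table: str) -> str:
    """Return a backtick-quoted `project.dataset.table` reference."""
    return f"`{project}.{dataset}.{table}`"

# Names from the question; "tablename" is a placeholder.
table_ref = qualify("thijs-dev", "nlthijs_ba", "tablename")
query = f"select original.c1, original.c2, original.c3 from {table_ref} original"
print(query)
```

Building the reference in one place also keeps the project and dataset out of every individual query string.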
Another thing to check is whether you can query that table with a simple query from Python first:
select c1, max(c2) as col2
from `thijs.tablename`
group by c2
If this query fails, you should sort that out first.
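A minimal sketch of such a sanity check, assuming the google-cloud-bigquery client library is installed and default credentials are configured. The dry_run job config asks BigQuery to validate the query without executing it; the project, dataset, and EU location are taken from the question's logs, and the table name is a placeholder:

```python
# Sanity-check the query outside Beam before wiring it into the
# pipeline. A dry-run job validates the query (syntax, table
# resolution, location) and estimates cost without running it.
# Assumes: pip install google-cloud-bigquery, default credentials,
# and that the dataset lives in the EU (as the log line suggests).

QUERY = """\
#standardSQL
select c1, max(c2) as col2
from `thijs-dev.nlthijs_ba.tablename`
group by c2
"""

def dry_run(query: str, project: str = "thijs-dev", location: str = "EU") -> int:
    """Validate the query with a BigQuery dry run; return bytes it would process."""
    # Imported lazily so the module loads even without the library installed.
    from google.cloud import bigquery
    client = bigquery.Client(project=project, location=location)
    job = client.query(query, job_config=bigquery.QueryJobConfig(dry_run=True))
    return job.total_bytes_processed

# Example usage (requires credentials):
# print(dry_run(QUERY))
```

If the dry run already fails with the same "not found in location US" error, the problem is in the query or dataset reference, not in Beam.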