Are there any tools that will validate a JSON string against a BigQuery schema? I'd like to load valid ones to BQ, and re-process invalid ones.
I know that you can validate against a standard JSON Schema using (e.g.) Python's jsonschema; is there something similar for BQ schemas?
Re Pentium10's comment: I can imagine a number of ETL scenarios where data from several sources has to be assembled so that it matches a BQ schema. Currently I need two schemas for the data, a JSON Schema and a BQ schema; I validate against the JSON Schema and hope that this is enough to satisfy the BQ schema on submission.
Specifically: in this situation, I have JSON which has arrived from a JavaScript front end and been stored in BQ as a string. I want to process this field and load it into BQ as a table in its own right, so that I can search it.
The JSON falls (more or less) into two 'schemas', but it is poorly typed: numbers are treated as strings, lists of length 1 arrive as bare strings rather than lists, and so on. I want a quick way to see whether a field would fit into the table, and it seems a little silly that I have a BQ table schema but cannot validate against it; instead, I must also create a JSON Schema for the idealised data and check against that.
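To illustrate the kind of mismatch I mean, a record like {"count": "3", "tags": "a"} should ideally become {"count": 3, "tags": ["a"]} before loading. A minimal coercion sketch (the field names, schema format, and normalise_record helper are all hypothetical, just for illustration):

```python
# Hypothetical sketch: coerce loosely-typed JSON toward a BQ-style schema.
# The schema format and field names here are made up for illustration.
SCHEMA = {
    "count": {"type": "INTEGER", "mode": "NULLABLE"},
    "tags": {"type": "STRING", "mode": "REPEATED"},
}

def normalise_record(record, schema):
    out = {}
    for name, spec in schema.items():
        value = record.get(name)
        if spec["mode"] == "REPEATED" and not isinstance(value, list):
            value = [value]          # length-1 "lists" arrive as bare scalars
        if spec["type"] == "INTEGER":
            value = int(value)       # numbers arrive as strings
        out[name] = value
    return out

print(normalise_record({"count": "3", "tags": "a"}, SCHEMA))
# {'count': 3, 'tags': ['a']}
```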
BigQuery natively supports JSON data using the JSON data type.
To enable schema auto-detection when loading data, use one of these approaches: in the Google Cloud console, in the Schema section, for Auto detect, check the Schema and input parameters option; in the bq command-line tool, use the bq load command with the --autodetect flag.
To load JSON data into a new table: in the Google Cloud console, go to the BigQuery page; in the Explorer pane, expand your project and select a dataset; in the Dataset info section, click Create table.
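As a concrete sketch of that flow (the file name and records are illustrative), you can serialise your records as newline-delimited JSON with the standard library, which is the format BigQuery load jobs expect:

```python
import json

# Illustrative records; in practice these come from your front end.
records = [
    {"id": 1, "payload": {"kind": "click"}},
    {"id": 2, "payload": {"kind": "view"}},
]

# BigQuery load jobs expect newline-delimited JSON (one object per line).
ndjson = "\n".join(json.dumps(r) for r in records)

with open("data.json", "w") as f:
    f.write(ndjson)
```

The resulting file can then be loaded with, for example, `bq load --autodetect --source_format=NEWLINE_DELIMITED_JSON your_dataset.your_table data.json`.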
I would suggest representing your BQ schema as a JSON object in Python; with that, you can retrieve the live table schema via BigQuery's client library and compare the two.
1 - Request the Schema out of a BigQuery Table (should be then dynamically implemented):
from google.cloud import bigquery

# Create a client for your project
client = bigquery.Client(project='your_project')

# Build a reference to the table and fetch its metadata (including the schema)
dataset_ref = client.dataset('your_dataset')
table_ref = dataset_ref.table('your_table_name')
table_helper = client.get_table(table_ref)
2 - Get the schema and format it as JSON; after that you should be able to compare the two schemas.
What you have now is a list of SchemaField objects:
your_schema = table_helper.schema
You can build a list of plain dictionaries from those fields and then dump it to a JSON string:

import json

formatted_schema = [
    {
        "name": field.name,
        "type": field.field_type,
        "mode": field.mode,
        "description": field.description,
        # nested RECORD fields (field.fields) would need the same
        # treatment recursively
    }
    for field in table_helper.schema
]
json_bq_schema = json.dumps(formatted_schema)
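Once the schema is in plain dict form, you can also sketch a rough validator for individual records against it. The type map and record_matches helper below are my own assumptions (not part of the BigQuery library) and only cover a few types:

```python
# Minimal sketch: check one record against a BQ-style schema expressed as
# a list of {"name", "type", "mode"} dicts. The mapping below is an
# assumption and only covers a few BigQuery types.
PYTHON_TYPES = {
    "STRING": str,
    "INTEGER": int,
    "FLOAT": float,
    "BOOLEAN": bool,
}

def record_matches(record, schema_fields):
    for field in schema_fields:
        value = record.get(field["name"])
        if value is None:
            if field["mode"] == "REQUIRED":
                return False
            continue
        expected = PYTHON_TYPES.get(field["type"])
        if field["mode"] == "REPEATED":
            if not isinstance(value, list):
                return False
            if expected and not all(isinstance(v, expected) for v in value):
                return False
        elif expected and not isinstance(value, expected):
            return False
    return True

schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "tags", "type": "STRING", "mode": "REPEATED"},
]
print(record_matches({"id": 1, "tags": ["a", "b"]}, schema))   # True
print(record_matches({"id": "1", "tags": "a"}, schema))        # False
```

Records that pass could be loaded directly, and failures routed to your re-processing path.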
You could then normalise that BQ JSON schema in order to compare it with yours, as discussed here: How to compare two JSON objects with the same elements in a different order equal?
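For the comparison itself, a standard-library approach along those lines is to parse both strings and compare the resulting Python objects, since dict equality in Python ignores key order:

```python
import json

a = '{"name": "id", "type": "INTEGER", "mode": "REQUIRED"}'
b = '{"mode": "REQUIRED", "name": "id", "type": "INTEGER"}'

# dicts compare equal regardless of key order
print(json.loads(a) == json.loads(b))  # True

# equivalently, canonicalise with sort_keys before comparing strings
print(json.dumps(json.loads(a), sort_keys=True) ==
      json.dumps(json.loads(b), sort_keys=True))  # True
```

If the schemas are lists of fields and field order should not matter, sort both lists by field name before comparing.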
I know this is not an easy solution to implement, but if you tweak it well enough it should be robust and solve your problem. Feel free to ask if I can help you more.
For more info about schemas, see https://cloud.google.com/bigquery/docs/schemas