
BigQuery JSON schema validation

Are there any tools that will validate a JSON string against a BigQuery schema? I'd like to load the valid strings into BQ and re-process the invalid ones.

I know that you can validate against a standard JSON Schema using (e.g.) Python's jsonschema. Is there something similar for BQ schemas?


Re Pentium10's comment, I can imagine a number of ETL scenarios where data from several sources has to be assembled so that it matches a BQ schema. Currently I need two schemas for the data, a JSON Schema and a BQ schema: I validate against the JSON Schema and hope that this is enough to satisfy the BQ schema on submission.


Specifically: in this situation, I have JSON which has arrived from a JavaScript front end and has been stored in BQ as a string. I want to process this field and add it to BQ as a table in its own right, so that I can search it.

The JSON (more or less) falls into 2 'schemas', but it is poorly typed (i.e. numbers are treated as strings, lists of length 1 are strings rather than lists, and so on). I want a quick way to see whether a field would go into the table. It seems a little silly that I have a BQ table schema but cannot validate against it; instead, I must also create a JSON Schema for the idealised data and check against that.

Jonathan Miller asked Jul 31 '15 10:07



1 Answer

I would suggest loading your JSON schema as an object in Python; you could then try to check it against the table's schema using BigQuery's client library.

1 - Request the schema of the BigQuery table (in practice the table name should be looked up dynamically):

from google.cloud import bigquery

# Assumes credentials are already configured in the environment.
client = bigquery.Client(project='your_project')
# get_table accepts a fully qualified 'project.dataset.table' ID string.
table_helper = client.get_table('your_project.your_dataset.your_table_name')

2 - Get the schema and format it as JSON; after that you should be able to compare the two schemas.

What you have now is a list of SchemaField objects:

your_schema = table_helper.schema

You could convert that list to plain dicts and then dump it to a JSON string:

import json

# to_api_repr() returns each SchemaField as a plain dict, nested fields included.
formatted_list_schema = [field.to_api_repr() for field in table_helper.schema]

json_bq_schema = json.dumps(formatted_list_schema)

You could then normalise that BigQuery schema JSON so that it can be compared regardless of element order, as they do here: How to compare two JSON objects with the same elements in a different order equal?
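As a rough illustration of that comparison, here is a minimal sketch that ignores field order and key order. It assumes both schemas are lists of plain dicts with "name", "type", optional "mode" and optional nested "fields" keys; the helper names schema_to_map and same_schema and the sample schemas are hypothetical, and type aliases (for example INT64 versus INTEGER) are not reconciled.

```python
def schema_to_map(fields):
    """Index a list of field dicts by name so that list order no longer matters."""
    result = {}
    for field in fields:
        result[field["name"]] = {
            "type": field["type"],
            # Assumption: a field without an explicit mode is treated as NULLABLE.
            "mode": field.get("mode", "NULLABLE"),
            # Recurse into nested RECORD fields, if any.
            "fields": schema_to_map(field.get("fields", [])),
        }
    return result


def same_schema(left, right):
    """True when both schemas describe the same fields, in any order."""
    return schema_to_map(left) == schema_to_map(right)


# Hypothetical schemas: same fields, different ordering.
table_schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "address", "type": "RECORD", "mode": "NULLABLE", "fields": [
        {"name": "city", "type": "STRING", "mode": "NULLABLE"},
        {"name": "zip", "type": "STRING", "mode": "NULLABLE"},
    ]},
]
expected_schema = [
    {"mode": "NULLABLE", "type": "RECORD", "name": "address", "fields": [
        {"name": "zip", "type": "STRING"},
        {"name": "city", "type": "STRING"},
    ]},
    {"type": "INTEGER", "name": "id", "mode": "REQUIRED"},
]

print(same_schema(table_schema, expected_schema))
```

Indexing by field name avoids sorting lists of mixed values, which is where a generic recursive sort can fail in Python 3.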

I know that this is not an easy solution to implement, but I think that with enough tweaking it can be made robust and solve your problem. Feel free to ask if I can help further.

For more information about schemas, see https://cloud.google.com/bigquery/docs/schemas
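If the goal is to check individual records rather than compare schemas, one possible direction is to walk a record against the schema dicts directly. The following is only a sketch under stated assumptions: the schema is a list of plain dicts in the "name"/"type"/"mode"/"fields" shape, only four scalar types plus RECORD are handled, and validate_row, _check_value and the sample data are hypothetical names. BigQuery's own loader may accept or reject values differently from this check.

```python
import json

# Assumption: only these scalar types are covered by the sketch.
SCALAR_CHECKS = {
    "STRING": lambda v: isinstance(v, str),
    "INTEGER": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "FLOAT": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "BOOLEAN": lambda v: isinstance(v, bool),
}


def _check_value(value, field, where):
    """Check one non-null value against one field definition."""
    ftype = field["type"]
    if ftype == "RECORD":
        if not isinstance(value, dict):
            return [f"{where}: expected an object"]
        return validate_row(value, field.get("fields", []), where + ".")
    check = SCALAR_CHECKS.get(ftype)
    if check is None:
        return [f"{where}: type {ftype} not handled by this sketch"]
    if check(value):
        return []
    return [f"{where}: expected {ftype}, got {type(value).__name__}"]


def validate_row(row, fields, path=""):
    """Return a list of problems; an empty list means the row passed."""
    if not isinstance(row, dict):
        return ["<row>: expected an object"]
    known = {f["name"] for f in fields}
    errors = [f"{path}{key}: not in schema" for key in row if key not in known]
    for field in fields:
        where = f"{path}{field['name']}"
        mode = field.get("mode", "NULLABLE")
        value = row.get(field["name"])
        if mode == "REPEATED":
            if value is None:
                continue  # assumption: a missing repeated field counts as empty
            if not isinstance(value, list):
                errors.append(f"{where}: expected a list")
                continue
            for i, item in enumerate(value):
                errors.extend(_check_value(item, field, f"{where}[{i}]"))
        elif value is None:
            if mode == "REQUIRED":
                errors.append(f"{where}: required field is missing or null")
        else:
            errors.extend(_check_value(value, field, where))
    return errors


# Hypothetical schema and records, mirroring the poorly typed data in the question.
schema = [
    {"name": "id", "type": "INTEGER", "mode": "REQUIRED"},
    {"name": "tags", "type": "STRING", "mode": "REPEATED"},
]
good = json.loads('{"id": 7, "tags": ["a"]}')
bad = json.loads('{"id": "7", "tags": "a"}')

print(validate_row(good, schema))
print(validate_row(bad, schema))
```

Rows that come back with an empty list could be loaded, and the rest routed to re-processing together with their error messages.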

Cami Fandino answered Sep 18 '22 02:09