Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery: Create column of JSON datatype

I am trying to load json with the following schema into BigQuery:

{
key_a:value_a,
key_b:{
   key_c:value_c,
   key_d:value_d
  }
key_e:{
   key_f:value_f,
   key_g:value_g
  }
}

The keys under key_e are dynamic, ie in one response key_e will contain key_f and key_g and for another response it will instead contain key_h and key_i. New keys can be created at any time so I cannot create a record with nullable fields for all possible keys.

Instead I want to create a column with JSON datatype that can then be queried using the JSON_EXTRACT() function. I have tried loading key_e as a column with STRING datatype but value_e is analysed as JSON and so fails.

How can I load a section of JSON into a single BigQuery column when there is no JSON datatype?

like image 205
NathanC Avatar asked Jun 06 '16 14:06

NathanC


2 Answers

Having your JSON as a single string column inside BigQuery is definitelly an option. If you have large volume of data this can end up with high query price as all your data will end up in one column and actually querying logic can become quite messy.

If you have luxury of slightly changing your "design" - I would recommend considering below one - here you can employ REPEATED mode

Table schema:

[
  { "name": "key_a",
    "type": "STRING" },
  { "name": "key_b",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      { "name": "key",
        "type": "STRING"},
      { "name": "value",
        "type": "STRING"}
    ]
  },
  { "name": "key_e",
    "type": "RECORD",
    "mode": "REPEATED",
    "fields": [
      { "name": "key",
        "type": "STRING"},
      { "name": "value",
        "type": "STRING"}
    ]
  }
]

Example of JSON to load

{"key_a": "value_a1", "key_b": [{"key": "key_c", "value": "value_c"}, {"key": "key_d", "value": "value_d"}], "key_e": [{"key": "key_f", "value": "value_f"}, {"key": "key_g", "value": "value_g"}]}
{"key_a": "value_a2", "key_b": [{"key": "key_x", "value": "value_x"}, {"key": "key_y", "value": "value_y"}], "key_e": [{"key": "key_h", "value": "value_h"}, {"key": "key_i", "value": "value_i"}]}

Please note: it should be newline delimited JSON so each row must be on one line

like image 124
Mikhail Berlyant Avatar answered Sep 27 '22 23:09

Mikhail Berlyant


You can't do this directly with BigQuery, but you can make it work in two passes:

(1) Import your JSON data as a CSV file with a single string column.

(2) Transform each row to pack your "any-type" field into a string. Write a UDF that takes a string and emits the final set of columns you would like. Append the output of this query to your target table.

Example

I'll start with some JSON:

{"a": 0, "b": "zero", "c": { "woodchuck": "a"}}
{"a": 1, "b": "one", "c": { "chipmunk": "b"}}
{"a": 2, "b": "two", "c": { "squirrel": "c"}}
{"a": 3, "b": "three", "c": { "chinchilla": "d"}}
{"a": 4, "b": "four", "c": { "capybara": "e"}}
{"a": 5, "b": "five", "c": { "housemouse": "f"}}
{"a": 6, "b": "six", "c": { "molerat": "g"}}
{"a": 7, "b": "seven", "c": { "marmot": "h"}}
{"a": 8, "b": "eight", "c": { "badger": "i"}}

Import it into BigQuery as a CSV with a single STRING column (I called it 'blob'). I had to set the delimiter character to something arbitrary and unlikely (thorn -- 'þ') or it tripped over the default ','.

Verify your table imported correctly. You should see your simple one-column schema and the preview should look just like your source file.

Next, we write a query to transform it into your desired shape. For this example, we'd like the following schema:

a (INTEGER)
b (STRING)
c (STRING -- packed JSON)

We can do this with a UDF:

// Map a JSON string column ('blob') => { a (integer), b (string), c (json-string) }
bigquery.defineFunction(
  'extractAndRepack',                // Name of the function exported to SQL
  ['blob'],                          // Names of input columns
  [{'name': 'a', 'type': 'integer'}, // Output schema
   {'name': 'b', 'type': 'string'},
   {'name': 'c', 'type': 'string'}],
  function (row, emit) {
    var parsed = JSON.parse(row.blob);
    var repacked = JSON.stringify(parsed.c);
    emit({a: parsed.a, b: parsed.b, c: repacked});
  }
);

And a corresponding query:

SELECT a, b, c FROM extractAndRepack(JsonAnyKey.raw)

Now you just need to run the query (selecting your desired target table) and you'll have your data in the form you like.

Row a   b       c    
1   0   zero    {"woodchuck":"a"}    
2   1   one     {"chipmunk":"b"}     
3   2   two     {"squirrel":"c"}     
4   3   three   {"chinchilla":"d"}   
5   4   four    {"capybara":"e"}     
6   5   five    {"housemouse":"f"}   
7   6   six     {"molerat":"g"}  
8   7   seven   {"marmot":"h"}   
9   8   eight   {"badger":"i"}   
like image 27
Adam Lydick Avatar answered Sep 28 '22 00:09

Adam Lydick