Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery: Store semi-structured JSON data

I have data which can have varying json keys, I want to store all of this data in bigquery and then explore the available fields later.

My structure will be like so:

[
{id: 1111, data: {a:27, b:62, c: 'string'} },
{id: 2222, data: {a:27, c: 'string'} },
{id: 3333, data: {a:27} },
{id: 4444, data: {a:27, b:62, c:'string'} },
]

I wanted to use a STRUCT type but it seems all the fields need to be declared?

I then want to be able to query and see how often each key appears, and basically run queries over all records with for example a key as though it was in its own column.

Side note: this data is coming from URL query strings, maybe someone thinks it is best to push the full url and the use the functions to run analysis?

like image 854
dendog Avatar asked Jun 09 '26 13:06

dendog


1 Answers

There are two primary methods for storing semi-structured data as you have in your example:

Option #1: Store JSON String

You can store the data field as a JSON string, and then use the JSON_EXTRACT function to pull out the values it can find, and it will return NULL for any value it cannot find.

Since you mentioned needing to do mathematical analysis on the fields, let's do a simple SUM for the values of a and b:

# Creating an example table using the WITH statement, this would not be needed
# for a real table.
WITH records AS (
  SELECT 1111 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
  UNION ALL
  SELECT 2222 AS id, "{\"a\":27, \"c\": \"string\"}" as data
  UNION ALL
  SELECT 3333 AS id, "{\"a\":27}" as data
  UNION ALL
  SELECT 4444 AS id, "{\"a\":27, \"b\":62, \"c\": \"string\"}" as data
)

# Example Query
SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum FROM (
  SELECT id, 
    CAST(JSON_EXTRACT(data, "$.a") AS INT64) AS aValue, # Extract & cast as an INT
    CAST(JSON_EXTRACT(data, "$.b") AS INT64) AS bValue  # Extract & cast as an INT
  FROM records
)

# results
# Row | aSum | bSum
# 1   | 108  | 124

There are some pros and cons to this approach:

Pros

  • The syntax is fairly straight forward
  • Less error prone

Cons

  • Store costs will be slightly higher since you have to store all the characters to serialize to JSON.
  • Queries will run slower than using pure native SQL.

Option #2: Repeated Fields

BigQuery has support for repeated fields, allowing you to take your structure and express it natively in SQL.

Using the same example, here is how we would do that:

## Using a with to create a sample table
WITH records AS (SELECT * FROM UNNEST(ARRAY<STRUCT<id INT64, data ARRAY<STRUCT<key STRING, value STRING>>>>[
  (1111, [("a","27"),("b","62"),("c","string")]),
  (2222, [("a","27"),("c","string")]),
  (3333, [("a","27")]),
  (4444, [("a","27"),("b","62"),("c","string")])
])),
## Using another WITH table to take records and unnest them to be joined later
recordsUnnested AS (
  SELECT id, key, value
  FROM records, UNNEST(records.data) AS keyVals
)

SELECT SUM(aValue) AS aSum, SUM(bValue) AS bSum
FROM (
  SELECT R.id, CAST(RA.value AS INT64) AS aValue, CAST(RB.value AS INT64) AS bValue
  FROM records R
    LEFT JOIN recordsUnnested RA ON R.id = RA.id AND RA.key = "a"
    LEFT JOIN recordsUnnested RB ON R.id = RB.id AND RB.key = "b"
)

# results
# Row | aSum | bSum
# 1   | 108  | 124

As you can see, to perform a similar, it is still rather complex. You also still have to store items like strings and CAST them to other values when necessary, since you cannot mix types in a repeated field.

Pros

  • Store size will be less than JSON
  • Queries will typically execute faster.

Cons

  • The syntax is more complex, not as straight forward

Hope that helps, good luck.

like image 101
Justin Carmony Avatar answered Jun 12 '26 11:06

Justin Carmony