Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extract keys and values from json string in bigquery where there is no specified key in the json document

I have a table in bigquery where I have object and for each object I have some stringified json. In json, an example row look like below:

{
    "ObjectID": "1984931229",
    "indexed_abstract": "{\"IndexLength\":123,\"InvertedIndex\":{\"Twenty-seven\":[0],\"metastatic\":[1,45],\"breast\":[2],\"adenocarcinoma\":[3],\"patients,\":[4]}}" 
}

where inside the indexed_abstract we have an InvertedIndex which contains some keywords and how many times these keywords appeared in the ObjectID.

Now I want to access the stringified json by parsing the json using bigquery and for each ObjectID I want to create a nested field where I have the keyword, the corresponding array and the length of the corresponding array.

For example in this case the output would look like the following:

+------------+----------------+---------------+-------------------+
|  ObjectID  |  keyword.key   | keyword.count | keyword.positions |
+------------+----------------+---------------+-------------------+
| 1984931229 | Twenty-seven   |             1 | [0]               |
|            | metastatic     |             2 | [1,45]            |
|            | breast         |             1 | [2]               |
|            | adenocarcinoma |             1 | [3]               |
|            | patients       |             1 | [4]               |
+------------+----------------+---------------+-------------------+

I understand I could use JSON_EXTRACT function but I am not sure what would be my key inside the inverted index to access the keywords and the arrays corresponding to them.

like image 644
Syed Arefinul Haque Avatar asked Nov 30 '25 01:11

Syed Arefinul Haque


1 Answers

Below is for BigQuery Standard SQL

#standardSQL
SELECT ObjectID, 
  ARRAY(
    SELECT AS STRUCT 
      key, 
      ARRAY_LENGTH(SPLIT(value)) `count`, 
      value positions 
    FROM UNNEST(REGEXP_EXTRACT_ALL(JSON_EXTRACT(indexed_abstract, '$.InvertedIndex'), r'"[^"]+":\[[\d,]*?]')) pair,
    UNNEST([STRUCT(REPLACE(SPLIT(pair, ':')[OFFSET(0)], '"', '') AS key, SPLIT(pair, ':')[OFFSET(1)] AS value)])
  ) keyword
FROM `project.dataset.table`

If to apply to sample data from your question - result is

Row ObjectID    keyword.key     keyword.count   keyword.positions    
1   1984931229  Twenty-seven    1               [0]  
                metastatic      2               [1,45]   
                breast          1               [2]  
                adenocarcinoma  1               [3]  
                patients        1               [4]  

Update on Op's comment - I was wondering if I wanted to make the positions an array (a repeated field), how would I do that?

Change needs to be done in just one line

  SPLIT(REGEXP_REPLACE(value, r'\[|]', '')) positions 
like image 167
Mikhail Berlyant Avatar answered Dec 02 '25 16:12

Mikhail Berlyant