I'm trying to find rows with duplicate fields in an array of structs within a Google BigQuery table, using the new Standard SQL. The data in the table (simplified) where each row looks a bit like this:
{
  "Session": "abc123",
  "Information" [
    {
      "Identifier": "e8d971a4-ef33-4ea1-8627-f1213e4c67dc"
    },
    {
      "Identifier": "1c62813f-7ec4-4968-b18b-d1eb8f4d9d26"
    },
    {
      "Identifier": "e8d971a4-ef33-4ea1-8627-f1213e4c67dc"
    }
  ]
}
My end goal is to display the rows that have Information entities with duplicate Identifier values present. However, most of the queries I attempt get an error message of the following form:
Cannot access field Identifier on a value with type ARRAY<STRUCT<Identifier STRING>>
Is there a way to work with the data inside of a STRUCT within an ARRAY?
Here's my first attempt at a query:
SELECT
  Session,
  Information
FROM
  `events.myevents`
WHERE
  COUNT(DISTINCT Information.Identifier) != ARRAY_LENGTH(Information.Identifier)
LIMIT
  1000
And another using a subquery:
SELECT
  Session,
  Information
FROM (
  SELECT
    Session,
    Information,
    COUNT(DISTINCT Information.Identifier) AS info_count_distinct,
    ARRAY_LENGTH(Information) AS info_count
  FROM
    `events.myevents`
  WHERE
    COUNT(DISTINCT Information.Identifier) != ARRAY_LENGTH(Information.Identifier)
  LIMIT
    1000)
WHERE
  info_count != info_count_distinct
Try below
SELECT Session, Identifier, COUNT(1) AS dups
FROM `events.myevents`, UNNEST(Information)
GROUP BY Session, Identifier
HAVING dups > 1
ORDER BY Session
Should give you what you expect plus number of dups.
Like below (example)   
Session Identifier                              dups     
abc123  e8d971a4-ef33-4ea1-8627-f1213e4c67dc    2    
abc345  1c62813f-7ec4-4968-b18b-d1eb8f4d9d26    3    
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With