How to query and iterate over array of structures in Athena (Presto)?

I have an S3 bucket with 500,000+ JSON records, e.g.:

{
  "userId": "00000000001",
  "profile": {
    "created": 1539469486,
    "userId": "00000000001",
    "primaryApplicant": {
      "totalSavings": 65000,
      "incomes": [
        { "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
      ]
    }
  }
}

I created a new table in Athena:

CREATE EXTERNAL TABLE profiles (  
  userId string,
  profile struct<
    created:int,
    userId:string,
    primaryApplicant:struct<
      totalSavings:int,
      incomes:array<struct<amount:int,incomeType:string,frequency:string>>
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'

I am interested in the incomeType values, e.g. "SALARY", "PENSIONS", "OTHER", etc., and ran this query, changing the value compared against jsonData.incometype each time:

SELECT jsonData
FROM "sampledb"."profiles"
CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)
WHERE jsonData.incometype = 'SALARY'

This worked fine: CROSS JOIN UNNEST flattened the incomes array, so the example record above spans two rows. The only oddity was that the field names came back in lowercase, e.g. a row looked like this:

{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}

Now I have been asked how many users have two or more "SALARY" entries, e.g.:

      "incomes": [
        { "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
      ],

I'm not sure how to go about this.

How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?
Do I have to iterate over the array?
What should the result look like?

378

asked Mar 15 '19 04:03

tea

1 Answer

UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's lambda functions is more straightforward:

SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1

This solution uses FILTER on the profile.primaryApplicant.incomes array to keep only the elements with an incomeType of SALARY, and then CARDINALITY to get the length of that filtered array. Rows where that length is greater than 1 have two or more SALARY entries.
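For comparison, here is a rough sketch of the UNNEST route. It reuses the single-alias UNNEST form from the question and assumes that the top-level userId identifies a user; the aliases p, t, income, multi_salary and users_with_multiple_salaries are made up for illustration. On engine versions where UNNEST expands an array of structs into one column per field, the alias list would have to name each field instead.

```sql
-- Sketch: count users with two or more SALARY incomes using UNNEST.
-- Assumption: the top-level userId column identifies a user.
SELECT COUNT(*) AS users_with_multiple_salaries
FROM (
  SELECT p.userId
  FROM sampledb.profiles AS p
  -- One output row per element of the incomes array.
  CROSS JOIN UNNEST(p.profile.primaryApplicant.incomes) AS t(income)
  WHERE income.incomeType = 'SALARY'
  GROUP BY p.userId
  -- Keep only users with at least two SALARY rows.
  HAVING COUNT(*) >= 2
) AS multi_salary
```

Note that this groups by userId, so if the same userId appeared in more than one record the SALARY entries would be added together, whereas the FILTER query evaluates each record on its own.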

Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.
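As an illustration, the same FILTER expression can be written with all-lowercase identifiers, since identifiers are matched without regard to case. The string literal is a different matter: 'SALARY' is compared as data and remains case sensitive. This sketch also returns one row per matching record rather than a single count, which may be a more useful result shape; salary_count is an illustrative alias.

```sql
-- Sketch: list each record with two or more SALARY incomes and how many it has.
-- Identifiers are written in lowercase on purpose; the literal 'SALARY' must
-- still match the stored value exactly.
SELECT
  userid,
  CARDINALITY(
    FILTER(profile.primaryapplicant.incomes, income -> income.incometype = 'SALARY')
  ) AS salary_count
FROM sampledb.profiles
WHERE CARDINALITY(
    FILTER(profile.primaryapplicant.incomes, income -> income.incometype = 'SALARY')
  ) > 1
```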

184

answered Oct 05 '22 23:10

Theo

Tags: json, sql, amazon-athena, presto
