Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to query and iterate over array of structures in Athena (Presto)?

I have a S3 bucket with 500,000+ json records, eg.

{
  "userId": "00000000001",
  "profile": {
    "created": 1539469486,
    "userId": "00000000001",
    "primaryApplicant": {
      "totalSavings": 65000,
      "incomes": [
        { "amount": 5000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 2000, "incomeType": "OTHER", "frequency": "MONTHLY" }
      ]
    }
  }
}

I created a new table in Athena

CREATE EXTERNAL TABLE profiles (  
  userId string,
  profile struct<
    created:int,
    userId:string,
    primaryApplicant:struct<
      totalSavings:int,
      incomes:array<struct<amount:int,incomeType:string,frequency:string>>,
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( 'ignore.malformed.json' = 'true')
LOCATION 's3://profile-data'

I am interested in the incomeTypes, eg. "SALARY", "PENSIONS", "OTHER", etc.. and ran this query changing jsonData.incometype each time:

SELECT jsonData
FROM "sampledb"."profiles"

CROSS JOIN UNNEST(sampledb.profiles.profile.primaryApplicant.incomes) AS la(jsonData)

WHERE jsonData.incometype='SALARY'

This worked fine with CROSS JOIN UNNEST which flattened the incomes array so that the data example above would span across 2 rows. The only idiosyncratic thing was that CROSS JOIN UNNEST made all the field names lowercase, eg. a row looked like this:

{amount=1520, incometype=SALARY, frequency=FORTNIGHTLY}

Now I have been asked how many users have two or more "SALARY" entries, eg.

      "incomes": [
        { "amount": 3000, "incomeType": "SALARY", "frequency": "FORTNIGHTLY" },
        { "amount": 4000, "incomeType": "SALARY", "frequency": "MONTHLY" }
      ],

I'm not sure how to go about this.

  1. How do I query the array of structures to look for duplicate incomeTypes of "SALARY"?

  2. Do I have to iterate over the array?

  3. What should the result look like?

like image 378
tea Avatar asked Mar 15 '19 04:03

tea


People also ask

How do I run multiple Athena queries?

Open the Amazon Athena console at https://console.aws.amazon.com/athena/ . In the left navigation pane, choose Workflows. In the Execute multiple queries tile, choose Get started. In the Get started dialog box, choose Deploy a sample project, and then choose Continue.

How do you query a map in Athena?

Maps are key-value pairs that consist of data types available in Athena. To create maps, use the MAP operator and pass it two arrays: the first is the column (key) names, and the second is values. All values in the arrays must be of the same type.

Can Athena query across databases?

With the federated query functionality in Athena, you can now run SQL queries across data stored in relational, non-relational, object, and custom data sources and store the results back in Amazon S3 for further analysis.


1 Answers

UNNEST is a very powerful feature, and it's possible to solve this problem using it. However, I think using Presto's Lambda functions is more straight forward:

SELECT COUNT(*)
FROM sampledb.profiles
WHERE CARDINALITY(FILTER(profile.primaryApplicant.incomes, income -> income.incomeType = 'SALARY')) > 1

This solution uses FILTER on the profile.primaryApplicant.incomes array to get only those with an incomeType of SALARY, and then CARDINALITY to extract the length of that result.


Case sensitivity is never easy with SQL engines. In general I think you should not expect them to respect case, and many don't. Athena in particular explicitly converts column names to lower case.

like image 184
Theo Avatar answered Oct 05 '22 23:10

Theo