Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Nested Query Alternatives in AWS Athena

I am running a query that gives a non-overlapping set of first_party_id's - ids that are associated with one third party but not another. This query does not run in Athena, however, giving the error: Correlated queries not yet supported.

Was looking at prestodb docs, https://prestodb.io/docs/current/sql/select.html (Athena is prestodb under the hood), for an alternative to nested queries. The with statement example given doesn't seem to translate well for this not in clause. Wondering what the alternative to a nested query would be - Query below.

SELECT 
         COUNT(DISTINCT i.third_party_id) AS uniques
FROM 
         db.ids i
WHERE
         i.third_party_type = 'cookie_1'
         AND i.first_party_id NOT IN (
           SELECT
             i.first_party_id
           WHERE 
             i.third_party_id = 'cookie_2'
         )
like image 528
pauld Avatar asked Jan 24 '17 15:01

pauld


People also ask

Can we run multiple queries in Athena?

Open the Amazon Athena console at https://console.aws.amazon.com/athena/ . In the left navigation pane, choose Workflows. In the Execute multiple queries tile, choose Get started.

What's difference between S3 query and Athena query?

So Whats the Difference Between S3 Select and Athena? S3 Select is a lightweight solution designed to let you use SQL to perform simple SELECT clauses on a maximum of one file. Amazon Athena is an analytics workhorse that allows you to perform SQL on extremely large datasets spanning many files with great performance.

How many Athena queries can run in parallel?

According to Athena's service limits, it cannot build custom user-defined functions (UDFs), write back to S3, or schedule and automate jobs. Amazon places some restrictions on queries: for example, users can only submit one query at a time and can only run up to five simultaneous queries for each account.

Can Athena query unstructured data?

Amazon Athena can process unstructured, semi-structured, and structured data sets. Examples include CSV, JSON, Avro or columnar data formats such as Apache Parquet and Apache ORC.


1 Answers

There may be a better way to do this - I would be curious to see it too! One way I can think of would be to use an outer join. (I'm not exactly sure about how your data is structured, so forgive the contrived example, but I hope it would translate ok.) How about this?

with 
  a as (select * 
       from (values 
       (1,'cookie_n',10,'cookie_2'), 
       (2,'cookie_n',11,'cookie_1'),
       (3,'cookie_m',12,'cookie_1'),
       (4,'cookie_m',12,'cookie_1'),
       (5,'cookie_q',13,'cookie_1'),
       (6,'cookie_n',13,'cookie_1'),
       (7,'cookie_m',14,'cookie_3')
       ) as db_ids(first_party_id, first_party_type, third_party_id, third_party_type)
      ),
  b as (select first_party_type 
        from a where third_party_type = 'cookie_2'),
  c as (select a.third_party_id, b.first_party_type as exclude_first_party_type 
        from a left join b on a.first_party_type = b.first_party_type 
        where a.third_party_type = 'cookie_1')
select count(distinct third_party_id) from c 
       where exclude_first_party_type is null;

Hope this helps!

like image 138
Jeff G Avatar answered Nov 15 '22 19:11

Jeff G