Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Athena: Query exhausted resources at scale factor

I am running a query like:

SELECT f.*, p.countryName, p.airportName, a.name AS agentName
FROM (
    SELECT 
        f.outboundlegid, 
        f.inboundlegid,
        f.querydatetime,
        cast(f.agent as bigint) as agent,
        cast(f.querydestinationplace as bigint) as querydestinationplace,
        f.queryoutbounddate,
        f.queryinbounddate,
        f.quoteageinminutes,
        f.price
    FROM flights f
    WHERE querydatetime >= '2018-01-02'
    AND querydatetime <= '2019-01-10'
) f
INNER JOIN (
  SELECT airportId, airportName, countryName
  FROM airports
  WHERE countryName IN ('Philippines', 'Indonesia', 'Malaysia', 'Hong Kong', 'Thailand', 'Vietnam')
) p
ON f.querydestinationplace = p.airportId
INNER JOIN agents a
ON f.agent = a.id
ORDER BY f.outboundlegid, f.inboundlegid, f.agent, querydatetime DESC

What's wrong with it? Or how can I optimize it? It gives me

Query exhausted resources at this scale factor

I have a flights table and I want to query for flights inside a specific country

like image 938
Jiew Meng Avatar asked Jan 26 '19 05:01

Jiew Meng


People also ask

How do you make Athena queries faster?

Compress and Split Files You can speed up your queries dramatically by compressing your data, provided that files are splittable or of an optimal size (optimal S3 file size is between 200MB-1GB). Smaller data sizes mean less network traffic between Amazon S3 to Athena.

How many Athena queries can run in parallel?

The Athena service limits allow you to submit up to 20 queries by default. Athena then processes these as soon as possible. What you're seeing is typical behaviour, from my experience. The point of being able to submit 20 is that the queries will be executed as soon as possible.

Does Athena cache query results?

No, Athena doesn't support query caching.


1 Answers

I have been facing this problem since the begining of Athena, the problem is the ORDER BY clause. Athena is just an EMR cluster with hive and prestodb installed. The problem you are facing is: Even if your query is distributed across X numbers of nodes, the ordering phase must be done by just a single node, the master node in this case. So at the end, you can order as much data as memory have the master node.

You can test it by reducing the amount of data the query returns maybe reducing the time range.

like image 114
Roberto Avatar answered Sep 23 '22 00:09

Roberto