I have a table
Name pets
--------------
Andy {dog:2, cat:1, bird:4}
John {tiger:3, elephant:1, fish:2}
Mary {dog:2, pig:2}
I want the find the pet type with maximum count for each person. In the event of tie, duplicate the row for each pet. The result should look like this:
Name max_pet
------------------
Andy bird
John tiger
Mary dog
Mary pig
Currently, I exported the table and do it in python. But I am wondering can I achieve this using Presto/SQL query? Thanks!
There are several ways to do this. One approach is to use UNNEST
to convert the map into rows, with one row per map entry. You can then use the rank()
window function to assign a ranking to the pets for each name, after which you select only the top ranked item(s).
WITH people (name, pets) AS (
VALUES
('Andy', map_from_entries(array[('dog', 2), ('cat', 1), ('bird', 4)])),
('John', map_from_entries(array[('tiger', 3), ('elephant', 1), ('fish', 2)])),
('Mary', map_from_entries(array[('dog', 2), ('pig', 2)]))
)
SELECT name, pet AS max_pet
FROM (
SELECT name, pet, count,
rank() OVER (PARTITION BY name ORDER BY count DESC) rnk
FROM people
CROSS JOIN UNNEST(pets) AS t (pet, count)
)
WHERE rnk = 1;
name | max_pet
------+---------
Andy | bird
John | tiger
Mary | dog
Mary | pig
(4 rows)
Using UNNEST
is simple to understand, but doesn't work well if you need to combine it with other operations, or if you have duplicate names.
Another approach is to convert the map into an array using map_entries()
, use filter()
to select the pet(s) with a count that equals the maximum count, then use transform()
to only return the pet name. At this point, you have an array of the maximum pets. You can then UNNEST
it into multiple rows, or keep it as an array for further processing. filter()
and transform()
utilize a lambda expression which is a Presto specific extension to SQL.
WITH people (name, pets) AS (
VALUES
('Andy', map_from_entries(array[('dog', 2), ('cat', 1), ('bird', 4)])),
('John', map_from_entries(array[('tiger', 3), ('elephant', 1), ('fish', 2)])),
('Mary', map_from_entries(array[('dog', 2), ('pig', 2)]))
)
SELECT
name,
transform(
filter(
map_entries(pets),
e -> e[2] = array_max(map_values(pets))),
e -> e[1]) AS max_pets
FROM people;
name | max_pets
------+------------
Andy | [bird]
John | [tiger]
Mary | [dog, pig]
(3 rows)
For Athena users, where accessing row fields by index is not available (true at least for the 2nd version of the engine, have not checked the 3rd) the second approach from answer can be rewritten with map_keys
and map access by index:
-- sample data
WITH people (name, pets) AS (
VALUES
('Andy', map_from_entries(array[('dog', 2), ('cat', 1), ('bird', 4)])),
('John', map_from_entries(array[('tiger', 3), ('elephant', 1), ('fish', 2)])),
('Mary', map_from_entries(array[('dog', 2), ('pig', 2)]))
)
-- query
SELECT
name,
transform(
filter(
map_keys(pets),
e -> pets[e] = array_max(map_values(pets))), -- we now that e is present in pets
e -> e) AS max_pets
FROM people;
Output:
name | max_pets |
---|---|
Andy | [bird] |
John | [tiger] |
Mary | [dog, pig] |
Or just use map_filter
:
SELECT
name,
map_keys(
map_filter(
pets,
(k,v) -> v = array_max(map_values(pets)))) max_pets
FROM people;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With