I have user data:
user store item cost
1 10 100 5
1 10 101 3
1 11 102 7
2 10 101 3
2 12 103 4
2 12 104 5
I want a table which will tell me for each user how much he bought from each store and how much he bought in total:
user store cost_this_store cost_total
1 10 8 15
1 11 7 15
2 10 3 12
2 12 9 12
I can do this with two group by and a join:
select s.user, s.store, s.cost_this_store, u.cost_total
from (select user, store, sum(cost) as cost_this_store
from my_data
group by user, store) s
join (select user, sum(cost) as cost_total
from my_data
group by user) u
on s.user = u.user
However, this is definitely not how I would do this if I were writing this in any other language (join is clearly avoidable, and the two group by are not independent).
Is it possible to avoid the join in sql?
PS. I need the solution to work in hive.
You can do this with a windowing function... which Hive added support for last year:
select distinct
user,
store,
sum(cost) over (partition by user, store) as cost_this_store,
sum(cost) over (partition by user) as cost_total
from my_data
However, I'd argue that there wasn't anything glaringly wrong with your original implementation. You've essentially got two different sets of data, which you're combining through a JOIN.
The duplication might look like a code smell in a different language, but this isn't necessarily the wrong approach in SQL, and often you'll have to take approaches such as this that duplicate a portion of a query between two intermediate result sets for performance reasons.
SQL Fiddle (SQL Server)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With