I want to compute the Median of y
in sub groups of this simple xy_table
:
x | y --groups--> gid | x | y --medians--> gid | x | y
------- ------------- -------------
0.1 | 4 0.0 | 0.1 | 4 0.0 | 0.1 | 4
0.2 | 3 0.0 | 0.2 | 3 | |
0.7 | 5 1.0 | 0.7 | 5 1.0 | 0.7 | 5
1.5 | 1 2.0 | 1.5 | 1 | |
1.9 | 6 2.0 | 1.9 | 6 | |
2.1 | 5 2.0 | 2.1 | 5 2.0 | 2.1 | 5
2.7 | 1 3.0 | 2.7 | 1 3.0 | 2.7 | 1
In this example every x
is unique and the table is already sorted by x
.
I now want to GROUP BY round(x)
and get the tuple that holds the median of y
in each group.
I can already compute the median for the whole table with this ranking query:
SELECT a.x, a.y FROM xy_table a,xy_table b
WHERE a.y >= b.y
GROUP BY a.x, a.y
HAVING count(*) = (SELECT round((count(*)+1)/2) FROM xy_table)
Output: 0.1, 4.0
But I did not yet succeed writing a query to compute the median for sub groups.
Attention: I do not have a median()
aggregation function available. Please also do not propose solutions with special PARTITION
, RANK
, or QUANTILE
statements (as found in similar but too vendor specific SO questions). I need plain SQL (i.e., compatible to SQLite without median()
function)
Edit: I was actually looking for the Medoid and not the Median.
We use ROW_Number() SQL RANK function to get a unique sequential number for each row in the specified data. It gives the rank one for the first row and then increments the value by one for each row. We get different ranks for the row having similar values as well.
This order of operations implies that you can only use window functions in SELECT and ORDER BY . That is, window functions are not accessible in WHERE , GROUP BY , or HAVING clauses. For this reason, you cannot use any of these functions in WHERE : ROW_NUMBER() , RANK() , DENSE_RANK() , LEAD() , LAG() , or NTILE() .
To get the median we have to use PERCENTILE_CONT(0.5). If you want to define a specific set of rows grouped to get the median, then use the OVER (PARTITION BY) clause. Here I've used PARTITION BY on the column OrderID so as to find the median of unit prices for the order ids.
I suggest doing the computing in your programming language:
for each group:
for each record_in_group:
append y to array
median of array
But if you are stuck with SQLite, you can order each group by y
and select the records in the middle like this http://sqlfiddle.com/#!5/d4c68/55/0:
UPDATE: only bigger "median" value is importand for even nr. of rows, so no avg()
is needed:
select groups.gid,
ids.y median
from (
-- get middle row number in each group (bigger number if even nr. of rows)
-- note the integer divisions and modulo operator
select round(x) gid,
count(*) / 2 + 1 mid_row_right
from xy_table
group by round(x)
) groups
join (
-- for each record get equivalent of
-- row_number() over(partition by gid order by y)
select round(a.x) gid,
a.x,
a.y,
count(*) rownr_by_y
from xy_table a
left join xy_table b
on round(a.x) = round (b.x)
and a.y >= b.y
group by a.x
) ids on ids.gid = groups.gid
where ids.rownr_by_y = groups.mid_row_right
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With