
How do I filter out the top 1% and bottom 1% of data in each group in SQL?

I have a data set that includes PRICE, SUBTYPE, and other columns. Before I use it, I want to remove outliers: rows where the price is ridiculously high or low within each SUBTYPE.

For each SUBTYPE, look at the range of the PRICEs and filter out the extremes. Keep rows that fall between: PRICErange * .01 |KEEP| PRICErange * .99

The query below was provided to me by Martin Smith on Stack Overflow. I have edited this question to include it, so let's start from here.

;WITH CTE
     AS (SELECT *,
                ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
                COUNT(*) OVER (PARTITION BY SUBTYPE) AS Cnt
         FROM   all_resale)
SELECT *
FROM   CTE
WHERE  (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1) / (Cnt - 1) END) BETWEEN 1 AND 99

I'm not sure this is what I need, because I can't tell how many rows it will remove from each end.

Brandon Smith asked Jun 14 '13 12:06




1 Answer

You don't specify exactly how you define the 1 percent and how ties should be handled.

One way is below:

;WITH CTE
     AS (SELECT *,
                ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
                COUNT(*) OVER(PARTITION BY SUBTYPE) AS Cnt
         FROM    all_resale)
SELECT *
FROM   CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN -1)/(Cnt -1) END) BETWEEN 1 AND 99

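This assumes that, within each SUBTYPE, the highest-priced item is at 100%, the lowest-priced one is at 0%, and all others are scaled evenly in between by row position, taking no account of ties. A SUBTYPE with only one row gets NULL from the CASE expression, so that row is filtered out. If you need to take account of ties, look into RANK rather than ROW_NUMBER.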
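To get a feel for how many rows this trims from each end, here is a minimal sketch that runs the same query from Python against an in-memory SQLite database. SQLite (3.25 or later, for window functions), the helper name trimmed_prices, and the sample data are all assumptions made for illustration; the original query targets a real all_resale table with more columns.

```python
import sqlite3

# Same filter as the query above, reduced to two columns for the demo.
QUERY = """
WITH CTE AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
           COUNT(*)     OVER (PARTITION BY SUBTYPE)                AS Cnt
    FROM all_resale
)
SELECT SUBTYPE, PRICE
FROM CTE
WHERE (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1) / (Cnt - 1) END) BETWEEN 1 AND 99
ORDER BY SUBTYPE, PRICE
"""


def trimmed_prices(rows):
    """Load (SUBTYPE, PRICE) rows into a throwaway table and apply the filter.

    Hypothetical helper: the table layout here is an assumption.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE all_resale (SUBTYPE TEXT, PRICE REAL)")
    conn.executemany("INSERT INTO all_resale VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Made-up data: subtype A has 201 distinct prices, subtype B has a single row.
sample = [("A", float(p)) for p in range(1, 202)] + [("B", 10.0)]
kept = trimmed_prices(sample)
print(len(kept), kept[0], kept[-1])  # 197 ('A', 3.0) ('A', 199.0)
```

With 201 rows the percentage for row RN is (RN - 1) / 2, so rows 1 and 2 fall below 1% and rows 200 and 201 fall above 99%: two rows are dropped from each end. The single-row subtype B disappears entirely because its CASE expression is NULL.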

NB: If all of the subtypes have a relatively large number of rows, you could use NTILE(100) instead, but it does not distribute rows between buckets well if the number of rows is small relative to the number of buckets.
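A minimal sketch of the NTILE(100) variant, run from Python against an in-memory SQLite database (3.25 or later). Dropping buckets 1 and 100 is one reading of "use NTILE(100) instead", not something spelled out above; the helper name ntile_trim and the sample data are likewise assumptions for illustration.

```python
import sqlite3

# Assign each row to one of 100 buckets per SUBTYPE, then drop buckets 1 and 100.
QUERY = """
WITH CTE AS (
    SELECT *,
           NTILE(100) OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS Bucket
    FROM all_resale
)
SELECT SUBTYPE, PRICE
FROM CTE
WHERE Bucket BETWEEN 2 AND 99
ORDER BY SUBTYPE, PRICE
"""


def ntile_trim(rows):
    """Hypothetical helper: load (SUBTYPE, PRICE) rows and apply the NTILE filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE all_resale (SUBTYPE TEXT, PRICE REAL)")
    conn.executemany("INSERT INTO all_resale VALUES (?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


# Made-up data: a large group (1000 rows) and a small one (10 rows).
big = [("A", float(p)) for p in range(1, 1001)]
small = [("B", float(p)) for p in range(1, 11)]
kept = ntile_trim(big + small)

kept_a = [price for subtype, price in kept if subtype == "A"]
kept_b = [price for subtype, price in kept if subtype == "B"]
print(len(kept_a), min(kept_a), max(kept_a))  # 980 11.0 990.0
print(len(kept_b), min(kept_b), max(kept_b))  # 9 2.0 10.0
```

In the large group each bucket holds 10 rows, so 10 rows come off each end. In the small group only buckets 1 to 10 are ever used, so the lowest price is dropped but the highest is never trimmed, which is the uneven distribution the note warns about.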

Martin Smith answered Sep 28 '22 16:09