I have a data set that includes PRICE, SUBTYPE, and other columns. I want to remove outliers before I use the dataset: within each SUBTYPE, rows where the price is ridiculously high or low should be dropped.
For each SUBTYPE, look at the range of the PRICEs and filter out the rows at the extremes. Keep rows that fall between: PRICErange * .01 |KEEP| PRICErange * .99
This was provided to me by Martin Smith on Stack Overflow. I edited this question, so let's start from here.
;WITH CTE
     AS (SELECT *,
                ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
                COUNT(*) OVER (PARTITION BY SUBTYPE) AS Cnt
         FROM   all_resale)
SELECT *
FROM   CTE
WHERE  (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1) / (Cnt - 1) END) BETWEEN 1 AND 99
I'm not sure this is what I need to do. I don't know how many rows will be removed off the ends.
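To get a feel for how many rows come off the ends, the filter's arithmetic can be replayed outside the database. This is only a minimal Python sketch, not part of the original query: `kept_row_numbers` is a hypothetical helper that applies `100.0 * (RN - 1) / (Cnt - 1) BETWEEN 1 AND 99` to the row numbers 1..Cnt of a single SUBTYPE.

```python
def kept_row_numbers(cnt):
    """Row numbers (1-based) that survive the BETWEEN 1 AND 99 filter
    for a SUBTYPE that contains `cnt` rows."""
    if cnt <= 1:
        # The CASE expression yields NULL when Cnt = 1, so that row is filtered out.
        return []
    return [rn for rn in range(1, cnt + 1)
            if 1 <= 100.0 * (rn - 1) / (cnt - 1) <= 99]


# Hypothetical subtype sizes, chosen only for illustration.
for cnt in (1, 2, 10, 101, 1000):
    kept = kept_row_numbers(cnt)
    print(cnt, "rows:", len(kept), "kept,", cnt - len(kept), "removed")
```

Under this formula a subtype with 1000 rows loses 10 rows at each end, a subtype with 10 rows loses only its cheapest and its most expensive row, and a subtype with 1 or 2 rows loses everything.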
Top-N queries are typically written using the TOP or LIMIT clause. The problem is that such result sets are limited to the highest values in the whole table, without any grouping. The GROUP BY clause can help with that, but it is limited to a single top result for each group.
To get the top N rows per group, you can use the ROW_NUMBER() function. In OVER(), you specify the groups into which the rows should be divided (PARTITION BY) and the order in which the numbers should be assigned to the rows (ORDER BY). The row numbers are then assigned within each group (for example, within each year).
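What ROW_NUMBER() with PARTITION BY and ORDER BY computes can be sketched in plain Python. This is an illustration only: `row_numbers` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict


def row_numbers(rows, partition_key, order_key):
    """Emulate ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...).
    Returns a list of (row, rn) pairs."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_key]].append(row)
    numbered = []
    for group in groups.values():
        # Numbering restarts at 1 inside every partition.
        ordered = sorted(group, key=lambda r: r[order_key])
        for rn, row in enumerate(ordered, start=1):
            numbered.append((row, rn))
    return numbered


# Hypothetical sample data.
rows = [
    {"SUBTYPE": "condo", "PRICE": 300},
    {"SUBTYPE": "house", "PRICE": 900},
    {"SUBTYPE": "condo", "PRICE": 100},
    {"SUBTYPE": "condo", "PRICE": 200},
    {"SUBTYPE": "house", "PRICE": 500},
]
numbered = row_numbers(rows, "SUBTYPE", "PRICE")
# A "top N per group" filter: keep the two cheapest rows of each subtype.
top2 = [row for row, rn in numbered if rn <= 2]
```

Filtering on the row number afterwards is what turns the numbering into a per-group top N, which neither TOP nor GROUP BY gives you directly.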
After grouping the data, you can filter the grouped records using the HAVING clause, which returns only the groups that match the given condition. You can also sort the grouped records using ORDER BY, which can be applied after GROUP BY to an aggregated column.
To filter records using an aggregate function, use the HAVING clause. Here we calculate an aggregate value: the average price of each product. A product may be sold by more than one grocer, so the average price is calculated per product (in our example, SELECT name, AVG(price)).
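A small, self-contained way to try GROUP BY with HAVING is an in-memory SQLite database driven from Python. The table name `product`, its columns, and the prices below are assumptions made up for this sketch. SQLite's dialect also differs from SQL Server's in places, although this particular query uses only standard syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (name TEXT, grocer TEXT, price REAL)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [
        ("milk", "A", 1.0), ("milk", "B", 2.0),    # hypothetical prices
        ("bread", "A", 3.0), ("bread", "B", 5.0),
    ],
)
# Average price per product, keeping only products whose average exceeds 2.0.
avg_prices = conn.execute(
    "SELECT name, AVG(price) FROM product "
    "GROUP BY name HAVING AVG(price) > 2.0 ORDER BY name"
).fetchall()
conn.close()
print(avg_prices)
```

WHERE could not express this condition, because it is evaluated before the groups and their averages exist.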
You don't specify exactly how you define the 1 percent and how ties should be handled.
One way is below:
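;WITH CTE
     AS (SELECT *,
                -- position of the row within its SUBTYPE, cheapest first
                ROW_NUMBER() OVER (PARTITION BY SUBTYPE ORDER BY PRICE) AS RN,
                -- number of rows in that SUBTYPE
                COUNT(*) OVER (PARTITION BY SUBTYPE) AS Cnt
         FROM   all_resale)
SELECT *
FROM   CTE
-- scale the position to 0..100; a SUBTYPE with a single row yields NULL and is dropped
WHERE  (CASE WHEN Cnt > 1 THEN 100.0 * (RN - 1) / (Cnt - 1) END) BETWEEN 1 AND 99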
That assumes the highest-priced item is 100%, the lowest-priced one is 0%, and all others are scaled evenly in between, taking no account of ties. If you need to take account of ties, look into RANK rather than ROW_NUMBER.
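The difference ties make can be seen by scaling both numberings the same way. The following is a rough Python sketch with made-up prices. `row_number`, `rank`, and `scaled_percent` are hypothetical helpers that mimic ROW_NUMBER(), RANK(), and the `100.0 * (n - 1) / (Cnt - 1)` expression for a single SUBTYPE.

```python
def row_number(prices):
    """ROW_NUMBER() over ascending prices: 1, 2, 3, ... regardless of ties."""
    return list(range(1, len(prices) + 1))


def rank(prices):
    """RANK() over ascending prices: tied prices share the lowest position."""
    ordered = sorted(prices)
    return [ordered.index(p) + 1 for p in ordered]


def scaled_percent(positions):
    """Scale positions to 0..100; assumes more than one row (Cnt > 1)."""
    cnt = len(positions)
    return [100.0 * (p - 1) / (cnt - 1) for p in positions]


prices = [10, 10, 10, 20, 30]  # hypothetical prices within one SUBTYPE
print(scaled_percent(row_number(prices)))
print(scaled_percent(rank(prices)))
```

With ROW_NUMBER, one of the three rows priced 10 lands at 0% and is removed while the other two are kept, and which one goes is arbitrary. With RANK, all three share the same value and are treated alike.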
NB: If all of the subtypes have a relatively large number of rows you could use NTILE(100) instead, but it does not distribute rows between buckets well if the number of rows is small relative to the number of buckets.
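The uneven distribution is easy to reproduce. This Python sketch assumes the usual NTILE behaviour, in which bucket sizes differ by at most one and the larger buckets come first. `ntile` is a hypothetical helper, and the row counts are examples only.

```python
def ntile(n_rows, buckets):
    """Bucket number that NTILE(buckets) assigns to each of n_rows ordered rows."""
    base, extra = divmod(n_rows, buckets)
    result = []
    for b in range(1, buckets + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if b <= extra else 0)
        result.extend([b] * size)
    return result


for n_rows in (10, 150, 1000):
    tiles = ntile(n_rows, 100)
    print(n_rows, "rows: bucket 1 holds", tiles.count(1),
          "and bucket 100 holds", tiles.count(100))
```

With 10 rows, bucket 100 is empty, so dropping buckets 1 and 100 trims only the low end. With 150 rows, two rows come off the bottom but only one off the top. With 1000 rows, the trim is an even 10 rows per end.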