I want to write a stored proc in SQL (MySQL) to compute the average of second and third quartiles.
In other words I have records for measurements for how long it takes for an URL to load. Records are (id,url,time) and they are many measurements for each URL. What I'm trying to do is for each URL remove the lowest and top 25% (i.e. lower and upper quartiles) and compute average of the remaining 25%-75% of loading times. And store this into another table.
I saw some examples for this for MS SQL and seemed to be relatively easy. But I have to use MySQL where :
And I got as far as here :
create procedure G(
IN val VARCHAR(10)
)
Begin
select @cnt:=count(*) from test where a=val;
select @of:= @cnt /4;
SELECT @len:= @cnt/2;
Prepare stmt from 'select * from test where a="a" LIMIT ?,?';
execute stmt using @of, @len;
END;
I can write it in PHP but think in SQL it would have much better overall performance. I will appreciate some help very much.
The avg() function has the following syntax: SELECT AVG( column_name ) FROM table_name; The avg() function can be used with the SELECT query for retrieving data from a table.
Show activity on this post. WITH C AS ( SELECT SalesPersonID, SaleYear, SaleAmt, rk = RANK() OVER (ORDER BY SaleAmt), nr = COUNT(*) OVER () FROM ##Sales ) SELECT SalesPersonID, SaleYear, SaleAmt, PercentileRank = 1.0*(rk-1)/(nr-1) FROM C; What would I need to modify to get this in a quintile range?
AVG() With a DISTINCT Clause To do this in SQL, you have to use a DISTINCT clause. You can see the annual_salary column in our employees table has values that repeat, for instance, 5000 and 10000. If we want SQL to consider these values only once, we use AVG() with a DISTINCT clause.
Look at answer and comment by @Richard aka cyberkiwi in this question:
Select *
from
(
SELECT tbl.*, @counter := @counter +1 counter
FROM (select @counter:=0) initvar, tbl
ORDER BY ordcolumn
) X
where counter >= (25/100 * @counter) and counter <= (75/100 * @counter);
ORDER BY ordcolumn
You can create the quartile values by using IF to set them to zero if in the wrong quartile:
Let's assume, the raw data table is created by
DROP TABLE IF EXISTS `rawdata`;
CREATE TABLE `rawdata` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`url` varchar(250) NOT NULL DEFAULT '',
`time` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `time` (`time`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
(and ofcourse populated).
Let's also assume the quartile table data is created by
DROP TABLE IF EXISTS `quartiles`;
CREATE TABLE `quartiles` (
`url` varchar(250) NOT NULL,
`Q1` float DEFAULT '0',
`Q2` float DEFAULT '0',
`Q3` float DEFAULT '0',
`Q4` float DEFAULT '0',
PRIMARY KEY (`url`),
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
(and left empty).
Then a procedure to populate quartiles from rawdata would look like
DELIMITER ;;
CREATE PROCEDURE `ComputeQuartiles`()
READS SQL DATA
BEGIN
DECLARE numrows int DEFAULT 0;
DECLARE qrows int DEFAULT 0;
DECLARE rownum int DEFAULT 0;
DECLARE done int DEFAULT 0;
DECLARE currenturl VARCHAR(250) CHARACTER SET utf8;
DECLARE Q1,Q2,Q3,Q4 float DEFAULT 0.0;
DECLARE allurls CURSOR FOR SELECT DISTINCT url FROM rawdata;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET currenturl='';
OPEN allurls;
FETCH allurls INTO currenturl;
WHILE currenturl<>'' DO
SELECT COUNT(*) INTO numrows FROM rawdata WHERE url=currenturl;
SET qrows=FLOOR(numrows/4);
if qrows>0 THEN
-- Only session parameters can be recalculated inside a query,
-- so @rownum:=@rownum+1 will work, but rownum:=rownum+1 will not.
SET @rownum=0;
SELECT
SUM(IFNULL(QA,0))/qrows,
SUM(IFNULL(QB,0))/qrows,
SUM(IFNULL(QC,0))/qrows,
SUM(IFNULL(QD,0))/qrows
FROM (
SELECT
if(@rownum<qrows,time,0) AS QA,
if(@rownum>=qrows AND @rownum<2*qrows,time,0) AS QB,
-- the middle 0-3 rows are left out
if(@rownum>=(numrows-2*qrows) AND @rownum<(numrows-qrows),time,0) AS QC,
if(@rownum>=(numrows-qrows),time,0) AS QD,
@rownum:=@rownum+1 AS dummy
FROM rawdata
WHERE url=currenturl ORDER BY time
) AS baseview
INTO Q1,Q2,Q3,Q4
;
REPLACE INTO quartiles values (currenturl,Q1,Q2,Q3,Q4);
END IF;
FETCH allurls INTO currenturl;
END WHILE;
CLOSE allurls;
END ;;
DELIMITER ;
The main points being:
(rowcount % 4) != 0
time
to one of QA-QD, depending on the row number, assigning the other Qx the value 0I tested this with 18432 raw rows, url=concat('http://.../',floor(rand()*10)), time=round(rand()*10000)
on a 8x1.9GHz machine and it finished consistently in 0.50-0.54sec
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With