I need advice on a data model for my use case. I have two parameters to store: A for things of type T, and B for things of type U (which is a set of T's). Let's say every object of type T has two properties, p1 and p2. Then A = (count of T's with p1) / (count of T's with p1 + count of T's with p2),
and B = (A1 + A2 + ... over its set of T's) / (number of T's in U).
Now I have to handle the storage and updating of both A and B whenever an object of type T is added or modified (almost instantly).
I have decided to handle the calculation of A by maintaining a table like (T id, count of p1, count of p2), so that every time a count changes I just update the 2nd or 3rd column and can compute A on the fly. But I am confused about how to optimize the calculation of B. My initial thought was to write a trigger on the above table so that whenever something gets updated, B is recalculated for that U object, but I think that would give very poor performance when I scale up. Any suggestions on what else I might do here?
Example: say U is a city with many blocks (T). Every block has p1 non-veg restaurants and p2 veg restaurants. So A for every block would be p1/(p1+p2), and B for every city would be (A1 + A2 + ...) / count(blocks) in that city. How do I store the initially calculated A and B for all objects such that, as p1 and p2 keep changing, A and B can be updated almost instantly?
Adding some metrics for more clarity on the desired solution:
Latency should be ~100 ms, i.e. A and B should be available within ~100 ms after a p1/p2 change.
Writes will come in spikes: sometimes 100 or 1000 writes arrive at once, sometimes only 3-5.
Using your cities/blocks example, your schema could be something like:
CREATE TABLE cities (
`city_id` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
`country_id` TINYINT UNSIGNED NOT NULL,
`zip` VARCHAR(50) NOT NULL,
`name` VARCHAR(100) NOT NULL,
PRIMARY KEY (`city_id`)
);
CREATE TABLE blocks (
`block_id` MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
`city_id` SMALLINT UNSIGNED NOT NULL,
`p1` SMALLINT UNSIGNED NOT NULL DEFAULT '0',
`p2` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
PRIMARY KEY (`block_id`),
FOREIGN KEY (`city_id`) REFERENCES `cities` (`city_id`)
);
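For illustration, a couple of sample rows (purely made-up values) just to show the shape of the data:
-- hypothetical sample data: one city, two blocks with (non-veg, veg) counts
INSERT INTO cities (country_id, zip, name) VALUES (1, '560001', 'Springfield');
INSERT INTO blocks (city_id, p1, p2) VALUES (1, 12, 30), (1, 5, 45);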
Your query for a given city (city_id = 123) would be:
Query 1
SELECT AVG(p1/(p1+p2)) AS B
FROM blocks b
WHERE b.city_id = 123
Note: AVG(x) = SUM(x) / COUNT(x)
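One edge case to keep in mind: if a block can have p1 + p2 = 0, the expression p1/(p1+p2) is a division by zero, which MySQL typically evaluates to NULL in a SELECT, and AVG() silently skips NULL rows. If that can happen in your data, decide explicitly whether such blocks should be ignored or counted as 0. A possible variant (sketch, counting empty blocks as 0):
-- blocks with no restaurants at all contribute 0 instead of being skipped by AVG()
SELECT AVG(COALESCE(p1/NULLIF(p1+p2, 0), 0)) AS B
FROM blocks b
WHERE b.city_id = 123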
Now, if you worry about performance, you should define some expected numbers: how many cities and blocks do you expect, and how often do they change? Once you have defined these numbers, you can generate some dummy/fake data and run performance tests against it.
Here is an example with 1000 cities and 100K blocks (100 blocks per city on average):
First create a helper table with 100K sequence numbers:
CREATE TABLE IF NOT EXISTS seq100k
SELECT NULL AS seq
FROM information_schema.COLUMNS c1
JOIN information_schema.COLUMNS c2
JOIN information_schema.COLUMNS c3
LIMIT 100000;
ALTER TABLE seq100k CHANGE COLUMN seq seq MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY;
With MariaDB you can use the sequence plugin instead.
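For example, the SEQUENCE engine in MariaDB exposes virtual tables such as seq_1_to_100000, so the helper table can be built without the information_schema cross join (sketch, assuming MariaDB 10.0+):
-- MariaDB only: seq_1_to_100000 is a virtual table provided by the SEQUENCE engine
-- (seq is already numbered 1..100000 here, so the AUTO_INCREMENT trick above is not needed)
CREATE TABLE IF NOT EXISTS seq100k
SELECT seq FROM seq_1_to_100000;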
Generate the data:
DROP TABLE IF EXISTS blocks;
DROP TABLE IF EXISTS cities;
CREATE TABLE cities (
`city_id` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
`country_id` TINYINT UNSIGNED NOT NULL,
`zip` VARCHAR(50) NOT NULL,
`name` VARCHAR(100) NOT NULL,
PRIMARY KEY (`city_id`)
)
SELECT seq AS city_id
, floor(rand(1)*10+1) as country_id
, floor(rand(2)*99999+1) as zip
, rand(3) as name
FROM seq100k
LIMIT 1000;
CREATE TABLE blocks (
`block_id` MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
`city_id` SMALLINT UNSIGNED NOT NULL,
`p1` SMALLINT UNSIGNED NOT NULL DEFAULT '0',
`p2` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
PRIMARY KEY (`block_id`),
FOREIGN KEY (`city_id`) REFERENCES `cities` (`city_id`)
)
SELECT seq AS block_id
, floor(rand(4)*1000+1) as city_id
, floor(rand(5)*11) as p1
, floor(rand(6)*20+1) as p2
FROM seq100k
LIMIT 100000;
Now you can run your queries. Note that I will not use exact runtimes. If you need them to be exact, you should use profiling.
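If you do need exact numbers, one option is MySQL's query profiler (deprecated, but still available; the Performance Schema is the modern alternative):
SET profiling = 1;
-- run the query you want to measure, e.g. Query 1
SELECT AVG(p1/(p1+p2)) AS B FROM blocks b WHERE b.city_id = 123;
SHOW PROFILES;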
Running Query 1, my GUI (HeidiSQL) shows 0.000 sec, which I call "almost instant".
You might want to run a query like:
Query 2
SELECT b.city_id, AVG(p1/(p1+p2)) AS B
FROM blocks b
GROUP BY b.city_id
ORDER BY B DESC
LIMIT 10
HeidiSQL shows 0.078 sec.
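To see where the time goes, you can inspect the execution plan (the exact output varies by MySQL version; after adding the covering index in the next step, the Extra column should typically include "Using index"):
EXPLAIN
SELECT b.city_id, AVG(p1/(p1+p2)) AS B
FROM blocks b
GROUP BY b.city_id
ORDER BY B DESC
LIMIT 10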
Using a covering index
ALTER TABLE `blocks`
DROP INDEX `city_id`,
ADD INDEX `city_id` (`city_id`, `p1`, `p2`);
you can decrease the runtime to 0.031 sec. If that isn't fast enough, you should think about some caching strategies. One way (besides caching at the application level) is to use triggers to manage a new column in the cities table (let's just call it B):
ALTER TABLE `cities` ADD COLUMN `B` FLOAT NULL DEFAULT NULL AFTER `name`;
Define the update trigger:
DROP TRIGGER IF EXISTS `blocks_after_update`;
DELIMITER //
CREATE TRIGGER `blocks_after_update` AFTER UPDATE ON `blocks` FOR EACH ROW BEGIN
if new.p1 <> old.p1 or new.p2 <> old.p2 then
update cities c
set c.B = (
select avg(p1/(p1+p2))
from blocks b
where b.city_id = new.city_id
)
where c.city_id = new.city_id;
end if;
END//
DELIMITER ;
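Since the trigger only keeps B current from now on, a one-time backfill after adding the column makes sure existing cities get a value (same pattern as the block_count initialization further down):
UPDATE cities c SET c.B = (
SELECT AVG(p1/(p1+p2))
FROM blocks b
WHERE b.city_id = c.city_id
)
WHERE 1=1;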
Update test:
Query 3
UPDATE blocks b SET p2 = p2 + 100 WHERE 1=1;
UPDATE blocks b SET p2 = p2 - 100 WHERE 1=1;
This query runs in 2.500 sec without the trigger and 60 sec with the trigger. That might look like a lot of overhead, but consider that we are updating 100K rows twice, which means an average of 60K msec / 200K updates = 0.3 msec per update.
And now you can get the same result from Query 2 with
Query 4
SELECT c.city_id, c.B
FROM cities c
ORDER BY c.B DESC
LIMIT 10
"almost instantly" (0.000 sec).
You can still optimize the trigger if you need to, by using an additional column block_count in the cities table (which also needs to be managed with triggers).
Add column:
ALTER TABLE `cities`
ADD COLUMN `block_count` MEDIUMINT UNSIGNED NOT NULL DEFAULT '0' AFTER `B`;
Init data:
UPDATE cities c SET c.block_count = (
SELECT COUNT(*)
FROM blocks b
WHERE b.city_id = c.city_id
)
WHERE 1=1;
Rewrite the trigger:
DROP TRIGGER IF EXISTS `blocks_after_update`;
DELIMITER //
CREATE TRIGGER `blocks_after_update` AFTER UPDATE ON `blocks` FOR EACH ROW BEGIN
declare old_A, new_A double;
if new.p1 <> old.p1 or new.p2 <> old.p2 then
set old_A = old.p1/(old.p1+old.p2);
set new_A = new.p1/(new.p1+new.p2);
update cities c
set c.B = (c.B * c.block_count - old_A + new_A) / c.block_count
where c.city_id = new.city_id;
end if;
END//
DELIMITER ;
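If you want to convince yourself that the incremental formula stays in sync, a quick spot check is to compare the maintained value with a fresh computation (the two columns should only differ by floating-point rounding):
SELECT c.city_id,
c.B AS B_incremental,
(SELECT AVG(p1/(p1+p2)) FROM blocks b WHERE b.city_id = c.city_id) AS B_recomputed
FROM cities c
LIMIT 10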
With this trigger, Query 3 now runs in 8.5 sec. This means an overhead of 0.03 msec per update.
Note that you will also need to define the INSERT and DELETE triggers, and you will need to add more logic (e.g. to handle changes of city_id on updates). But it's also possible that you will not need any trigger at all.
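For illustration, a minimal sketch of how an INSERT trigger could look, following the same incremental idea (the trigger name is just an example; the DELETE counterpart would mirror it by subtracting, and changes of city_id on UPDATE are not handled here):
DELIMITER //
CREATE TRIGGER `blocks_after_insert` AFTER INSERT ON `blocks` FOR EACH ROW BEGIN
-- add the new block's A to the running average, then bump the block count
update cities c
set c.B = (coalesce(c.B, 0) * c.block_count + new.p1/(new.p1+new.p2)) / (c.block_count + 1),
c.block_count = c.block_count + 1
where c.city_id = new.city_id;
END//
DELIMITER ;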
You can also use a materialized view (a concept from Postgres); it doesn't exist in MySQL, but you can use a table for that:
CREATE TABLE analyzeVeg
SELECT b.city_id, AVG(p1/(p1+p2)) AS B
FROM blocks b
GROUP BY b.city_id;
With 1000 cities and 100K blocks it took me about 200 ms to create the table, and the query
select * from analyzeVeg
takes around 1 ms.
You can either refresh the data in a trigger, or in the application logic with:
UPDATE analyzeVeg a
SET B = (SELECT AVG(p1/(p1+p2)) FROM blocks b WHERE b.city_id = a.city_id)
WHERE a.city_id IN (SELECT city_id FROM blocks WHERE block_id IN ( :updated_block_ids ))
This took me 19 ms to run.
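If you prefer the trigger route, here is a sketch of an UPDATE trigger that refreshes the affected row of analyzeVeg (hypothetical trigger name; INSERT and DELETE counterparts would be analogous):
DELIMITER //
CREATE TRIGGER `blocks_after_update_mv` AFTER UPDATE ON `blocks` FOR EACH ROW BEGIN
-- recompute B only for the city whose block changed
UPDATE analyzeVeg a
SET a.B = (SELECT AVG(p1/(p1+p2)) FROM blocks b WHERE b.city_id = new.city_id)
WHERE a.city_id = new.city_id;
END//
DELIMITER ;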