Need advice on a data model

Need advice on a data model for my use case. I have two parameters to store: A for things of type T, and B for things of type U (where a U is a set of T's). Let's say every object of type T has two properties, p1 and p2 (both counts). Then:

A = p1 / (p1 + p2)

B = (A1 + A2 + …) / (number of T's in U), i.e. the average of A over U's set of T's.

Now I have to handle the storage and updating of both A and B whenever an object of type T is added or modified (almost instantly).

I have decided to tackle the calculation of A as follows: maintain a table like (T id, count of p1, count of p2), so that every time a number changes I just update the 2nd or 3rd column and can calculate A on the fly. But I am confused about how to optimize the calculation of B. My initial thought was to write a trigger on the above table so that whenever something gets updated, B is recalculated for that U object, but I think that would give very poor performance when I scale up. Any suggestions on what else I might do here?

Example: say U is a city with many blocks (T). Every block has, say, p1 non-veg restaurants and p2 veg restaurants. So A for every block would be p1/(p1+p2), and B for every city would be (A1 + A2 + …) / count(blocks) in that city. How do I store the initially calculated A and B for all the objects such that, as p1 and p2 keep changing, I can update A and B almost instantly?

Adding some metrics to give more clarity on the desired solution:

  1. I already have 100k blocks and close to 1000 cities, and these numbers will rise in the future. My requirement is: once I have calculated A and B for all the existing data, an update to p1 or p2 produces some change, say 'deltaA', and this 'deltaA' should be cheaply applied to 'A' rather than recalculating A from scratch (similarly for B). Is there a data model that can support this?
  2. Latency should be ~100 ms, i.e. A and B should be available that soon after p1/p2 change.
  3. Writes will come in spikes: sometimes 100 or 1000 writes arrive simultaneously, sometimes only 3-5.

Bhargav asked May 08 '17 13:05

2 Answers

Using your cities/blocks example, your schema could be something like:

CREATE TABLE cities (
    `city_id` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `country_id` TINYINT UNSIGNED NOT NULL,
    `zip` VARCHAR(50) NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    PRIMARY KEY (`city_id`)
);

CREATE TABLE blocks (
    `block_id` MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `city_id` SMALLINT UNSIGNED NOT NULL,
    `p1` SMALLINT UNSIGNED NOT NULL DEFAULT '0',
    `p2` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
    PRIMARY KEY (`block_id`),
    FOREIGN KEY (`city_id`) REFERENCES `cities` (`city_id`)
);
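To make the definitions concrete, here is some hypothetical sample data (values invented for illustration):

-- Two blocks in one city. Per-block A: 3/10 = 0.3 and 5/10 = 0.5;
-- B for the city: (0.3 + 0.5) / 2 = 0.4
INSERT INTO cities (country_id, zip, name) VALUES (1, '560001', 'Some City');
INSERT INTO blocks (city_id, p1, p2) VALUES (1, 3, 7), (1, 5, 5);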

Your query for a given city (city_id = 123) would be:

Query 1

SELECT AVG(p1/(p1+p2)) AS B
FROM blocks b
WHERE b.city_id = 123

Note: AVG(x) = SUM(x) / COUNT(x)

Now, if you are worried about performance, you should define some expected numbers:

  • Number of cities
  • (Average) Number of blocks per city
  • Hardware you will/can use
  • Queries you will usually run
  • Number of queries per hour/minute/sec

Once you have defined these numbers, you can generate some dummy/fake data and run performance tests against it.

Here is an example with 1000 cities and 100K blocks (100 blocks per city on average):

First create a helper table with 100K sequence numbers:

-- Cross-join information_schema.COLUMNS with itself to get at least 100K rows;
-- the ALTER below turns the NULL column into an AUTO_INCREMENT sequence 1..100000.
CREATE TABLE IF NOT EXISTS seq100k
    SELECT NULL AS seq
    FROM information_schema.COLUMNS c1
    JOIN information_schema.COLUMNS c2
    JOIN information_schema.COLUMNS c3
    LIMIT 100000;
ALTER TABLE seq100k CHANGE COLUMN seq seq MEDIUMINT UNSIGNED AUTO_INCREMENT PRIMARY KEY;

With MariaDB you can use the sequence plugin instead.
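For reference, a minimal sketch of that variant (assuming the SEQUENCE engine is available, which exposes virtual tables like seq_1_to_100000):

-- MariaDB: seq_1_to_100000 is a virtual table containing 1..100000
CREATE TABLE IF NOT EXISTS seq100k
    SELECT seq FROM seq_1_to_100000;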

Generate the data (rand(n) is called with a seed, so the dummy data is reproducible):

DROP TABLE IF EXISTS blocks;
DROP TABLE IF EXISTS cities;
CREATE TABLE cities (
    `city_id` SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `country_id` TINYINT UNSIGNED NOT NULL,
    `zip` VARCHAR(50) NOT NULL,
    `name` VARCHAR(100) NOT NULL,
    PRIMARY KEY (`city_id`)
)
    SELECT seq AS city_id
         , floor(rand(1)*10+1) as country_id
         , floor(rand(2)*99999+1) as zip
         , rand(3) as name
    FROM seq100k
    LIMIT 1000;

CREATE TABLE blocks (
    `block_id` MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
    `city_id` SMALLINT UNSIGNED NOT NULL,
    `p1` SMALLINT UNSIGNED NOT NULL DEFAULT '0',
    `p2` SMALLINT UNSIGNED NOT NULL DEFAULT '1',
    PRIMARY KEY (`block_id`),
    FOREIGN KEY (`city_id`) REFERENCES `cities` (`city_id`)
)
    SELECT seq AS block_id
         , floor(rand(4)*1000+1) as city_id
         , floor(rand(5)*11) as p1
         , floor(rand(6)*20+1) as p2
    FROM seq100k
    LIMIT 100000;
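A quick sanity check on the generated data (a hypothetical helper query, just to confirm the targets before benchmarking):

SELECT (SELECT COUNT(*) FROM cities) AS cities
     , (SELECT COUNT(*) FROM blocks) AS blocks
     , (SELECT COUNT(*) FROM blocks) / (SELECT COUNT(*) FROM cities) AS avg_blocks_per_city;
-- Expected: 1000 cities, 100000 blocks, ~100 blocks per city on average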

Now you can run your queries. Note that I will not use exact runtimes. If you need them to be exact, you should use profiling.

Running Query 1, my GUI (HeidiSQL) shows 0.000 sec, which I call "almost instant".

You might want to run a query like:

Query 2

SELECT b.city_id, AVG(p1/(p1+p2)) AS B
FROM blocks b
GROUP BY b.city_id
ORDER BY B DESC
LIMIT 10

HeidiSQL shows 0.078 sec.

Using a covering index

-- Replace the index MySQL created implicitly for the FOREIGN KEY
-- with one that also covers p1 and p2.
ALTER TABLE `blocks`
    DROP INDEX `city_id`,
    ADD INDEX `city_id` (`city_id`, `p1`, `p2`);

you can decrease the runtime to 0.031 sec. If that isn't fast enough, you should think about caching strategies. One way (besides caching at the application level) is to use triggers to maintain a new column in the cities table (let's just call it B):

ALTER TABLE `cities` ADD COLUMN `B` FLOAT NULL DEFAULT NULL AFTER `name`;

Define the update trigger:

DROP TRIGGER IF EXISTS `blocks_after_update`;
DELIMITER //
CREATE TRIGGER `blocks_after_update` AFTER UPDATE ON `blocks` FOR EACH ROW BEGIN
    -- Recalculate the city's B whenever a block's p1 or p2 changed.
    if new.p1 <> old.p1 or new.p2 <> old.p2 then
        update cities c
        set c.B = (
            select avg(p1/(p1+p2))
            from blocks b
            where b.city_id = new.city_id
        )
        where c.city_id = new.city_id;
    end if;
END//
DELIMITER ;

Update test:

Query 3

UPDATE blocks b SET p2 = p2 + 100 WHERE 1=1;
UPDATE blocks b SET p2 = p2 - 100 WHERE 1=1;

These queries run in 2.500 sec without the trigger and 60 sec with the trigger. That might look like a lot of overhead, but consider that we are updating 100K rows twice, so the average is 60,000 msec / 200,000 updates = 0.3 msec per update.

And now you can get the same result from Query 2 with

Query 4

SELECT c.city_id, c.B
FROM cities c
ORDER BY c.B DESC
LIMIT 10

"almost instantly" (0.000 sec).

You can still optimize the trigger if you need to, by using an additional column block_count in the cities table (which also needs to be maintained with triggers).

Add column:

ALTER TABLE `cities`
    ADD COLUMN `block_count` MEDIUMINT UNSIGNED NOT NULL DEFAULT '0' AFTER `B`;

Init data:

UPDATE cities c SET c.block_count = (
    SELECT COUNT(*)
    FROM blocks b
    WHERE b.city_id = c.city_id
)
WHERE 1=1;

Rewrite the trigger:

DROP TRIGGER IF EXISTS `blocks_after_update`;
DELIMITER //
CREATE TRIGGER `blocks_after_update` AFTER UPDATE ON `blocks` FOR EACH ROW BEGIN
    declare old_A, new_A double;
    if new.p1 <> old.p1 or new.p2 <> old.p2 then
        set old_A = old.p1/(old.p1+old.p2);
        set new_A = new.p1/(new.p1+new.p2);
        -- B is the average over block_count values of A, so replacing one A
        -- can be done incrementally: subtract old_A from the sum, add new_A.
        update cities c
            set c.B = (c.B * c.block_count - old_A + new_A) / c.block_count
            where c.city_id = new.city_id;
    end if;
END//
DELIMITER ;

With this trigger, Query 3 now runs in 8.5 sec; the extra 6 sec spread over 200,000 updates means an overhead of 0.03 msec per update.

Note that you will also need to define INSERT and DELETE triggers, and you will need more logic (e.g. to handle changes of city_id on updates). But it's also possible that you won't need any triggers at all.
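For completeness, here is a minimal sketch of what those INSERT and DELETE counterparts might look like in the same incremental style (a sketch only; it assumes block_count is accurate and ignores city_id changes):

DROP TRIGGER IF EXISTS `blocks_after_insert`;
DROP TRIGGER IF EXISTS `blocks_after_delete`;
DELIMITER //
CREATE TRIGGER `blocks_after_insert` AFTER INSERT ON `blocks` FOR EACH ROW BEGIN
    -- Fold the new block's A into the running average, then bump the count.
    -- MySQL applies SET assignments left to right, so c.B still sees the
    -- old c.block_count here.
    update cities c
        set c.B = (coalesce(c.B, 0) * c.block_count + new.p1/(new.p1+new.p2))
                  / (c.block_count + 1),
            c.block_count = c.block_count + 1
        where c.city_id = new.city_id;
END//
CREATE TRIGGER `blocks_after_delete` AFTER DELETE ON `blocks` FOR EACH ROW BEGIN
    -- Remove the deleted block's A from the running average, then decrement.
    update cities c
        set c.B = case when c.block_count > 1
                       then (c.B * c.block_count - old.p1/(old.p1+old.p2))
                            / (c.block_count - 1)
                       else null end,
            c.block_count = c.block_count - 1
        where c.city_id = old.city_id;
END//
DELIMITER ;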

Paul Spiegel answered Nov 16 '22 02:11

You can also use a materialized view (a Postgres concept). It doesn't exist in MySQL, but you can use a table for that:

CREATE TABLE analyzeVeg 
SELECT b.city_id, AVG(p1/(p1+p2)) AS B
FROM blocks b
GROUP BY b.city_id;

With 1000 cities and 100,000 blocks this took me about 200 ms to create, and the query

select * from analyzeVeg

takes around 1 ms.

You can refresh the data either in a trigger or in the application logic with:

UPDATE analyzeVeg a
SET B = (
    SELECT AVG(p1/(p1+p2))
    FROM blocks b
    WHERE b.city_id = a.city_id
)
WHERE a.city_id IN (
    SELECT city_id FROM blocks WHERE block_id IN (:updated_block_ids)
);

This update took me 19 ms.
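For the trigger variant, a minimal sketch (an assumption on my part, reusing the blocks schema from the other answer):

DROP TRIGGER IF EXISTS `blocks_refresh_mv`;
DELIMITER //
CREATE TRIGGER `blocks_refresh_mv` AFTER UPDATE ON `blocks` FOR EACH ROW BEGIN
    -- Recompute the materialized row for the affected city only.
    UPDATE analyzeVeg a
    SET a.B = (
        SELECT AVG(p1/(p1+p2))
        FROM blocks b
        WHERE b.city_id = new.city_id
    )
    WHERE a.city_id = new.city_id;
END//
DELIMITER ;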

BacLuc answered Nov 16 '22 03:11