I have a mysql db which has around 150 millions inserts per day and retention period is around 60 days.
My main use case is follows:
Run some bulk queries. eg.:
Select (*) from table where prop=val1 and prop2=val2 etc
Will returns large no of records eg. 1M
Is following approach good:
Is this a good solution?
EDIT : I plan to use Amazon RDS DB and found this in their documentation:
Q: Can my Read Replicas only accept database read operations?
Read Replicas are designed to serve read traffic. However, there may be use cases where advanced users wish to complete Data Definition Language (DDL) SQL statements against a Read Replica. Examples might include adding a database index to a Read Replica that is used for business reporting, without adding the same index to the corresponding source DB Instance. If you wish to enable operations other than reads for a given Read Replica, you will need to modify the active DB Parameter Group for the Read Replica, setting the “read_only” parameter to “0.”
Sign in to the AWS Management Console and open the Amazon RDS console at https://console.aws.amazon.com/rds/ . In the navigation pane, choose Databases. Choose the MariaDB, MySQL, Oracle, or PostgreSQL DB instance that you want to use as the source for a read replica. For Actions, choose Create read replica.
Benefits of Read Replica Read Replica helps in decreasing load on the primary DB by serving read-only traffic. A Read Replica can be manually promoted as a standalone database instance. You can create Read Replicas within AZ, Cross-AZ or Cross-Region.
You can now create a second-tier Read Replica from an existing first-tier Read Replica. By creating a second-tier Read Replica, you may be able to move some of the replication load from the master database instance to a first-tier Read Replica.
Replication enables data from one MySQL database server (known as a source) to be copied to one or more MySQL database servers (known as replicas). Replication is asynchronous by default; replicas do not need to be connected permanently to receive updates from a source.
To answer your question:
Is following approach good:
- Have a master DB with index on id only. Have a retention of 60 days.
- Have Read Replica DB. This DB will be indexed on many columns
- All bulk queries will be run against read replica DB.
Is this a good solution?
Updated
In my opinion and experience, No.
Technically, this solution may work, but practically not suitable for production use. The built in master-slave replication
of mysql, works only if the table in the slave database has the same layout as the table in the master database.
You will have approximately 9 billion records (150 x 60). My estimate is on disk this could take up to 1TB (each record the size of a tweet). 150 million inserts and 150 million deletes (of expired records) will surely make indexes fragmented and inserts
slower, requiring re build frequently.
Things will get incrementally more complicated when you need more than one read replica, a natural evolution of the ecosystem.
If you have 150 million inserts a day, you should consider a NOSQL
database. Mongodb used to support Innodb
as well, not sure if it still does.
If you wish to stick to an RDBMS like MySQL
, you should use strategy such as Database Sharding. In this strategy, you segment your data in such a way that the load gets distributed across a cluster of MySQL instances.
A slightly less scalable than Sharding is to use a storage engine such as MyISAM. MyISAM is not fully ACID compliant but offers great performance. It supports concurrent inserts.
Consider using Fastbit if your primary use is SELECT * with no joins and multiple filters on different columns. Fastbit implements WAH compressed bitmaps that can be evaluated very efficiently and stores data as a column store.
https://sdm.lbl.gov/fastbit/
For MySQL, perhaps consider TokuDB which has 'clustered' index support, or creating covering indexes in InnoDB. This is really only effective if you have a small combination of attributes to filter on. If not, consider fastbit.
If you always filter on the same attributes, then you can consider using Flexviews: http://flexvie.ws
You could create a view for select * from table where val1=X and val2=Y
or just roll your own version. after loading data do: replace into summary_table_v2v2 select * from table where val1=X and val2=Y and table.last_update > NOW()-INTERVAL 1 DAY;
That will "refresh" the table with any changes made in the last day, assuming last_update is a timestamp column.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With