MySQL performance: multiple tables vs. index on single table and partitions

Tags: performance, indexing, mysql

I am wondering what is more efficient and faster in performance:
Having an index on one big table or multiple smaller tables without indexes?

Since this is a pretty abstract problem let me make it more practical:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
The most common operations are inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_ids).

So far I have an INDEX on user_id, and the query looks something like this:

SELECT * FROM statistics WHERE user_id = 1

Now, with more and more rows the table gets slower and slower. INSERT statements slow down because the INDEX gets bigger and bigger; SELECT statements slow down, well, because there are more rows to search through.

Now I was wondering why not have one statistics table for each user and change the query syntax to something like this instead:

SELECT * FROM statistics_1

where 1 represents the user_id obviously.
This way, no INDEX is needed and there is far less data in each table, so INSERT and SELECT statements should be much faster.

Now my questions again:
Are there any real-world disadvantages to handling so many tables (in my case 20,000) instead of using one table with an INDEX?
Would my approach actually speed things up or might the lookup for the table eventually slow down things more than everything?


asked May 23 '13 18:05

Horen

1 Answer

Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.

I called this syndrome Metadata Tribbles in my book SQL Antipatterns. You see this happen every time you plan to create a "table per X" or a "column per X".

This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.

There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?

Instead, I'd recommend you use MySQL Partitioning.

Here's an example of partitioning the table:

CREATE TABLE statistics (
  id INT AUTO_INCREMENT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;

This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.

For example, when you run a query like the one in your question, MySQL accesses only the partition containing the specific user_id:

mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: statistics
   partitions: p1    <--- this shows it touches only one partition
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 2
        Extra: Using where; Using index

The HASH method of partitioning means that the rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id's map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.
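To make the modulus mapping concrete, here is a minimal Python sketch (an illustration, not MySQL code). It assumes non-negative integer keys and MySQL's default partition names p0, p1, and so on; the function name hash_partition is made up for this example.

```python
def hash_partition(user_id: int, num_partitions: int) -> str:
    """Return the default partition name for an integer key under
    modulus hashing (assumes a non-negative user_id)."""
    return f"p{user_id % num_partitions}"


# With 101 partitions, user_id 1 lands in p1, which matches the
# "partitions: p1" line in the EXPLAIN output above.
print(hash_partition(1, 101))

# Different users can share a partition: 102 % 101 is also 1.
print(hash_partition(102, 101))
```

Because the partition is computed from the key alone, a new user never requires a new partition; the row simply falls into one of the existing ones.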

You can choose any number of partitions up to 1024 (or 8192 as of MySQL 5.6.7), but some people have reported performance problems when they go that high.

It is recommended to use a prime number of partitions. In case your user_id values follow a pattern (like using only even numbers), using a prime number of partitions helps distribute the data more evenly.
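A rough way to see why a prime count helps is to simulate the modulus on patterned keys. The Python sketch below is only an illustration under an assumed pattern (20,000 users whose IDs are all even); the helper name partition_counts and the ID range are hypothetical.

```python
from collections import Counter


def partition_counts(user_ids, num_partitions):
    """Count how many keys land in each partition under modulus hashing."""
    return Counter(uid % num_partitions for uid in user_ids)


# Assumed pattern: 20,000 users with even IDs only (2, 4, ..., 40000).
even_ids = range(2, 40002, 2)

for n in (100, 101):
    counts = partition_counts(even_ids, n)
    print(f"{n} partitions: {len(counts)} in use, "
          f"largest holds {max(counts.values())} users")
```

With 100 partitions, even IDs can only produce even remainders, so half of the partitions stay empty and the rest hold twice their share. With 101 partitions, the same IDs spread over all of them.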

Re your questions in comment:

> How could I determine a reasonable number of partitions?

For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so if you use this partitioning, you would have only 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well -- each partition will have its own index, and it will be only 1% as large as the index on the whole unpartitioned table would be.

So the answer to how you can determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?
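That sizing rule can be sketched as a small calculation. The Python below is one possible way to do it, not a MySQL feature: it divides the total row count by a target average partition size, rounds up, and then moves to the next prime. The function names and the target sizes are assumptions for illustration.

```python
import math


def is_prime(n: int) -> bool:
    """Trial-division primality check; fine for small partition counts."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def suggest_partitions(total_rows: int, target_rows_per_partition: int) -> int:
    """Smallest prime partition count whose average partition size
    does not exceed the target."""
    # Integer ceiling division avoids floating-point rounding.
    n = max(1, -(-total_rows // target_rows_per_partition))
    while not is_prime(n):
        n += 1
    return n


# 30 million rows with a target of about 300k rows per partition.
print(suggest_partitions(30_000_000, 300_000))

# The same table with a target of about 75k rows per partition.
print(suggest_partitions(30_000_000, 75_000))
```

For the 30-million-row table this suggests 101 partitions at a 300k target and 401 at a 75k target. Whatever number comes out still has to stay within MySQL's partition limit.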

> Shouldn't the amount of partitions grow over time? If so: How can I automate that?

The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grow that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.

That said, you can re-partition a table with ALTER TABLE:

ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;

This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.

You may want to monitor the size of data and indexes in partitions:

SELECT table_schema, table_name, partition_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;

As with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.

If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html

I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.

answered Sep 22 '22 10:09

Bill Karwin
