Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL: Long table vs wide table

What is the more efficient (in terms of query performance) database table design - long or wide?

I.e., this

id size price
1  S    12.4  
1  M    23.1
1  L    33.3
2  S    3.3
2  M    5.3
2  L    11.0

versus this

id  S     M     L
1   12.4  23.1  33.3
2   3.3   5.3   11.0

Generally (I reckon) it comes down to the comparison of performance between GROUP BY and selecting the columns directly:

SELECT AVG(price) FROM table GROUP BY size

or

SELECT AVG(S), AVG(M), AVG(L) FROM table

Second one is a bit longer to write (in terms of many columns), but what about the performance of the two? If possible, what are the general advantages/disadvantages of each of these tables formats?

like image 253
sashkello Avatar asked Apr 15 '13 23:04

sashkello


3 Answers

The long is more flexible in use. It allows you to filter on size for example

SELECT MAX(price) where size='L' 

Also it allows for indexing on the size and on the id. This speeds up the GROUP BY and any queries where other tables are joined on id and/or size such a product stock table.

like image 22
DeltaLima Avatar answered Sep 27 '22 17:09

DeltaLima


First of all, these are two different data models suitable for different purposes.

That being said, I'd expect1 the second model will be faster for aggregation, simply because the data is packed more compactly, therefore needing less I/O:

  • The GROUP BY in the first model can be satisfied by a full scan on the index {size, price}. The alternative to index is too slow when the data is too large to fit in RAM.
  • The query in the second model can be satisfied by a full table scan. No index needed2.

Since the first approach requires table + index and the second one just the table, the cache utilization is better in the second case. Even if we disregard caching and compare the index (without table) in the first model with the table in the second model, I suspect the index will be larger than the table, simply because it physically records the size and has unused "holes" typical for B-Trees (though the same is true for the table if it is clustered).

And finally, the second model does not have the index maintenance overhead, which could impact the INSERT/UPDATE/DELETE performance.

Other than that, you can consider caching the SUM and COUNT in a separate table containing just one row. Update both the SUM and COUNT via triggers whenever a row is inserted, updated or deleted in the main table. You can then easily get the current AVG, simply by dividing SUM and COUNT.


1 But you should really measure on representative amounts of data to be sure.

2 Since there is no WHERE clause in your query, all rows will be scanned. Indexes are only useful for getting a relatively small subset of table's rows (and sometimes for index-only scans). As a rough rule of thumb, if more than 10% of rows in the table are needed, indexes won't help and the DBMS will often opt for a full table scan even when indexes are available.

like image 64
Branko Dimitrijevic Avatar answered Sep 27 '22 16:09

Branko Dimitrijevic


The first option results in more rows and will generally be slower than the second option.

However, as Deltalima also indicated, the first option is more flexible. Not only when it comes to different query options, but also if/when you one day need to extend the table with other sizes, colors etc.

Unless you have a very large dataset or need ultra-fast lookup time, you'll probably be better off with the first option.

If you do have or need a very large dataset, you may be better off creating a table with pre-calculated summary values.

like image 26
Martin Israelsen Avatar answered Sep 27 '22 17:09

Martin Israelsen