What is the more efficient (in terms of query performance) database table design - long or wide?
I.e., this
id size price
1 S 12.4
1 M 23.1
1 L 33.3
2 S 3.3
2 M 5.3
2 L 11.0
versus this
id S M L
1 12.4 23.1 33.3
2 3.3 5.3 11.0
Generally (I reckon) it comes down to comparing the performance of GROUP BY with that of selecting the columns directly:
SELECT AVG(price) FROM table GROUP BY size
or
SELECT AVG(S), AVG(M), AVG(L) FROM table
The second one is a bit longer to write (when there are many columns), but what about the performance of the two? If possible, what are the general advantages/disadvantages of each of these table formats?
The long format is more flexible in use. It allows you to filter on size, for example:
SELECT MAX(price) FROM table WHERE size = 'L'
It also allows indexing on size and on id. This speeds up the GROUP BY and any queries where other tables, such as a product stock table, are joined on id and/or size.
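To make the filtering and indexing points concrete, here is a minimal sketch using Python's built-in sqlite3 module and the sample rows from the question. The table name prices_long and the index names are my own assumptions (the question just calls it "table"), and whether a given DBMS actually uses these indexes depends on its planner and on the data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Long format: one row per (id, size) pair. "prices_long" is a hypothetical name.
conn.execute("CREATE TABLE prices_long (id INTEGER, size TEXT, price REAL)")
conn.executemany(
    "INSERT INTO prices_long VALUES (?, ?, ?)",
    [(1, "S", 12.4), (1, "M", 23.1), (1, "L", 33.3),
     (2, "S", 3.3), (2, "M", 5.3), (2, "L", 11.0)],
)

# Indexes on size (with price, so the aggregate can be read from the index) and on id.
conn.execute("CREATE INDEX idx_size_price ON prices_long (size, price)")
conn.execute("CREATE INDEX idx_id ON prices_long (id)")

# Filtering on size is a plain WHERE clause in the long format.
max_large = conn.execute(
    "SELECT MAX(price) FROM prices_long WHERE size = 'L'"
).fetchone()[0]

# The GROUP BY from the question, returned as {size: average price}.
avg_by_size = dict(conn.execute(
    "SELECT size, AVG(price) FROM prices_long GROUP BY size"
).fetchall())
```

In the wide format the same filter would have to be expressed by picking a column (SELECT MAX(L) ...), so the size cannot be passed in as a query parameter.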
First of all, these are two different data models suitable for different purposes.
That being said, I'd expect (see note 1) the second model to be faster for aggregation, simply because the data is packed more compactly and therefore needs less I/O. In the first model, the GROUP BY can be satisfied by scanning an index on {size, price}; the alternative to the index is too slow when the data is too large to fit in RAM. In the second model, the query needs only a scan of the table itself, with no index (see note 2).
Since the first approach requires table + index and the second one just the table, cache utilization is better in the second case. Even if we disregard caching and compare the index (without the table) in the first model with the table in the second model, I suspect the index will be larger than the table, simply because it physically records the size and has the unused "holes" typical of B-Trees (though the same is true for the table if it is clustered).
And finally, the second model does not have the index maintenance overhead, which could impact the INSERT/UPDATE/DELETE performance.
Other than that, you can consider caching the SUM and COUNT in a separate table containing just one row. Update both the SUM and COUNT via triggers whenever a row is inserted, updated or deleted in the main table. You can then easily get the current AVG, simply by dividing SUM and COUNT.
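A rough sketch of that trigger idea, using Python's sqlite3 module for illustration. The names prices_wide, s_stats and the trigger names are assumptions of mine, only the S column is tracked to keep it short, and the columns are declared NOT NULL so the running total never has to deal with NULLs. Trigger syntax differs between DBMSs, so treat this as an outline rather than something to paste in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices_wide (
    id INTEGER PRIMARY KEY,
    S REAL NOT NULL, M REAL NOT NULL, L REAL NOT NULL
);

-- One-row cache holding SUM(S) and COUNT(*) (hypothetical table).
CREATE TABLE s_stats (total REAL NOT NULL, n INTEGER NOT NULL);
INSERT INTO s_stats VALUES (0, 0);

CREATE TRIGGER s_ins AFTER INSERT ON prices_wide BEGIN
    UPDATE s_stats SET total = total + NEW.S, n = n + 1;
END;
CREATE TRIGGER s_upd AFTER UPDATE OF S ON prices_wide BEGIN
    UPDATE s_stats SET total = total - OLD.S + NEW.S;
END;
CREATE TRIGGER s_del AFTER DELETE ON prices_wide BEGIN
    UPDATE s_stats SET total = total - OLD.S, n = n - 1;
END;
""")

def cached_avg_s(conn):
    """Return the cached average of S, or None if the table is empty."""
    total, n = conn.execute("SELECT total, n FROM s_stats").fetchone()
    return total / n if n else None

# Exercise all three triggers.
conn.executemany(
    "INSERT INTO prices_wide VALUES (?, ?, ?, ?)",
    [(1, 12.4, 23.1, 33.3), (2, 3.3, 5.3, 11.0)],
)
conn.execute("UPDATE prices_wide SET S = 5.3 WHERE id = 2")
conn.execute("DELETE FROM prices_wide WHERE id = 1")

cached = cached_avg_s(conn)
actual = conn.execute("SELECT AVG(S) FROM prices_wide").fetchone()[0]
```

Reading the average then touches a single row instead of scanning the table, at the cost of extra work on every write. A floating-point running total can also drift slightly over many updates, so an occasional full recomputation may be worth considering.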
1 But you should really measure on representative amounts of data to be sure.
2 Since there is no WHERE clause in your query, all rows will be scanned. Indexes are only useful for getting a relatively small subset of the table's rows (and sometimes for index-only scans). As a rough rule of thumb, if more than 10% of the rows in the table are needed, indexes won't help and the DBMS will often opt for a full table scan even when indexes are available.
The first option results in more rows and will generally be slower than the second option.
However, as Deltalima also indicated, the first option is more flexible. Not only when it comes to different query options, but also if/when you one day need to extend the table with other sizes, colors etc.
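To illustrate the extensibility point, here is a small sketch (Python's sqlite3, with hypothetical table names prices_long and prices_wide and a made-up XL price of 41.0) of what adding a new size looks like in each design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices_long (id INTEGER, size TEXT, price REAL);
CREATE TABLE prices_wide (id INTEGER PRIMARY KEY, S REAL, M REAL, L REAL);
INSERT INTO prices_long VALUES (1, 'S', 12.4), (1, 'M', 23.1), (1, 'L', 33.3);
INSERT INTO prices_wide VALUES (1, 12.4, 23.1, 33.3);
""")

# Long format: a new size is just another row, no schema change.
conn.execute("INSERT INTO prices_long VALUES (1, 'XL', 41.0)")

# Wide format: a new size means changing the schema, then filling the column.
conn.execute("ALTER TABLE prices_wide ADD COLUMN XL REAL")
conn.execute("UPDATE prices_wide SET XL = 41.0 WHERE id = 1")

long_sizes = [row[0] for row in conn.execute(
    "SELECT DISTINCT size FROM prices_long ORDER BY size")]
wide_columns = [row[1] for row in conn.execute(
    "PRAGMA table_info(prices_wide)")]
```

In the wide design, every query that lists the size columns (such as the AVG(S), AVG(M), AVG(L) one from the question) also has to be edited, whereas the GROUP BY query on the long table picks up the new size unchanged.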
Unless you have a very large dataset or need ultra-fast lookup time, you'll probably be better off with the first option.
If you do have or need a very large dataset, you may be better off creating a table with pre-calculated summary values.
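One possible shape for such a summary table, sketched with Python's sqlite3. The names prices_long, price_summary and refresh_summary are assumptions, and the refresh here is a simple drop-and-rebuild that you would run on whatever schedule fits your data; other DBMSs offer materialized views for the same purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices_long (id INTEGER, size TEXT, price REAL)")
conn.executemany(
    "INSERT INTO prices_long VALUES (?, ?, ?)",
    [(1, "S", 12.4), (1, "M", 23.1), (1, "L", 33.3),
     (2, "S", 3.3), (2, "M", 5.3), (2, "L", 11.0)],
)
conn.commit()

def refresh_summary(conn):
    """Rebuild the pre-calculated per-size summary from the detail table."""
    conn.executescript("""
    DROP TABLE IF EXISTS price_summary;
    CREATE TABLE price_summary AS
        SELECT size, COUNT(*) AS n, AVG(price) AS avg_price
        FROM prices_long
        GROUP BY size;
    """)

refresh_summary(conn)

# Readers query the small summary table instead of aggregating the detail rows.
summary = {
    size: (n, avg_price)
    for size, n, avg_price in conn.execute(
        "SELECT size, n, avg_price FROM price_summary")
}
```

The summary is only as fresh as the last refresh, so this suits reporting-style queries better than ones that must reflect every write immediately.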