I'm looking to create denormalized access to data essentially for reporting purposes (and thus to avoid joins and gain performance). I have two solutions in mind, but I'm looking for (a) other potential solutions, and (b) what tradeoffs I should be considering. I'm using SQL Server 2008 R2.
In one solution, I could create an indexed view over the query that does the joins I care about. My understanding is that this does materialize under the hood, but it is tricky and might not guarantee good performance (and there's a vociferous debate about the performance of views).
In another solution, I could build the machinery to create a table, populate it with the data I care about, and in a transaction swap it out for the existing table.
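To make that second option concrete, here is the rough shape of the swap I have in mind (table and column names are just placeholders, not my real schema):
-- Build the new denormalized snapshot off to the side.
SELECT o.customer_id, c.customer_name, o.order_total
INTO dbo.report_orders_staging
FROM dbo.orders AS o
JOIN dbo.customers AS c ON c.customer_id = o.customer_id
GO
-- Swap it for the live table in one transaction.
BEGIN TRANSACTION
EXEC sp_rename 'dbo.report_orders', 'report_orders_old'
EXEC sp_rename 'dbo.report_orders_staging', 'report_orders'
COMMIT TRANSACTION
GO
DROP TABLE dbo.report_orders_old
GO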
The former seems risky/magical to me; the latter seems janky, error prone, and likely to impact things like query plans. Can someone help shed light on this please?
Denormalization has a place with SQL and NoSQL databases, as well as in data warehousing.
Normalization uses memory efficiently and can therefore perform well; denormalization, on the other hand, introduces some wasted memory through redundancy. Normalization also maintains data integrity: any addition or deletion of data in one table will not create a mismatch in the relationships between the tables.
The dimension tables of descriptive information are highly denormalized with detailed and hierarchical roll-up attributes in the same table. Meanwhile, the fact tables with performance metrics are typically normalized.
Views are inlined into the query plan at a very early stage in the optimization pipeline. Neither do they hurt nor do they improve performance.
Indexed views are also inlined. It doesn't matter whether you have written your query to reference a view or whether you have pasted the view definition. On Enterprise Edition, the optimizer tries to match parts of the query against indexed views later in the pipeline. This is indeed unreliable. For simple cases it works fine but I have seen it fail randomly. There is no guarantee.
There is a solution for that: WITH (NOEXPAND) forces the optimizer not to inline the view but to use its index instead. This is 100% reliable.
If you can fit your query pattern into an indexed view and are able to use that hint (you can create a wrapper view that always contains that hint) then indexed views are a nice automatic and consistent solution.
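To illustrate (hypothetical names; assume dbo.order_totals is an indexed view grouping dbo.orders by customer_id):
-- On Enterprise Edition the optimizer may, or may not, rewrite this to use the indexed view:
SELECT customer_id, SUM(amount) AS total_amount
FROM dbo.orders
GROUP BY customer_id
GO
-- With the hint, the view's clustered index is read directly, on any edition:
SELECT customer_id, total_amount
FROM dbo.order_totals WITH (NOEXPAND)
GO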
The former seems risky/magical to me
With that hint, there is little surprise left in indexed views. If it helps: I have production experience with indexed views. They work fine.
the latter seems janky, error prone, and likely to impact things like query plans. Can someone help shed light on this please?
That is true. Every bug has the potential to cause data corruption. You had better not forget any place where data is written, or your denormalized data will become inconsistent.
For these reasons, I recommend using indexed views unless there is a good reason against them.
Firstly, indexed views have two gotchas:
1) An updated row in the base table must be able to propagate through to the indexed view without having to reference any other rows in the base table. This is why you can use SUM() and COUNT_BIG(), since those aggregates can be updated just by knowing what is in the changed rows, but you can't use MIN() or MAX(), since you'd have to look back through the base table to make sure (see the sketch after this list).
All the seemingly arbitrary restrictions on indexed views (no TVFs, no subqueries, no unions, etc.) boil down to the above reason. The engine has to be able to maintain the index without constantly looking at the whole base table.
2) AFTER triggers on the base table still get processed before the indexed views are updated.
These limit the complexity of what you can accomplish with indexed views alone.
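To make gotcha 1 concrete, here is a small sketch with hypothetical names: an otherwise identical view is indexable with SUM()/COUNT_BIG() but rejected with MIN():
CREATE TABLE dbo.sales (customer_id int NOT NULL, amount money NOT NULL)
GO
CREATE VIEW dbo.sales_totals WITH SCHEMABINDING AS
SELECT
customer_id
,SUM(amount) AS total_amount
,COUNT_BIG(*) AS row_count --COUNT_BIG(*) is mandatory alongside SUM in an indexed view
FROM dbo.sales
GROUP BY customer_id
GO
CREATE UNIQUE CLUSTERED INDEX UC__sales_totals ON dbo.sales_totals (customer_id) --succeeds
GO
CREATE VIEW dbo.sales_minimums WITH SCHEMABINDING AS
SELECT customer_id, MIN(amount) AS min_amount
FROM dbo.sales
GROUP BY customer_id
GO
CREATE UNIQUE CLUSTERED INDEX UC__sales_minimums ON dbo.sales_minimums (customer_id) --fails: MIN/MAX cannot be maintained from the changed rows alone
GO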
Secondly, test that denormalization will actually help. If you are just instantiating simple joins and you are already I/O bound, that will make the bottleneck worse. On the other hand, if you are precomputing large aggregates or taking vertical slices of extremely wide joins, it's more likely to be an improvement.
Thirdly, to use indexed views, use a pattern like this:
CREATE TABLE huge_data_table ( ... )
GO
CREATE VIEW dbo.huge_data_table_monthly_summary_index WITH SCHEMABINDING AS --SCHEMABINDING is required before the view can be indexed
SELECT
YEAR(...) AS [year_...]
,MONTH(...) AS [month_...]
,SUM(...) AS [sum_...]
,COUNT_BIG(*) AS [count_...]
FROM dbo.huge_data_table --two-part table name required under SCHEMABINDING
GROUP BY YEAR(...),MONTH(...)
GO
CREATE UNIQUE CLUSTERED INDEX UC__xyz
ON huge_data_table_monthly_summary_index
([year_...],[month_...])
GO
CREATE VIEW monthly_summary AS
SELECT
[year_...]
,[month_...]
,[sum_...]
,[count_...]
FROM huge_data_table_monthly_summary_index WITH (NOEXPAND) --force use of the view's index
GO
Then you can query monthly_summary and get the aggregates without having to recompute them every time. The engine updates huge_data_table_monthly_summary_index automatically whenever huge_data_table changes.
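For example (keeping the placeholder column names from the sketch above), a reporting query reads the precomputed rows directly:
SELECT [year_...],[month_...],[sum_...],[count_...]
FROM monthly_summary
WHERE [year_...] = 2014 --served from the indexed view, not recomputed from huge_data_table
GO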
It may seem like magic, but it works. Use it as far as you can in lieu of triggers/jobs/etc to synchronize tables. I have found that I can usually break down a complex reporting job into smaller, simpler pieces that can use indexed views, and joining the intermediate results on the fly at reporting time turns out to be fast enough to get the job done.
If that's not enough for your needs, you are probably in the realm of log shipping or mirroring to a separate reporting server. There's not much middle ground where syncing manually makes sense.