Anyone know anything about OLAP Internals?

I know a bit about database internals. I've actually implemented a small, simple relational database engine before, using ISAM structures on disk and BTree indexes and all that sort of thing. It was fun, and very educational. I know that I'm much more cognizant about carefully designing database schemas and writing queries now that I know a little bit more about how RDBMSs work under the hood.

But I don't know anything about multidimensional OLAP data models, and I've had a hard time finding any useful information on the internet.

How is the information stored on disk? What data structures comprise the cube? If a MOLAP model doesn't use tables, with columns and records, then... what? Especially in highly dimensional data, what kinds of data structures make the MOLAP model so efficient? Do MOLAP implementations use something analogous to RDBMS indexes?

Why are OLAP servers so much better at processing ad hoc queries? The same sorts of aggregations that might take hours to process in an ordinary relational database can be processed in milliseconds in an OLAP cube. What are the underlying mechanics of the model that make that possible?

767

asked Apr 10 '09 04:04

benjismith

2 Answers

I've implemented a couple of systems that mimicked what OLAP cubes do, and here are a couple of things we did to get them to work.

- The core data was held in an n-dimensional array, all in memory, and all the keys were implemented as hierarchies of pointers into the underlying array. That way we could have several different sets of keys for the same data. The data in the array was the equivalent of the fact table, and each cell often held only a couple of values; in one instance these were price and number sold.
- The underlying array was often sparse, so once it was built we removed all the blank cells to save memory. That took lots of hardcore pointer arithmetic, but it worked.
- Because the keys were hierarchical, it was easy to write routines to drill up or down a hierarchy. For instance, we would access a year of data by going through the month keys, which in turn mapped to days and/or weeks. At each level we aggregated the data as part of building the cube, which made calculations much faster.
- We didn't implement any kind of query language, but we did support drill-down on all axes (up to 7 in our biggest cubes), and that was tied directly to the UI, which the users liked.
- We implemented the core in C++. These days I reckon C# could be fast enough, but I'd worry about how to implement sparse arrays.
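
To make the points above concrete, here is a minimal Python sketch, not the original C++ code. It assumes a hypothetical three-axis cube (day, product, store) and stores revenue and units sold per cell. A dict keyed by coordinate tuples stands in for the compacted array, so blank cells simply never exist. The names `SparseCube`, `build_rollups`, and `time_hierarchy`, and the sample data, are all invented for illustration.

```python
from collections import defaultdict


class SparseCube:
    """Stores only non-empty cells, keyed by (day, product, store)."""

    def __init__(self):
        # Each cell holds [revenue, units_sold]; absent keys are blank cells.
        self.cells = {}
        # Pre-aggregated totals per (level, key), filled by build_rollups().
        self.rollups = {}

    def add(self, day, product, store, price, units):
        cell = self.cells.setdefault((day, product, store), [0.0, 0])
        cell[0] += price * units
        cell[1] += units

    def build_rollups(self, hierarchy):
        """hierarchy maps a level name to a function: day -> key at that level."""
        totals = defaultdict(lambda: [0.0, 0])
        for (day, _product, _store), (revenue, units) in self.cells.items():
            for level, to_key in hierarchy.items():
                agg = totals[(level, to_key(day))]
                agg[0] += revenue
                agg[1] += units
        self.rollups = {key: tuple(value) for key, value in totals.items()}

    def total(self, level, key):
        # Drill-up lookups read the precomputed table, not the raw cells.
        return self.rollups.get((level, key), (0.0, 0))


# Days are ISO date strings, so month and year keys are simple prefixes.
time_hierarchy = {
    "month": lambda day: day[:7],
    "year": lambda day: day[:4],
}

cube = SparseCube()
cube.add("2009-03-30", "widget", "north", price=2.5, units=4)
cube.add("2009-04-01", "widget", "north", price=2.5, units=2)
cube.add("2009-04-01", "gadget", "south", price=10.0, units=1)
cube.build_rollups(time_hierarchy)

print(cube.total("month", "2009-04"))  # (15.0, 3)
print(cube.total("year", "2009"))      # (25.0, 7)
```

The trade-off is the one described above: building the roll-ups costs a pass over every cell, but after that a month or year total is a single lookup.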

Hope that helps; it sounds interesting.

123

answered Sep 20 '22 00:09

MrTelly

The book Microsoft SQL Server 2008 Analysis Services Unleashed spells out some of the particularities of SSAS 2008 in decent detail. It's not quite a "here's exactly how SSAS works under the hood", but it's pretty suggestive, especially on the data structure side. (It's not quite as detailed/specific about the exact algorithms.) Here are a few of the things I, as an amateur in this area, gathered from this book. This is all about SSAS MOLAP:

- Despite all the talk about multi-dimensional cubes, fact table (aka measure group) data is still, to a first approximation, stored in basically 2D tables, one row per fact. A number of OLAP operations seem to ultimately consist of iterating over rows in 2D tables.
- The data is potentially much smaller inside MOLAP than inside a corresponding SQL table, however. One trick is that each unique string is stored only once, in a "string store". Data structures can then refer to strings in a more compact form (by string ID, basically). SSAS also compresses rows within the MOLAP store in some form. I assume this shrinking lets more of the data stay in RAM at once, which is good.
- Similarly, SSAS can often iterate over a subset of the data rather than the full dataset. A few mechanisms are in play:
  - By default, SSAS builds a hash index for each dimension/attribute value; it thus knows "right away" which pages on disk contain the relevant data for, say, Year=1997.
  - There's a caching architecture where relevant subsets of the data are stored in RAM separate from the whole dataset. For example, you might have cached a subcube that has only a few of your fields, and that only pertains to the data from 1997. If a query is asking only about 1997, then it will iterate only over that subcube, thereby speeding things up. (But note that a "subcube" is, to a first approximation, just a 2D table.)
  - If you've predefined aggregates, then these smaller subsets can also be precomputed at cube processing time, rather than merely computed/cached on demand.
- SSAS fact table rows are fixed size, which presumably helps in some form. (In SQL, in contrast, you might have variable-width string columns.)
- The caching architecture also means that, once an aggregation has been computed, it doesn't need to be refetched from disk and recomputed again and again.
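
The following Python sketch is a toy illustration of several of the mechanisms listed above: a string store, fixed-size rows, a per-attribute index, and a cache of computed totals. It is not how SSAS is actually implemented. The row layout, the class names `StringStore` and `FactStore`, and the sample facts are all assumptions, and the index points at row numbers in memory rather than pages on disk.

```python
import struct


class StringStore:
    """Keeps each distinct string once and hands out small integer IDs."""

    def __init__(self):
        self.ids = {}
        self.strings = []

    def intern(self, text):
        if text not in self.ids:
            self.ids[text] = len(self.strings)
            self.strings.append(text)
        return self.ids[text]


# Hypothetical fixed-width row: year (uint16), product ID (uint32), amount (float64).
ROW = struct.Struct("<HId")


class FactStore:
    def __init__(self):
        self.strings = StringStore()
        self.rows = bytearray()   # packed rows, ROW.size bytes each
        self.year_index = {}      # year -> list of row numbers
        self.cache = {}           # year -> previously computed total

    def append(self, year, product, amount):
        row_number = len(self.rows) // ROW.size
        self.rows += ROW.pack(year, self.strings.intern(product), amount)
        self.year_index.setdefault(year, []).append(row_number)
        self.cache.pop(year, None)  # a new fact invalidates that year's total

    def total_for_year(self, year):
        if year not in self.cache:
            total = 0.0
            # Visit only the rows the index lists for this year.
            for row_number in self.year_index.get(year, []):
                _year, _product_id, amount = ROW.unpack_from(
                    self.rows, row_number * ROW.size
                )
                total += amount
            self.cache[year] = total
        return self.cache[year]


facts = FactStore()
facts.append(1997, "Chai", 18.0)
facts.append(1997, "Chang", 19.0)
facts.append(1998, "Chai", 18.0)

print(facts.total_for_year(1997))   # 37.0, computed from two indexed rows
print(len(facts.strings.strings))   # 2 distinct strings stored for 3 facts
print(ROW.size)                     # 14 bytes per row, whatever the product name
```

Because every row is the same width, row `n` starts at byte `n * ROW.size`, so the index can jump straight to it. That is one plausible reading of why fixed-size rows help.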

These are some of the factors in play in SSAS anyway. I can't claim that there aren't other vital things as well.

answered Sep 23 '22 00:09

Chris

Tags: algorithm, data-structures, theory, olap, cubes
