Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why does Redshift not need materialized views or indexes?

In the Redshift FAQ under

Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?

It says the following:

Advanced Compression: Columnar data stores can be compressed much more than row-based data stores because similar data is stored sequentially on disk. Amazon Redshift employs multiple compression techniques and can often achieve significant compression relative to traditional relational data stores. In addition, Amazon Redshift doesn't require indexes or materialized views and so uses less space than traditional relational database systems. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.

Why is this the case?

like image 905
m0meni Avatar asked May 31 '16 13:05

m0meni


3 Answers

It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.

Materialised Views

I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.

I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?

Indexes

RedShift does not have indexes.

It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).

It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.

That becomes kind of possible because of how RedShift implements its column store.
- Each column is stored separately from each other column
- Each column is stored in 1MB blocks
- Each 1MB block has summary statistics

As well as being the storage pattern this effectively becomes a set of pseudo indexes.
- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first
- Those stats will say the minimum and maximum values stored by that block
- This allows Redshift to skip many of those blocks in certain conditions
- This intern allows RedShift to identify which blocks to read from the other columns

like image 104
MatBailie Avatar answered Oct 09 '22 14:10

MatBailie


as of dec 2019, Redshift has a preview of materialized views: Announcement

from the documentation: A materialized view contains a precomputed result set, based on a SQL query over one or more base tables. You can issue SELECT statements to query a materialized view, in the same way that you can query other tables or views in the database. Amazon Redshift returns the precomputed results from the materialized view, without having to access the base tables at all. From the user standpoint, the query results are returned much faster compared to when retrieving the same data from the base tables.

like image 24
yamspog Avatar answered Oct 09 '22 13:10

yamspog


Indexes are basically used in OLTP systems to retrieve a specific or a small group of values. On the contrary, OLAP systems retrieve a large set of values and performs aggregation on the large set of values. Indexes would not be a right fit for OLAP systems. Instead it uses a secondary structure called zone maps with sort keys.

The indexes operate on B trees. The 'life without a btree' section in the below blog explains with examples how an index based out of btree affects OLAP workloads.

https://blog.chartio.com/blog/understanding-interleaved-sort-keys-in-amazon-redshift-part-1

The combination of columnar storage, compression codings, data distribution, compression, query compilations, optimization etc. provides the power to Redshift to be faster.

Implementing the above factors, reduces IO operations on Redshift and eventually providing better performance. To implement an efficient solution, it requires a great deal of knowledge on the above sections and as well as the on the queries that you would run on Amazon Redshift.

for eg. Redshift supports Sort keys, Compound Sort keys and Interleaved Sort keys. If your table structure is lineitem(orderid,linenumber,supplier,quantity,price,discount,tax,returnflat,shipdate). If you select orderid as your sort key but if your queries are based on shipdate, Redshift will be operating efficiently. If you have a composite sortkey on (orderid, shipdate) and if your query only on ship date, Redshift will not be operating efficiently. If you have an interleaved soft key on (orderid, shipdate) and if your query

Redshift does not support materialized views but it easily allows you to create (temporary/permant) tables by running select queries on existing tables. It eventually duplicates data but at the required format to be executed for queries (similar to materialized view) The below blog gives your some information on the above approach.

https://www.periscopedata.com/blog/faster-redshift-queries-with-materialized-views-lifetime-daily-arpu.html

Redshift does fare well with other systems like Hive, Impala, Spark, BQ etc. during one of our recent benchmark frameworks

like image 42
Mukund Avatar answered Oct 09 '22 13:10

Mukund