Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Amazon Redshift schema design

We are looking at Amazon Redshift to implement our Data Warehouse and I would like some suggestions on how to properly design Schemas in Redshift, please.

I am completely new to Redshift. In the past when I worked with "traditional" data warehouses, I was used to creating schemas such as "Source", "Stage", "Final", etc. to group all the database objects according to what stage the data was in.

By default, a database in Redshift has a single schema, which is named PUBLIC. So, my question to those who have worked with Redshift, does the approach that I have outlined above apply here? If not, I would love some suggestions.

Thanks.

like image 914
Raj Avatar asked Mar 03 '17 16:03

Raj


Video Answer


2 Answers

With my experience in working with Redshift, I can assert the following points with confidence:

  1. Multiple schema: You should create multiple schema and create tables accordingly. When you'll scale, it'll be easier for you to pin-point where exactly the table is supposed to be. Let us say, you have 3 schema, named production, aggregates and rough. Now, you know that the table production will contain the tables that are not supposed to be changed (mostly OLTP data) - such as user, order, transactions tables. Table aggregates will have aggregated data built over raw tables - such as number of orders placed per user per day per category. Finally, rough will contain any table that doesn't hold a business logic but is required for some temporary work - let us say to check the genre of movies for a list of 1 lakh users, which is shared with you in an excel file. Simply create a table in rough schema, perform your operations and drop the table. Now you very clearly know where you'll find the tables based on whether they are raw, aggregated or simply temporary tables.

  2. Public schema: Forget it exists. Any table that is not preceded with a schema name, gets created there. A lot of clutter - no point in storing any important data there.

  3. Cross schema joins: There's no stopping here. You may join as many tables from as many schema as required. In fact, it is desirable you create dimension tables and join on a PK later, rather than to keep all the information in a single table.

Spend some quality time in designing the schema and underlying table structure. When you expand, it'll be easier for you to classify things better in terms of access control. Do let me know if I've missed some obvious points.

like image 91
Yusuf Hassan Avatar answered Sep 27 '22 16:09

Yusuf Hassan


You can have multiple databases in a Redshift cluster but I would stick with one. You are correct that schemas (essentially namespaces) are a good way to divide things up. You can query across schemas but not databases.

I would avoid using the public schema as managing certain permissions there can be difficult (easier to deny someone access to public than prevent them from being able to create a table for example).

For best results if you have the time, learn about the permissions system up front. You want to create groups that have access to schemas or tables and add/remove users from groups to control what they can do. Once you have that going it becomes pretty easy to manage.

like image 39
systemjack Avatar answered Sep 27 '22 16:09

systemjack