 

PostgreSQL tuning best practices for data warehousing

Tags:

postgresql

I have found plenty of online and print guides on tuning and optimizing Postgres performance for OLTP applications, but I haven't found anything of the sort specific to data warehousing. Since the workloads differ so much, I'm sure there have to be differences in how the databases are managed and tuned.

Some of my own:

  • I have found from the DDL side that I use indexes a lot more liberally, since I usually only worry about inserts once a day and can drop indexes, batch-load, and rebuild them afterwards (first sketch after this list).

  • I typically add integer surrogate keys to data that has more than one natural key, which makes for faster joins (second sketch below).

  • I will usually define and maintain a very comprehensive date table with prebuilt date manipulations (fiscal date as opposed to calendar date, fiscal year-month, starting day of the week, etc.) and use it liberally instead of calling functions in SELECT lists and WHERE clauses. This usually helps during CPU-bound aggregate queries (third sketch below).
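
For illustration, a minimal sketch of that drop/load/rebuild cycle; the table, index, and staging file names are hypothetical:

```sql
-- Drop the index, bulk-load, then rebuild: the load avoids per-row
-- index maintenance and the rebuild produces a compact index.
DROP INDEX IF EXISTS idx_fact_sales_date;

COPY fact_sales FROM '/staging/sales_2012-09-04.csv' WITH (FORMAT csv);

CREATE INDEX idx_fact_sales_date ON fact_sales (sale_date);
ANALYZE fact_sales;  -- refresh planner statistics after the batch
```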
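The surrogate-key point, sketched: the dimension keeps its composite natural key for lookups, while fact tables join on the small integer (all names hypothetical):

```sql
CREATE TABLE dim_product (
    product_sk  serial PRIMARY KEY,  -- integer surrogate key used in joins
    source_sys  text NOT NULL,
    source_id   text NOT NULL,
    UNIQUE (source_sys, source_id)   -- multi-column natural key, kept for lookups
);
```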
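And one way such a date table could be seeded with generate_series; the fiscal rule here (fiscal year starting July 1) is only an assumption for the example:

```sql
CREATE TABLE dim_date AS
SELECT d::date                                AS calendar_date,
       extract(isodow FROM d)::int            AS iso_day_of_week,
       date_trunc('week', d)::date            AS week_start,
       CASE WHEN extract(month FROM d) >= 7   -- hypothetical fiscal rule
            THEN extract(year FROM d)::int + 1
            ELSE extract(year FROM d)::int
       END                                    AS fiscal_year
FROM   generate_series('2000-01-01'::date,
                       '2030-12-31'::date,
                       interval '1 day') AS d;

ALTER TABLE dim_date ADD PRIMARY KEY (calendar_date);
```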

I was hoping that I would find some information on memory management and other database settings, but I would be happy to hear any useful best practices specific to Postgres-based Data Warehousing.

asked Sep 04 '12 by TheDarkSaint



1 Answer

My experience (admittedly on a pretty small scale when it comes to data warehouses):

  • Like you mention, pre-aggregating data is easily the most important thing, as it reduces the amount of data that needs to be read by many orders of magnitude (first sketch after this list).
  • Avoid lots of short write transactions, subtransactions, and savepoints. This includes exception handling in PL/pgSQL. These burn through the available "transaction ID" space quickly and cause expensive anti-wraparound vacuums that need to rewrite whole tables (second sketch below).
  • I found that partitioning tables so that each partition individually fits in the kernel's cache is good for maintenance and migrations, if you ever need them. This means you can recreate all indexes on a partition with a single sequential scan from disk, instead of one scan per index (third sketch below).
  • Like Chris already mentioned, be generous with work_mem and maintenance_work_mem; if your workload doesn't fit in RAM, keeping more temporary data in memory saves I/O and CPU time thanks to smarter query plans, most importantly HashAggregate (fourth sketch below).
  • If you need to do huge sorts, it can help to buy a dedicated SSD for storing the temporary files (fifth sketch below).
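
To make the pre-aggregation point concrete, a minimal sketch using a materialized view (PostgreSQL 9.3+); the fact table and its columns are hypothetical:

```sql
CREATE MATERIALIZED VIEW sales_daily AS
SELECT date_trunc('day', sold_at)::date AS sale_date,
       store_id,
       sum(amount) AS total_amount,
       count(*)    AS order_count
FROM   sales                 -- hypothetical fact table
GROUP  BY 1, 2;

CREATE INDEX ON sales_daily (sale_date, store_id);

-- After each batch load, rebuild the summary:
REFRESH MATERIALIZED VIEW sales_daily;
```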
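The transaction-ID pressure from the second point can be watched with a standard catalog query; autovacuum forces an aggressive freeze as age(datfrozenxid) approaches autovacuum_freeze_max_age (200 million by default):

```sql
-- How close is each database to transaction-ID wraparound?
SELECT datname, age(datfrozenxid) AS xid_age
FROM   pg_database
ORDER  BY xid_age DESC;
```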
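A sketch of the partitioning idea using declarative partitioning (PostgreSQL 10+; the answer predates that feature and would have used table inheritance instead). Names are hypothetical:

```sql
CREATE TABLE events (
    event_id   bigint,
    event_time timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2012_09 PARTITION OF events
    FOR VALUES FROM ('2012-09-01') TO ('2012-10-01');

-- Index maintenance then touches one cache-sized partition at a time:
CREATE INDEX ON events_2012_09 (event_time);
```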
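The memory settings can also be raised per session for heavy reporting work instead of globally; the values below are illustrative, not recommendations:

```sql
SET work_mem = '512MB';            -- per sort/hash node, so several can add up
SET maintenance_work_mem = '2GB';  -- speeds up CREATE INDEX and VACUUM
```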
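And a sketch of pointing temporary files at a dedicated SSD via a tablespace; the mount path is hypothetical and must already exist and be owned by the postgres user:

```sql
CREATE TABLESPACE fast_tmp LOCATION '/mnt/ssd/pg_tmp';
SET temp_tablespaces = 'fast_tmp';  -- or set it globally in postgresql.conf
```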
answered Sep 28 '22 by intgr