In PostgreSQL, are partitions or multiple databases more efficient?

I have an application in which many companies post information. The data from each company is self-contained; there is no data overlap.

Performance-wise, is it better to:

  • keep the company ID on each row of each table and have each index lead with it,
  • partition each table by company ID,
  • partition and create a user per company to ensure security, or
  • create multiple databases, one for each company?
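For concreteness, the first option might look something like this (table and column names are purely illustrative):

```sql
-- Option 1: shared tables, with company_id leading every index
CREATE TABLE invoices (
    company_id  integer NOT NULL,
    invoice_id  serial,
    total       numeric(12,2),
    PRIMARY KEY (company_id, invoice_id)
);

CREATE INDEX invoices_company_total_idx ON invoices (company_id, total);

-- Every query must then filter on company_id:
SELECT * FROM invoices WHERE company_id = 42 AND total > 100;
```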

It is a web-based application with persistent connections.

My thoughts:

  • new PG connections are expensive, so a single database means fewer new connections
  • having only one copy of the system catalogs seems more efficient than 200 or so
  • multiple databases are certainly safer against programmer error
  • if the application specs change so that companies share data, multiple databases would be difficult to work with
asked Dec 08 '11 by cc young
1 Answer

I'd recommend searching for info on the PostgreSQL mailing lists about multi-tenanted design. There's been lots of discussion there, and the answer boils down to "it depends". There are trade-offs every way between guaranteed isolation, performance, and maintainability.

A common approach is to use a single database, but one schema (namespace) per customer with the same table structure in each schema, plus a shared or common schema for data that's the same across all of them. A PostgreSQL schema is like a MySQL "database" in that you can query across different schemas, but they're isolated by default. With customer data in separate schemas you can use the search_path setting, usually via ALTER USER customername SET search_path = customerschema, sharedschema (note that search_path takes a comma-separated list of identifiers, not a single quoted string), to ensure each customer sees their data and only their data.
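A minimal sketch of that layout (schema and role names are made up for illustration):

```sql
-- One schema per customer, plus a shared schema
CREATE SCHEMA shared;
CREATE SCHEMA customer_a;

CREATE ROLE customer_a LOGIN;

-- Each customer sees their own schema first, then the shared one.
ALTER USER customer_a SET search_path = customer_a, shared;
```

With that in place, an unqualified `SELECT * FROM invoices` run as customer_a resolves to `customer_a.invoices` if it exists, falling back to `shared.invoices`.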

For additional protection, you should REVOKE ALL ON SCHEMA customerschema FROM public, then GRANT ALL ON SCHEMA customerschema TO thecustomer so they're the only one with any access to it, doing the same for each of their tables. Your connection pool can then log in with a fixed user account that has no GRANTed access to any customer schema but has the right to SET ROLE to become any customer. (Do that by giving it membership of each customer role with NOINHERIT set, so rights have to be explicitly claimed via SET ROLE.) The connection should immediately SET ROLE to the customer it's currently operating as. That'll allow you to avoid the overhead of making new connections for each customer while maintaining strong protection against programmer error leading to access to the wrong customer's data. So long as the pool does a DISCARD ALL and/or a RESET ROLE before handing connections out to the next client, that's going to give you very strong isolation without the frustration of individual connections per user.
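Spelled out in SQL (role and schema names hypothetical), the lock-down and role-switching steps look roughly like:

```sql
-- Lock the schema down to its owning customer
REVOKE ALL ON SCHEMA customer_a FROM public;
GRANT ALL ON SCHEMA customer_a TO customer_a;

-- Pool user: no direct rights, but may become any customer role.
-- NOINHERIT means membership rights are not picked up automatically;
-- they must be claimed explicitly via SET ROLE.
CREATE ROLE pool_user LOGIN NOINHERIT;
GRANT customer_a TO pool_user;

-- Per request, before running any queries:
SET ROLE customer_a;
-- ... run the customer's queries ...
RESET ROLE;  -- or DISCARD ALL when returning the connection to the pool
```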

If your web app environment doesn't have a decent connection pool built-in (say, you're using PHP with persistent connections) then you really need to put a good connection pool in place between Pg and the web server anyway, because too many connections to the backend will hurt your performance. PgBouncer and PgPool-II are the best options, and handily can take care of doing the DISCARD ALL and RESET ROLE for you during connection hand-off.
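For example, a PgBouncer setup in session-pooling mode might include something like the following (a sketch only; host, database name, and pool sizes are placeholders to adjust):

```ini
; pgbouncer.ini (fragment) -- illustrative values only
[databases]
myapp = host=127.0.0.1 port=5432 dbname=myapp

[pgbouncer]
pool_mode = session
max_client_conn = 500
default_pool_size = 20
; Clean connection state (including SET ROLE) between clients
server_reset_query = DISCARD ALL
```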

The main downside of this approach is the overhead with maintaining that many tables, since your base set of non-shared tables is cloned for each customer. It'll add up as customer numbers grow, to the point where the sheer number of tables to examine during autovacuum runs starts to get expensive and where any operation that scales based on the total number of tables in the DB slows down. This is more of an issue if you're thinking of having many thousands or tens of thousands of customers in the same DB, but I strongly recommend you do some scaling tests with this design using dummy data before committing to it.

The ideal approach is likely to be single tables with automatic row-level security controlling tuple visibility, but unfortunately that's something PostgreSQL doesn't have yet. It looks like it's on the way thanks to the SEPostgreSQL work adding suitable infrastructure and APIs, but it's not in 9.1.

answered Nov 15 '22 by Craig Ringer