We have an E-commerce portal with a Postgresql 9.1 database. One very important table has at the moment 32 million records. If we want to deliver all items this table would grow to 320 million records, mostly dates. Which would be to heavy.
So we are thinking about horizontal partitioning / sharding. We can divide items in this table into 12 pieces horizontal (1 per month). What would be the best steps and technics to do so? Would horizontal partitioning within the database be good enough or do we have to start thinking about sharding?
Two common types of horizontal partitioning are range and hash partitioning. On the other hand, vertical partitioning allows a table to be partitioned into disjoint sets of columns.
Horizontal partitioning (often called sharding). In this strategy, each partition is a separate data store, but all partitions have the same schema. Each partition is known as a shard and holds a specific subset of the data, such as all the orders for a specific set of customers.
While 320 million is not small, it's not really huge either.
It largely depends on the queries you run on the table. If you always include the partition key in your queries then "regular" partitioning would probably work.
An example for this can be found in the PostgreSQL wiki:
http://wiki.postgresql.org/wiki/Month_based_partitioning
The manual also explains some of the caveats of partitioning:
http://www.postgresql.org/docs/current/interactive/ddl-partitioning.html
If you are thinking about sharding, you might read how Instagram (which is powered by PostgreSQL) has implemented that:
http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram
If you have mostly read-queries, another option might be to use streaming replication to setup multiple servers and distribute the read queries by connecting to the hot-standby for read access and connecting to the master for write access. I think pg-pool II can do that (somewhat) automatically. That can be combined with partitioning to further reduce the query runtime.
If you are adventurous and don't have very immediate needs to do so, you might also consider Postgres-XC which promises to support transparent horizontal scaling:
http://postgres-xc.sourceforge.net/
There is no final release yet, but it looks like this isn't taking too long
Here is my sample code for partitioning: t_master is a view to be select/insert/update/delete in your application t_1 and t_2 is the underlying tables actually storing the data.
create or replace view t_master(id, col1)
as
select id, col1 from t_1
union all
select id, col1 from t_2
CREATE TABLE t_1
(
id bigint PRIMARY KEY,
col1 text
);
CREATE TABLE t_2
(
id bigint PRIMARY KEY,
col1 text
);
CREATE OR REPLACE FUNCTION t_insert_partition_function()
returns TRIGGER AS $$
begin
raise notice '%s', 'hello';
execute 'insert into t_'
|| ( mod(NEW.id, 2)+ 1 )
|| ' values ( $1, $2 )' USING NEW.id, NEW.col1 ;
RETURN NULL;
end;
$$
LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION t_update_partition_function()
returns TRIGGER AS $$
begin
raise notice '%s', 'hello';
execute 'update t_'
|| ( mod(NEW.id, 2)+ 1 )
|| ' set id = $1, col1 = $2 where id = $1'
USING NEW.id, NEW.col1 ;
RETURN NULL;
end;
$$
LANGUAGE plpgsql;
CREATE OR REPLACE FUNCTION t_delete_partition_function()
returns TRIGGER AS $$
begin
raise notice '%s', 'hello';
execute 'delete from t_'
|| ( mod(OLD.id, 2)+ 1 )
|| ' where id = $1'
USING OLD.id;
RETURN NULL;
end;
$$
LANGUAGE plpgsql;
CREATE TRIGGER t_insert_partition_trigger instead of INSERT
ON t_master FOR each row
execute procedure t_insert_partition_function();
CREATE TRIGGER t_update_partition_trigger instead of update
ON t_master FOR each row
execute procedure t_update_partition_function();
CREATE TRIGGER t_delete_partition_trigger instead of delete
ON t_master FOR each row
execute procedure t_delete_partition_function();
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With