I am new to postgres and am experimenting with the hstore extension.Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount data in the format "Timestamp, Value" for each product. This data is available in a csv fle for each product.
I am thinking of using hstore to store this data in the key value format. Assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, say what was the value of a product at a given time? Also need to run simple queries like retrieving the times where the product costed more than $100. I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
Postgres' built-in partitioning is super useful for managing time series data. By partitioning your Postgres table on a time column by range (thereby creating a time-partitioned table), you can have a table with much smaller partition tables and much smaller indexes on those partitions—instead of a single huge table.
hstore is a PostgreSQL extension that implements the hstore data type. It's a key-value data type for PostgreSQL that's been around since before JSON and JSONB data types were added.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
product TEXT, -- I'm making an assumption about the types of your columns
time TIMESTAMP,
value DOUBLE PRECISION,
PRIMARY KEY (product, time);
);
I would definitely keep hstore
and similar options in mind, if and when your data becomes large enough that efficiency is more important and simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value)
, if the product has many distinct timestamps.
Other options
hstore
is most useful if you want to store a table set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's table. The downsides are that keys and values in hstore
are text, whereas your keys are timestamps, and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in type casting cost required. Another possible downside is that some queries on the hstore
might not use indexes very efficiently. The above table can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited; you can use a gist or gin index on an hstore column to find all the rows that feature a certain key.
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
product TEXT,
day DATE,
values DOUBLE PRECISION[], -- An array from 0 to 23.
PRIMARY KEY (product, day);
);
You can construct views and indexes to make querying this table moderate easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With