
Postgres hstore for time series

I am new to Postgres and am experimenting with the hstore extension, and I'm looking for some guidance. I need to support basic reporting on time series data for various products that we sell. I have a large amount of data in the format "Timestamp, Value" for each product. This data is available in a CSV file for each product.

I am thinking of using hstore to store this data in key-value format, assuming that all the time series data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, e.g. what was the value of a product at a given time? I also need to run simple queries like retrieving the times where the product cost more than $100. I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:

  1. The hstore column needs to be loaded from thousands of timestamp,value records that exist in a csv. The hstore should be appended whenever we get a new csv.
  2. The table needs to store the product id and the corresponding time series data. Can you please advise whether using hstore would be helpful? If so, how can I load the data from CSV as explained above? Also, if there could be any impact on insert/update performance in the hstore as the data grows, please share your experiences.
asked Nov 13 '12 by zing


1 Answer

I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:

CREATE TABLE product_data
(
    product TEXT,  -- I'm making an assumption about the types of your columns
    time TIMESTAMP,
    value DOUBLE PRECISION,

    PRIMARY KEY (product, time)
);
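With this schema, the CSV loading from the question is a job for COPY (or psql's \copy, which reads the file client-side). A sketch, assuming a headerless timestamp,value CSV and a placeholder product id 'widget-a':

```sql
-- Stage the raw rows first, since the file has no product column.
CREATE TEMP TABLE staging (time TIMESTAMP, value DOUBLE PRECISION);

-- In psql; 'widget-a.csv' is a placeholder filename.
\copy staging FROM 'widget-a.csv' WITH (FORMAT csv)

-- Append into the main table, tagging each row with its product.
INSERT INTO product_data (product, time, value)
SELECT 'widget-a', time, value FROM staging;
```

Repeat these steps for each new CSV. Note that the primary key will reject duplicate (product, time) rows, so de-duplicate the staging data first if the same timestamps can be re-delivered.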

I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important than simplicity. But note that all options have an efficiency tradeoff.

Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?

What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
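As a sketch, that query and its supporting index might look like this ('widget-a' is a placeholder product id):

```sql
-- Index on (product, value) so the scan is limited to matching rows.
CREATE INDEX product_data_product_value_idx
    ON product_data (product, value);

-- Times where a single product cost more than $100.
SELECT time
FROM product_data
WHERE product = 'widget-a'
  AND value > 100;
```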

Other options

hstore is most useful if you want to store a set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's hstore. The downsides are that keys and values in hstore are text, whereas your keys are timestamps and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in the cost of type casting.

Another possible downside is that some queries on the hstore might not use indexes very efficiently. The table above can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited: you can use a GiST or GIN index on an hstore column to find all the rows that feature a certain key.
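To make the casting overhead concrete, here is a sketch of the hstore variant (table and column names are hypothetical):

```sql
CREATE EXTENSION hstore;

CREATE TABLE product_series
(
    product TEXT PRIMARY KEY,
    series HSTORE  -- keys and values are both stored as text
);

-- Point lookup: the timestamp must be formatted as a text key,
-- and the text value cast back to a number.
SELECT (series -> '2012-11-13 21:00:00')::DOUBLE PRECISION AS value
FROM product_series
WHERE product = 'widget-a';

-- A GIN index helps key-existence queries, not value comparisons.
CREATE INDEX product_series_gin ON product_series USING gin (series);

SELECT product
FROM product_series
WHERE series ? '2012-11-13 21:00:00';
```

Note there is no indexed way to express "value > 100" against the hstore; that query would have to unpack every row's pairs.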

Another option (which I've played with and used experimentally in some of my databases) is arrays. Basically, each product has an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour of every day, you could use a table like this:

CREATE TABLE product_data
(
    product TEXT,
    day DATE,
    vals DOUBLE PRECISION[],  -- 24 hourly values; VALUES is a reserved word, so avoid it as a column name

    PRIMARY KEY (product, day)
);
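Querying this layout means computing the array index from the timestamp (Postgres arrays are 1-based, so hour h lives at position h + 1). A sketch, assuming the array column is named vals and a placeholder product 'widget-a':

```sql
-- Value of 'widget-a' at 2012-11-13 21:00:00 (hour 21 -> index 22):
SELECT vals[EXTRACT(HOUR FROM TIMESTAMP '2012-11-13 21:00:00')::INT + 1]
FROM product_data
WHERE product = 'widget-a'
  AND day = DATE '2012-11-13';

-- Hours where the value exceeded $100, via generate_subscripts():
SELECT day, i - 1 AS hour, vals[i] AS value
FROM product_data, generate_subscripts(vals, 1) AS i
WHERE product = 'widget-a'
  AND vals[i] > 100;
```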

You can construct views and indexes to make querying this table moderately easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)

But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.

answered Dec 31 '22 by Edmund