
Representing Sparse Data in PostgreSQL

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:

  1. Store data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that in most RDBMS implementations this is typically very inefficient, since the NULL values usually still take up some space. However, I read an article (can't find the link unfortunately) claiming that PG doesn't use up space for NULL values, making it better suited for storing sparse data.

  2. Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDBMS solution, but there's more complexity and overhead associated with it.
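
A rough sketch of what I mean by option 2 (table and column names here are just illustrative):

CREATE TABLE matrix_row    (row_id    serial PRIMARY KEY);
CREATE TABLE matrix_column (column_id serial PRIMARY KEY);
CREATE TABLE matrix_cell
(
  row_id    integer NOT NULL REFERENCES matrix_row,
  column_id integer NOT NULL REFERENCES matrix_column,
  value     double precision NOT NULL,
  PRIMARY KEY (row_id, column_id)
);
-- only cells that actually hold a value get a row here,
-- so unused features cost nothing beyond the linking table's overhead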

I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.

Are there any other solutions? Which one should I use?

asked Apr 07 '10 by Cerin


4 Answers

I'm assuming you're thinking of sparse matrices in the mathematical sense: http://en.wikipedia.org/wiki/Sparse_matrix (The storage techniques described there are for in-memory storage (fast arithmetic operations), not for persistent storage (low disk usage).)

Since one usually operates on these matrices on the client side rather than on the server side, a SQL ARRAY[] is the best choice!

The question is how to take advantage of the sparsity of the matrix. Here are the results of some investigation.

Setup:

  • Postgres 8.4
  • Matrices with 400*400 elements in double precision (8 bytes) --> 1.28 MB raw size per matrix
  • 33% non-zero elements --> ~427 kB effective size per matrix
  • averaged over ~1000 different randomly populated matrices

Competing methods:

  • Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
  • Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where the non-zero elements are located in the matrix. (One double precision value is 64 times bigger than one bit, so in theory (ignoring overhead) this method should be a win as long as <=98% of the elements are non-zero ;-).) Server side compression is activated.
  • Replace the zeros in the matrix with NULL. (RDBMSs are very efficient at storing NULLs.) Server side compression is activated.

(Indexing the non-zero elements with a second index ARRAY[] is not very promising and was therefore not tested.)

Results:

  • Automatic compression
    • no extra implementation efforts
    • no reduced network traffic
    • minimal compression overhead
    • persistent storage = 39% of the raw size
  • Bitmap
    • acceptable implementation effort
    • network traffic slightly decreased; dependent on sparsity
    • persistent storage = 33.9% of the raw size
  • Replace zeros with NULLs
    • some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
    • no change in network traffic
    • persistent storage = 35% of the raw size

Conclusion: Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and use my test setup with your own sparsity level. But the effect may be smaller than you expect.
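
Setting that storage parameter is a one-line, per-column change (a sketch against the autocompressed table defined below; EXTENDED is already the default for array columns, so you only need it explicitly if you want MAIN):

-- MAIN prefers compression and keeps the data in the main table;
-- EXTENDED (the default) allows both compression and out-of-line TOAST storage
ALTER TABLE _testschema.matrix_sparse_autocompressed
  ALTER COLUMN matdata SET STORAGE MAIN;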

I suggest always using a serialized matrix (e.g. row-major order) plus two integer columns for the matrix dimensions NxM. Since most APIs use textual SQL, you save a lot of network traffic and client memory compared to nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]"!!!

Tebas


-- dense reference: the full matrix, stored out of line but uncompressed
CREATE TABLE _testschema.matrix_dense
(
  matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;


-- relies on the default EXTENDED storage, i.e. automatic TOAST compression
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
  matdata double precision[]
);

-- only the non-zero elements, plus a bitmap marking their positions
CREATE TABLE _testschema.matrix_sparse_bitmap
(
  matdata double precision[],
  bitmap bit varying(8000000)
);

Insert the same matrices into all tables (the concrete representation depends on the particular table). Do not modify the data on the server side afterwards, because updates leave unused but still-allocated pages behind and skew the size measurement; or run a VACUUM if you do.
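
To make the representations concrete, here is what the inserts might look like for a tiny 2x2 example matrix [[1.5, 0], [0, 2.5]] serialized in row-major order (the real test used 400x400 matrices; the literal values here are mine):

-- dense: every element, zeros included
INSERT INTO _testschema.matrix_dense (matdata)
VALUES (ARRAY[1.5, 0, 0, 2.5]::double precision[]);

-- bitmap: only the non-zero values, plus one bit per element
-- (1 = non-zero) marking their positions in row-major order
INSERT INTO _testschema.matrix_sparse_bitmap (matdata, bitmap)
VALUES (ARRAY[1.5, 2.5]::double precision[], B'1001');

-- NULL variant: the zeros inside the array are written as NULL instead,
-- e.g. ARRAY[1.5, NULL, NULL, 2.5]::double precision[]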

SELECT 
pg_total_relation_size('_testschema.matrix_dense') AS dense, 
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed, 
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;
answered by Tebas


A few solutions spring to mind,

1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, and only join the tables you need when querying

2) Use the EAV anti-pattern: create a 'feature' table with a foreign key field referencing your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table (a sketch follows this list)

3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables

4) create a column in your primary table using XML, or VARCHAR, and store some structured text format within it representing your data, create indexes over the data with functional indexes, write functions to update the data (or use the XML functions if you are using that format)

5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated

6) live with lots of empty fields
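
A minimal sketch of option 2, the EAV layout (all names here are illustrative, not taken from any particular schema):

CREATE TABLE item
(
  item_id serial PRIMARY KEY
  -- ...plus the columns every row actually has...
);

CREATE TABLE item_feature
(
  item_id   integer NOT NULL REFERENCES item,
  fieldname text    NOT NULL,
  value     text,
  PRIMARY KEY (item_id, fieldname)
);

-- only features that are actually set get a row; reading them back is a join
SELECT f.fieldname, f.value
FROM item i
JOIN item_feature f USING (item_id)
WHERE i.item_id = 42;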

answered by MkV


A NULL value takes up no space. It takes up one bit in a bitmap in the tuple header, but that bitmap is there regardless.

However, the system can't deal with millions of columns, period. The hard maximum is 1600 columns per table, and in practice you really don't want to go anywhere near that.

If you really need that many in a single table, you need to go with the EAV method, which is basically what you're describing in (2).

If each entry has only relatively few keys, I suggest you look at the "hstore" contrib module, which lets you store this type of data very efficiently, as a third option. It has been enhanced further in the upcoming 9.0 release, so if you are a bit away from production deployment you might want to look directly at that one. It's well worth it in 8.4 as well, though, and it supports some pretty efficient index-based lookups. Definitely worth looking into.
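
A minimal sketch of the hstore approach, assuming the contrib/hstore module is installed (table and column names are illustrative):

CREATE TABLE items
(
  item_id  serial PRIMARY KEY,
  features hstore
);

-- a GIN (or GiST) index speeds up hstore containment queries
-- (and, in newer versions, key-existence checks)
CREATE INDEX items_features_idx ON items USING gin (features);

INSERT INTO items (features) VALUES ('color => red, weight => 10');

SELECT features -> 'color' FROM items WHERE features ? 'color';   -- row has the key
SELECT * FROM items WHERE features @> 'color => red';             -- row contains the pair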

answered by Magnus Hagander


I know this is an old thread, but MadLib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.

answered by Brian Dolan