Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is there a standard approach for dealing with unordered arrays (sets) in PostgreSQL?

I have a table that contains pairs of words in two separate columns. The order of the words is often important, but there are times when I simply want to aggregate based on the two words, regardless of order. Is there a simple way to treat two rows with the same words but with different orders (one row the opposite of the other) as the same "set"? In other words, treat:

apple orange
orange apple

as:

(apple,orange)
(apple,orange)
like image 862
seandavi Avatar asked Jul 28 '14 01:07

seandavi


People also ask

Does Postgres array maintain order?

Yes, order is preserved.

Should you use arrays in Postgres?

When you are considering portability (e.g. rewriting your system to work with other databses) then you must not use arrays. If you are sure you'll stick with Postgres, then you can safely use arrays where you find appropriate. They exist for a reason and are neither bad design nor non-compliant.

How do I select an array in PostgreSQL?

PostgreSQL allows us to define a table column as an array type. The array must be of a valid data type such as integer, character, or user-defined types. To insert values into an array column, we use the ARRAY constructor.

What is text [] in PostgreSQL?

PostgreSQL supports a character data type called TEXT. This data type is used to store character of unlimited length. It is represented as text in PostgreSQL. The performance of the varchar (without n) and text are the same. Syntax: variable_name TEXT.


1 Answers

There's no built-in way at this time.

As arrays

If you consistently normalize them on save you can treat arrays as sets, by always storing them sorted and de-duplicated. It'd be great if PostgreSQL had a built-in C function to do this, but it doesn't. I took a look at writing one but the C array API is horrible, so even though I've written a bunch of extensions I just backed carefully away from this one.

If you don't mind moderately icky performance you can do it in SQL:

CREATE OR REPLACE FUNCTION array_uniq_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(DISTINCT f ORDER BY f) FROM unnest($1) f;
$$ LANGUAGE sql IMMUTABLE;

then wrap all saves in calls to array_uniq_sort or enforce it with a trigger. You can then just compare your arrays for equality. You could avoid the array_uniq_sort calls for data from the app if you instead just did the sort/unique on the app side instead.

If you do this please store your "sets" as array columns, like text[], not comma- or space-delimited text. See this question for some of the reasons.

You need to watch out for a few things, like the fact that casts between arrays are stricter than casts between their base types. E.g.:

regress=> SELECT 'a' = 'a'::varchar, 'b' = 'b'::varchar;
 ?column? | ?column? 
----------+----------
 t        | t
(1 row)

regress=> SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
ERROR:  operator does not exist: text[] = character varying[]
LINE 1: SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
                              ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
regress=> SELECT ARRAY['a','b']::varchar[] = ARRAY['a','b']::varchar[];
 ?column? 
----------
 t
(1 row)

Such columns are GiST-indexable for operations like array-contains or array-overlaps; see the PostgreSQL documentation on array indexing.

As normalized rows

The other option is to just store normalized rows with a suitable key. I'd still use array_agg for sorting and comparing them, as SQL set operations can be clunky to use for this (especially given the lack of an XOR / double-sided set difference operation).

This is generally known as EAV (entity-attribute-value). I'm not a fan myself, but it does have its place occasionally. Except you'd be using it without the value component.

You create a table:

CREATE TABLE item_attributes (
    item_id integer references items(id),
    attribute_name text,
    primary key(item_id, attribute_name)
);

and insert a row for each set entry for each item, instead of having each item have an array-valued column. The unique constraint enforced by the primary key ensures that no item may have duplicates of a given attribute. Attribute ordering is irrelevant/undefined.

Comparisions can be done with SQL set operators like EXCEPT, or using array_agg(attribute_name ORDER BY attribute_name) to form consistently sorted arrays for comparison.

Indexing is limited to determining whether a given item has/doesn't have a given attribute.

Personally I'd use arrays over this approach.

hstore

You can also use hstores with empty values to store sets, as hstore de-duplicates keys. 9.4's jsonb will also work for this.

regress=# create extension hstore;
CREATE EXTENSION
regress=# SELECT hstore('a => 1, b => 1') = hstore('b => 1, a => 1, b => 1');
 ?column? 
----------
 t
(1 row)

It's only really useful for text types, though. e.g.:

regress=# SELECT hstore('"1.0" => 1, "2.0" => 1') = hstore('"1.00" => 1, "1.000" => 1, "2.0" => 1');
 ?column? 
----------
 f
(1 row)

and I think it's ugly. So again, I'd favour arrays.

For integer arrays only

The intarray extension provides useful, fast functions for treating arrays as sets. They're only available for integer arrays but they're really useful.

like image 136
Craig Ringer Avatar answered Nov 10 '22 16:11

Craig Ringer