Is there a standard approach for dealing with unordered arrays (sets) in PostgreSQL?

Tags:

I have a table that contains pairs of words in two separate columns. The order of the words is often important, but there are times when I simply want to aggregate based on the two words, regardless of order. Is there a simple way to treat two rows with the same words but with different orders (one row the opposite of the other) as the same "set"? In other words, treat:

apple orange
orange apple

as:

(apple,orange)
(apple,orange)

862

asked Jul 28 '14 01:07

seandavi

1 Answers

There's no built-in way at this time.

As arrays

If you consistently normalize them on save you can treat arrays as sets, by always storing them sorted and de-duplicated. It'd be great if PostgreSQL had a built-in C function to do this, but it doesn't. I took a look at writing one but the C array API is horrible, so even though I've written a bunch of extensions I just backed carefully away from this one.

If you don't mind moderately icky performance you can do it in SQL:

CREATE OR REPLACE FUNCTION array_uniq_sort(anyarray) RETURNS anyarray AS $$
SELECT array_agg(DISTINCT f ORDER BY f) FROM unnest($1) f;
$$ LANGUAGE sql IMMUTABLE;

then wrap all saves in calls to array_uniq_sort or enforce it with a trigger. You can then just compare your arrays for equality. You could avoid the array_uniq_sort calls for data from the app if you instead just did the sort/unique on the app side instead.

If you do this please store your "sets" as array columns, like text[], not comma- or space-delimited text. See this question for some of the reasons.

You need to watch out for a few things, like the fact that casts between arrays are stricter than casts between their base types. E.g.:

regress=> SELECT 'a' = 'a'::varchar, 'b' = 'b'::varchar;
 ?column? | ?column? 
----------+----------
 t        | t
(1 row)

regress=> SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
ERROR:  operator does not exist: text[] = character varying[]
LINE 1: SELECT ARRAY['a','b'] = ARRAY['a','b']::varchar[];
                              ^
HINT:  No operator matches the given name and argument type(s). You might need to add explicit type casts.
regress=> SELECT ARRAY['a','b']::varchar[] = ARRAY['a','b']::varchar[];
 ?column? 
----------
 t
(1 row)

Such columns are GiST-indexable for operations like array-contains or array-overlaps; see the PostgreSQL documentation on array indexing.

As normalized rows

The other option is to just store normalized rows with a suitable key. I'd still use array_agg for sorting and comparing them, as SQL set operations can be clunky to use for this (especially given the lack of an XOR / double-sided set difference operation).

This is generally known as EAV (entity-attribute-value). I'm not a fan myself, but it does have its place occasionally. Except you'd be using it without the value component.

You create a table:

CREATE TABLE item_attributes (
    item_id integer references items(id),
    attribute_name text,
    primary key(item_id, attribute_name)
);

and insert a row for each set entry for each item, instead of having each item have an array-valued column. The unique constraint enforced by the primary key ensures that no item may have duplicates of a given attribute. Attribute ordering is irrelevant/undefined.

Comparisions can be done with SQL set operators like EXCEPT, or using array_agg(attribute_name ORDER BY attribute_name) to form consistently sorted arrays for comparison.

Indexing is limited to determining whether a given item has/doesn't have a given attribute.

Personally I'd use arrays over this approach.

hstore

You can also use hstores with empty values to store sets, as hstore de-duplicates keys. 9.4's jsonb will also work for this.

regress=# create extension hstore;
CREATE EXTENSION
regress=# SELECT hstore('a => 1, b => 1') = hstore('b => 1, a => 1, b => 1');
 ?column? 
----------
 t
(1 row)

It's only really useful for text types, though. e.g.:

regress=# SELECT hstore('"1.0" => 1, "2.0" => 1') = hstore('"1.00" => 1, "1.000" => 1, "2.0" => 1');
 ?column? 
----------
 f
(1 row)

and I think it's ugly. So again, I'd favour arrays.

For integer arrays only

The intarray extension provides useful, fast functions for treating arrays as sets. They're only available for integer arrays but they're really useful.

136

answered Nov 10 '22 16:11

Craig Ringer

Related questions
                            
                                Generate unique random numbers using SQL
                            
                                Accessing data with DAL2: Value cannot be null.Parameter name: con
                            
                                How to multiply a single row with a number from column in sql
                            
                                Symfony2 and Doctrine: OneToMany relation duplicates data in the table
                            
                                how to execute SQL statements in command prompt (CMD) [closed]
                            
                                Find all foreign keys constraints in database referencing a certain primary key [duplicate]
                            
                                Query returns "LIKE" results despite not having wildcards?
                            
                                sql linked server join query
                            
                                Display the results of a SELECT COUNT(*) in PHP
                            
                                how to create a Foreign-Key constraint to a subset of the rows of a table?
                            
                                sql server displaying missing dates
                            
                                How to see queries executing on Informix DB
                            
                                MySQL LOAD DATA Error (Errcode: 2 - "No such file or directory")
                            
                                SQL Server (0x80131904): String or binary data would be truncated [duplicate]
                            
                                How to order the fields getting aggregated by XMLAGG function?
                            
                                Is there sql WITH clause equivalent in hive?
                            
                                How to use GROUP BY to concatenate strings while joining multiple tables?
                            
                                Remove all duplicates from Redshift database
                            
                                How to find most overlapping time period with date ranges
                            
                                Find Partition Schema Definitions in SQL Server Database

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With