Postgres 13.4.
I've got a setup where a Bloom index on a mixture of field types, including uuid could prove very helpful. Out of the box, the Bloom extension supports only int4 and text. How do you add support for additional types?
Per Laurenz Albe's comments, I've reworked the post into a Q & A format.
Every couple of years, I work through that fantastic ten-part series on indexing from the folks at Postgres Professional. I was looking at the final piece on Bloom filters (https://habr.com/en/company/postgrespro/blog/452968/), and spotted an example of how to extended support to additional data types. It caught my eye as I've not figured out CREATE OPERATOR CLASS before. I also haven't found these details mentioned in other discussion of Postgres' Bloom indexes. I figured I'd write up some notes for the sake of the archives and, hopefully, to attract some additional information and/or corrections.
The indexing options in Postgres are definitely one of its ten top five features. The variety of basic index types/frameworks and variations are incredible. Unfortunately, the index-related parts of the pg_catalog are pretty dense. If you've got something you want to index, and the data type isn't supported, it seems that you can add support for additional types with CREATE OPERATOR CLASS. I'd never figured out how to do this, until seeing the examples at https://habr.com/en/company/postgrespro/blog/452968/. While this example, and the code below, are specifically for Bloom filter indexes, I'm hoping that most of the mechanics transfer over to other index types.
I don't need a Bloom filter today, and may never want one. But, let's be honest, they are deeply cool*. And a great feature to know about in Postgres because they solve a problem that you would not normally expect an RDBMS to have a solution for. (Wide columns, random search combinations across many, many, fields.)
I've not seen Bloom filters in RDMBS systems before. I have seen them in in distributed log aggregation systems, where they're helpful at reducing unnecessary and expensive loads of data from remote locations, or from cold storage. Postgres Bloom filters are different than that.
I might have a good use for a Bloom index as we've got a wide table with 13 UUID columns that are searched on in any number of combinations, along with a variety of int4, and other data. I like to experiment with this stuff in advance so that, if a problem comes up, I already have possible tools in mind. And, a Bloom filter is not what a DBA would normally think of immediately. However, a Bloom index is a great example case for CREATE OPERATOR CLASS. To get it up and running requires only three steps:
Install the bloom extension, if it isn't already installed.
Define a new operator class for the data type of interest and the bloom access method. This requires finding a suitable hashing function.
Create your new index using the new operator class.
Assuming you've installed the bloom extension (https://www.postgresql.org/docs/current/bloom.html), here's a little search to show the operator classes associated with bloom indexes:
select pg_am.amname,
pg_opclass.opcintype::regtype,
pg_opclass.opcname,
pg_am.amhandler::regproc
from pg_opclass
left join pg_am on pg_am.oid = pg_opclass.opcmethod
where amname = 'bloom'
order by amname,opcname;
+--------+-----------+---------------+-----------+
| amname | opcintype | opcname | amhandler |
+--------+-----------+---------------+-----------+
| bloom | integer | int4_ops | blhandler |
| bloom | text | text_ops | blhandler |
+--------+-----------+---------------+-----------+
So, no help for uuid data.
Here's the code needed for this simple case:
CREATE OPERATOR CLASS uuid_ops
DEFAULT FOR TYPE uuid USING bloom AS
OPERATOR 1 =(uuid,uuid),
FUNCTION 1 uuid_hash;
I think that the code above means "OPERATOR strategy 1 EQUALITY hash", more-or-less. I'm guessing that adding an operator class to an index type with a wider range of operators is more involved. Bloom filters only support "might exist/does not exist" logic, so there's just the one declaration needed, and the hash function to match. If you run the previous again, it now returns an entry for the new operator class:
+--------+-----------+---------------+-----------+
| amname | opcintype | opcname | amhandler |
+--------+-----------+---------------+-----------+
| bloom | integer | int4_ops | blhandler |
| bloom | text | text_ops | blhandler |
| bloom | uuid | uuid_ops | blhandler |
+--------+-----------+---------------+-----------+
I was told some time back that while not listed in the docs, the Postgres source code includes a wide range of type-specific hash functions. There's a whole stack of type-specific function implemented for hash indexes:
https://doxygen.postgresql.org/hashfunc_8c_source.html
I don't see uuid in there, but a hash is implemented here:
https://github.com/postgres/postgres/blob/REL_13_STABLE/src/backend/utils/adt/uuid.c#L403
All you need to know is the name of the function to supply CREATE OPERATOR CLASS, not where it's implemented in the source.
With this in place, a Bloom index can be created on uuid data. Is this a good idea? No idea. Do I need a Bloom index today? No. But I like to work through Postgres' features to know what's available when I do have a suitable problem. Indexes in Postgres are so much more extensive than in anything else I know, or have seen...they take some study. Below is an example that uses the new operator class on a Bloom index that includes ten uuid fields followed by three int4 fields.
create index analytic_work_bloom_idx on data.analytic_work
using bloom(
id,
hsys_id,
facility_id,
inv_id,
user_id,
activity_id,
assembly_id,
q_event_id,
scan_id,
scase_id,
expected_count,
actual_count,
missing_count)
with (length=96,
col1 = 7,
col2 = 7,
col3 = 7,
col4 = 7,
col5 = 7,
col6 = 7,
col7 = 7,
col8 = 7,
col9 = 7,
col10 = 7,
col11 = 7,
col12 = 7,
col13 = 7);
select pg_size_pretty(pg_total_relation_size('analytic_work_bloom_idx')); --- Curious.
Don't take the example above as a great example of how to figure the right signature length and bytes-per-column for a Bloom index. There are lots of articles out there explaining the mechanics, the theory, and proposing methods for tuning the signature length and bytes-per-column. Past my pay grade.
I had about 600K+ rows in a sample database, and added the Bloom index. Results should be pretty variable but, in my case, I found that:
bloom index on 13 fields was about 4x larger than a B-tree on one uuid field.I'd be glad to see any sort of extra information or corrections that people have to offer.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With