I have pretty complex JSONB stored in one jsonb
column.
DB table looks like:
CREATE TABLE sites (
id text NOT NULL,
doc jsonb,
PRIMARY KEY (id)
)
Data we are storing in doc
column is a complex nested JSONB
data:
{
"_id": "123",
"type": "Site",
"identification": "Custom ID",
"title": "SITE 1",
"address": "UK, London, Mr Tom's street, 2",
"buildings": [
{
"uuid": "12312",
"identification": "Custom ID",
"name": "BUILDING 1",
"deposits": [
{
"uuid": "12312",
"identification": "Custom ID",
"audits": [
{
"uuid": "12312",
"sample_id": "SAMPLE ID"
}
]
}
]
}
]
}
So structure of my JSONB
looks like:
SITE
-> ARRAY OF BUILDINGS
-> ARRAY OF DEPOSITS
-> ARRAY OF AUDITS
We need to implement full text search by some values in each of type of entry:
SITE (identification, title, address)
BUILDING (identification, name)
DEPOSIT (identification)
AUDIT (sample_id)
SQL query should run a full text search in these field values only.
I guess need to use GIN
indexes and something like tsvector
, but do not have enough Postgresql background.
So, my question is it possible to index and then query such nested JSONB
structures?
In PostgreSQL, you use two functions to perform Full Text Search. They are to_tsvector() and to_tsquery(). Let's see how they work and to use them first. to_tsvector() function breaks up the input string and creates tokens out of it, which is then used to perform Full Text Search using the to_tsquery() function.
Querying the JSON document PostgreSQL has two native operators -> and ->> to query JSON documents. The first operator -> returns a JSON object, while the operator ->> returns text. These operators work on both JSON as well as JSONB columns. There are additional operators available for JSONB columns.
Introduction. Full-text search (FTS) is a technique used by search engines to find results in a database. It can be used to power search results on websites like shops, search engines, newspapers, and more.
JSONB and Indexes PostgreSQL can use indexes for the text results as compare operands. GIN index can be used by the GIN JSONB operator class.
Let's add new column of tsvector
type:
alter table sites add column tsvector tsvector;
Now let's create a trigger that will collect lexems, organize them and put to our tsvector. We will use 4 groups (A, B, C, D) -- this a special tsvector's feature that allows you to distinguish lexems later, at search time (see examples in manual https://www.postgresql.org/docs/current/static/textsearch-controls.html; unfortunately, this feature support only up to 4 groups becase developers reserved only 2 bits for that, but we are lucky here, we need only 4 groups):
create or replace function t_sites_tsvector() returns trigger as $$
declare
dic regconfig;
part_a text;
part_b text;
part_c text;
part_d text;
begin
dic := 'simple'; -- change if you need more advanced word processing (stemming, etc)
part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', '');
select into part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ')
from (
select
jsonb_array_elements((new.doc->'buildings'))->>'identification',
jsonb_array_elements((new.doc->'buildings'))->>'name'
) _(a, b);
select into part_c string_agg(coalesce(c, ''), ' ')
from (
select jsonb_array_elements(b)->>'identification' from (
select jsonb_array_elements((new.doc->'buildings'))->'deposits'
) _(b)
) __(c);
select into part_d string_agg(coalesce(d, ''), ' ')
from (
select jsonb_array_elements(c)->>'sample_id'
from (
select jsonb_array_elements(b)->'audits' from (
select jsonb_array_elements((new.doc->'buildings'))->'deposits'
) _(b)
) __(c)
) ___(d);
new.tsvector := setweight(to_tsvector(dic, part_a), 'A')
|| setweight(to_tsvector(dic, part_b), 'B')
|| setweight(to_tsvector(dic, part_c), 'C')
|| setweight(to_tsvector(dic, part_d), 'D')
;
return new;
end;
$$ language plpgsql immutable;
create trigger t_sites_tsvector
before insert or update on sites for each row execute procedure t_sites_tsvector();
^^ -- scroll it, this snippet is bigger than it looks (especially of you have MacOS w/o scrollbars...)
Now let's create GIN index to speedup search queries (makes sense if you have many rows -- say, more than hundreds or thousands):
create index i_sites_fulltext on sites using gin(tsvector);
And now we insert something to check:
insert into sites select 1, '{
"_id": "123",
"type": "Site",
"identification": "Custom ID",
"title": "SITE 1",
"address": "UK, London, Mr Tom''s street, 2",
"buildings": [
{
"uuid": "12312",
"identification": "Custom ID",
"name": "BUILDING 1",
"deposits": [
{
"uuid": "12312",
"identification": "Custom ID",
"audits": [
{
"uuid": "12312",
"sample_id": "SAMPLE ID"
}
]
}
]
}
]
}'::jsonb;
Check with select * from sites;
– you must see that tsvector
column is filled with some data.
Now let's query it:
select * from sites where tsvector @@ to_tsquery('simple', 'sample');
-- it must return our record. In this case, we search for 'sample'
word and we don't care in which group it will be found.
Let's change it and try to search only in group A ("SITE (identification, title, address)" as you described it):
select * from sites where tsvector @@ to_tsquery('simple', 'sample:A');
-- this must return nothing because word 'sample'
sits only in group D ("AUDIT (sample_id)"). Indeed:
select * from sites where tsvector @@ to_tsquery('simple', 'sample:D');
-- will again return us our record.
Notice, that you need to use to_tsquery(..)
, not plainto_tsquery(..)
to be able to address 4 groups. So you need to sanitize your input yourself (avoid using or remove special characters like &
and |
because they have special meaning in tsquery
values).
And the good news is that you can combine different groups in a single query, like this:
select * from sites where tsvector @@ to_tsquery('simple', 'sample:D & london:A');
The other way to go (e.g. if you have to work with more than 4 groups) is having multiple tsvectors, each one sitting in a separate column, build them using single query, create index (you can create single index on multiple tsvector
columns) and query addressing separate columns. It's similar to what I explained above, but perhaps less efficient.
Hope this helps.
Things seem a little simpler in Postgres 10, as the to_tsvector function supports json. So for example this works nicely:
UPDATE dataset SET search_vector = to_tsvector('english',
'{
"abstract":"Abstract goes here",
"useConstraints":"None",
"dataQuality":"Good",
"Keyword":"historic",
"topicCategory":"Environment",
"responsibleOrganisation":"HES"
}'::json)
where dataset_id = 4;
Note I haven't tried this on a deeply nested structure, but don't see why it wouldn't work
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With