Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to implement full text search on complex nested JSONB in Postgresql

I have pretty complex JSONB stored in one jsonb column.

DB table looks like:

 CREATE TABLE sites (
   id text NOT NULL,
   doc jsonb,
   PRIMARY KEY (id)
 )

Data we are storing in doc column is a complex nested JSONB data:

   {
      "_id": "123",
      "type": "Site",
      "identification": "Custom ID",
      "title": "SITE 1",
      "address": "UK, London, Mr Tom's street, 2",
      "buildings": [
          {
               "uuid": "12312",
               "identification": "Custom ID",
               "name": "BUILDING 1",
               "deposits": [
                   {
                      "uuid": "12312",
                      "identification": "Custom ID",             
                      "audits": [
                          {
                             "uuid": "12312",         
                              "sample_id": "SAMPLE ID"                
                          }
                       ]
                   }
               ]
          } 
       ]
    }

So structure of my JSONB looks like:

SITE 
  -> ARRAY OF BUILDINGS
     -> ARRAY OF DEPOSITS
       -> ARRAY OF AUDITS

We need to implement full text search by some values in each of type of entry:

SITE (identification, title, address)
BUILDING (identification, name)
DEPOSIT (identification)
AUDIT (sample_id)

SQL query should run a full text search in these field values only.

I guess need to use GIN indexes and something like tsvector, but do not have enough Postgresql background.

So, my question is it possible to index and then query such nested JSONB structures?

like image 446
rusllonrails Avatar asked Aug 14 '17 18:08

rusllonrails


People also ask

How do I use full text search in PostgreSQL?

In PostgreSQL, you use two functions to perform Full Text Search. They are to_tsvector() and to_tsquery(). Let's see how they work and to use them first. to_tsvector() function breaks up the input string and creates tokens out of it, which is then used to perform Full Text Search using the to_tsquery() function.

How do I search a Jsonb column in PostgreSQL?

Querying the JSON document PostgreSQL has two native operators -> and ->> to query JSON documents. The first operator -> returns a JSON object, while the operator ->> returns text. These operators work on both JSON as well as JSONB columns. There are additional operators available for JSONB columns.

What is FTS in PostgreSQL?

Introduction. Full-text search (FTS) is a technique used by search engines to find results in a database. It can be used to power search results on websites like shops, search engines, newspapers, and more.

Can you index Jsonb Postgres?

JSONB and Indexes PostgreSQL can use indexes for the text results as compare operands. GIN index can be used by the GIN JSONB operator class.


2 Answers

Let's add new column of tsvector type:

alter table sites add column tsvector tsvector;

Now let's create a trigger that will collect lexems, organize them and put to our tsvector. We will use 4 groups (A, B, C, D) -- this a special tsvector's feature that allows you to distinguish lexems later, at search time (see examples in manual https://www.postgresql.org/docs/current/static/textsearch-controls.html; unfortunately, this feature support only up to 4 groups becase developers reserved only 2 bits for that, but we are lucky here, we need only 4 groups):

create or replace function t_sites_tsvector() returns trigger as $$
declare
  dic regconfig;
  part_a text;
  part_b text;
  part_c text;
  part_d text;
begin
  dic := 'simple'; -- change if you need more advanced word processing (stemming, etc)

  part_a := coalesce(new.doc->>'identification', '') || ' ' || coalesce(new.doc->>'title', '') || ' ' || coalesce(new.doc->>'address', '');

  select into part_b string_agg(coalesce(a, ''), ' ') || ' ' || string_agg(coalesce(b, ''), ' ')
  from (
    select 
      jsonb_array_elements((new.doc->'buildings'))->>'identification',
      jsonb_array_elements((new.doc->'buildings'))->>'name'
  ) _(a, b);

  select into part_c string_agg(coalesce(c, ''), ' ')
  from (
    select jsonb_array_elements(b)->>'identification' from (
      select jsonb_array_elements((new.doc->'buildings'))->'deposits'
    ) _(b)
  ) __(c);

  select into part_d string_agg(coalesce(d, ''), ' ')
  from (
    select jsonb_array_elements(c)->>'sample_id'
    from (
      select jsonb_array_elements(b)->'audits' from (
        select jsonb_array_elements((new.doc->'buildings'))->'deposits'
      ) _(b)
    ) __(c)
  ) ___(d);

  new.tsvector := setweight(to_tsvector(dic, part_a), 'A')
    || setweight(to_tsvector(dic, part_b), 'B')
    || setweight(to_tsvector(dic, part_c), 'C')
    || setweight(to_tsvector(dic, part_d), 'D')
  ;
  return new;
end;
$$ language plpgsql immutable;

create trigger t_sites_tsvector
  before insert or update on sites for each row execute procedure t_sites_tsvector();

^^ -- scroll it, this snippet is bigger than it looks (especially of you have MacOS w/o scrollbars...)

Now let's create GIN index to speedup search queries (makes sense if you have many rows -- say, more than hundreds or thousands):

create index i_sites_fulltext on sites using gin(tsvector);

And now we insert something to check:

insert into sites select 1, '{
      "_id": "123",
      "type": "Site",
      "identification": "Custom ID",
      "title": "SITE 1",
      "address": "UK, London, Mr Tom''s street, 2",
      "buildings": [
          {
               "uuid": "12312",
               "identification": "Custom ID",
               "name": "BUILDING 1",
               "deposits": [
                   {
                      "uuid": "12312",
                      "identification": "Custom ID",
                      "audits": [
                          {
                             "uuid": "12312",
                              "sample_id": "SAMPLE ID"
                          }
                       ]
                   }
               ]
          }
       ]
    }'::jsonb;

Check with select * from sites; – you must see that tsvector column is filled with some data.

Now let's query it:

select * from sites where tsvector @@ to_tsquery('simple', 'sample');

-- it must return our record. In this case, we search for 'sample' word and we don't care in which group it will be found.

Let's change it and try to search only in group A ("SITE (identification, title, address)" as you described it):

select * from sites where tsvector @@ to_tsquery('simple', 'sample:A');

-- this must return nothing because word 'sample' sits only in group D ("AUDIT (sample_id)"). Indeed:

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D');

-- will again return us our record.

Notice, that you need to use to_tsquery(..), not plainto_tsquery(..) to be able to address 4 groups. So you need to sanitize your input yourself (avoid using or remove special characters like & and | because they have special meaning in tsquery values).

And the good news is that you can combine different groups in a single query, like this:

select * from sites where tsvector @@ to_tsquery('simple', 'sample:D & london:A');

The other way to go (e.g. if you have to work with more than 4 groups) is having multiple tsvectors, each one sitting in a separate column, build them using single query, create index (you can create single index on multiple tsvector columns) and query addressing separate columns. It's similar to what I explained above, but perhaps less efficient.

Hope this helps.

like image 123
Nick Avatar answered Sep 29 '22 10:09

Nick


Things seem a little simpler in Postgres 10, as the to_tsvector function supports json. So for example this works nicely:

UPDATE dataset SET search_vector = to_tsvector('english',
'{
  "abstract":"Abstract goes here",
  "useConstraints":"None",
  "dataQuality":"Good",
  "Keyword":"historic",
  "topicCategory":"Environment",
  "responsibleOrganisation":"HES"
}'::json)
where dataset_id = 4;

Note I haven't tried this on a deeply nested structure, but don't see why it wouldn't work

like image 30
Iain Hunter Avatar answered Sep 29 '22 11:09

Iain Hunter