Are Postgres JSON indexes efficient enough compared with classic normalized tables?

Recent PostgreSQL versions have introduced various features for JSON content, but I'm unsure whether I really should use them: there is no established "best practice" yet for what works and what doesn't, or at least I can't find one.

Here is a specific example: I have a table of objects which, among other things, contains a list of alternate names for each object. All that data will also be included in a JSON column for retrieval purposes. For example (skipping the other, irrelevant fields):

create table stuff (id serial primary key, data json);
insert into stuff (data) values ('{"AltNames":["Name1","Name2","Name3"]}');

I will need some queries in the form "list all objects where one of altnames is 'foobar'." The expected table size is on the order of a few million records. Postgres JSON queries can be used for that, and it can also be indexed (Index for finding an element in a JSON array, for example). However, SHOULD it be done that way or is it a perverse workaround that's not recommended?
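
For illustration, here is a minimal sketch of what such a query could look like, assuming the column were declared as jsonb (added in PostgreSQL 9.4, after this question was asked) so that a GIN index can cover it directly; the table name and 'foobar' are placeholders:

-- Sketch only: assumes a jsonb column instead of json.
create table stuff_jsonb (id serial primary key, data jsonb);
create index ix_stuff_data on stuff_jsonb using gin (data);

-- Rows whose AltNames array contains 'foobar'; can use the GIN index.
select id from stuff_jsonb where data @> '{"AltNames": ["foobar"]}';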

The classic alternative, of course, is to add a separate table for that one-to-many relation, containing the name and a foreign key to the main table; the performance of that is well understood. However, it has its own disadvantages: it means either duplicating data between that table and the JSON (with a possible integrity risk), or building the JSON return data dynamically on every request, which carries its own performance penalty.
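
For concreteness, that classic alternative might look like the following sketch (the child table and index names are made up):

-- One-to-many side table; names are illustrative.
create table stuff_altnames (
    stuff_id integer not null references stuff(id),
    name     text not null
);
create index ix_stuff_altnames_name on stuff_altnames (name);

-- "List all objects where one of the alternate names is 'foobar'":
select s.*
from stuff s
join stuff_altnames a on a.stuff_id = s.id
where a.name = 'foobar';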

Asked Sep 14 '13 by Peteris



2 Answers

I will need some queries in the form "list all objects where one of altnames is 'foobar'." The expected table size is on the order of a few million records. Postgres JSON queries can be used for that, and it can also be indexed (Index for finding an element in a JSON array, for example). However, SHOULD it be done that way or is it a perverse workaround that's not recommended?

It can be done that way, but that doesn't mean you should. In some sense, the best practice is already well documented (see e.g. the discussions of using hstore vs using XML vs using EAV vs using a separate table): this is a new datatype which, for all practical intents and purposes (validation and syntax aside), is no different from the prior unstructured or semi-structured options.

Put another way, it's the same old pig with new makeup.

JSON offers the ability to use inverted search tree indexes, in the same way as hstore, array types and tsvectors do. They work fine, but keep in mind that they're primarily designed for extracting points in a neighborhood (think geometry types) ordered by distance, rather than for extracting a list of values in lexicographical order.
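
To make that concrete, the same kind of inverted (GIN) index works over a plain array column; a sketch with made-up names:

-- GIN indexes arrays, hstore, tsvector and json-derived values alike.
create table tags_demo (id serial primary key, tags text[]);
create index ix_tags_demo on tags_demo using gin (tags);

-- Typically answered with a bitmap index scan, as described below.
select id from tags_demo where tags @> array['foobar'];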

To illustrate, take the two plans that Roman's answer outlines:

  • The one that does an index scan plows through disk pages directly, retrieving the rows in the order indicated by the index.
  • The one that does a bitmap index scan starts by identifying every disk page that might contain a row, and reads them in the order they appear on disk, as if it were (in fact, precisely like) doing a sequential scan that skips useless areas.

Getting back to your question: Cluttered and oversized inverted tree indexes will indeed improve the performance of your app if you use Postgres tables as giant JSON stores. But they're not a silver bullet either, and they won't get you as far as proper relational design when dealing with bottlenecks.

The bottom line, in the end, is no different from what you'd get when deciding to use hstore or an EAV:

  1. If it needs an index (i.e. it frequently appears in a where clause or, even more importantly, in a join clause), you likely want the data in a separate field; one way to keep such a field in sync with the JSON copy is sketched below this list.
  2. If it's primarily cosmetic, JSON/hstore/EAV/XML/whatever-makes-you-sleep-at-night works fine.
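
Applied to the question, one way to satisfy point 1 while still keeping the JSON copy for retrieval is to maintain the side table from a trigger. A hedged sketch, assuming PostgreSQL 9.4+ for json_array_elements_text and reusing the stuff/stuff_altnames tables sketched in the question above (all names here are illustrative):

-- Hypothetical sync trigger: rebuild the side-table rows for this object
-- from the JSON column on every insert or update.
create or replace function sync_altnames() returns trigger as $$
begin
    delete from stuff_altnames where stuff_id = new.id;
    insert into stuff_altnames (stuff_id, name)
    select new.id, json_array_elements_text(new.data -> 'AltNames');
    return new;
end;
$$ language plpgsql;

create trigger trg_stuff_altnames
after insert or update on stuff
for each row execute procedure sync_altnames();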
Answered Sep 28 '22 by Denis de Bernardy


I'd say it's worth a try. I created a test (100,000 records, ~10 elements in each JSON array) and checked how it works:

create table test1 (id serial primary key, data json);
create table test1_altnames (id int, name text);

create or replace function array_from_json(_j json) returns text[] as
$func$
    select array_agg(x.elem::text)
    from json_array_elements(_j) as x(elem)
$func$ language sql immutable;

with cte as (
    select
        (random() * 100000)::int as grp,
        (random() * 1000000)::int as name
    from generate_series(1, 1000000)
), cte2 as (
    select array_agg(name) as "AltNames"
    from cte
    group by grp
)
insert into test1 (data)
select row_to_json(t) from cte2 as t;

insert into test1_altnames (id, name)
select id, json_array_elements(data->'AltNames')::text from test1;

create index ix_test1 on test1 using gin (array_from_json(data->'AltNames'));
create index ix_test1_altnames on test1_altnames (name);

Query JSON (30ms on my machine):

select * from test1 where '{489147}' <@ array_from_json(data->'AltNames');

"Bitmap Heap Scan on test1  (cost=224.13..1551.41 rows=500 width=36)"
"  Recheck Cond: ('{489147}'::text[] <@ array_from_json((data -> 'AltNames'::text)))"
"  ->  Bitmap Index Scan on ix_test1  (cost=0.00..224.00 rows=500 width=0)"
"        Index Cond: ('{489147}'::text[] <@ array_from_json((data -> 'AltNames'::text)))"

Query table with names (15ms on my machine):

select * from test1 as t
where t.id in (select tt.id from test1_altnames as tt where tt.name = '489147');

"Nested Loop  (cost=12.76..20.80 rows=2 width=36)"
"  ->  HashAggregate  (cost=12.46..12.47 rows=1 width=4)"
"        ->  Index Scan using ix_test1_altnames on test1_altnames tt  (cost=0.42..12.46 rows=2 width=4)"
"              Index Cond: (name = '489147'::text)"
"  ->  Index Scan using test1_pkey on test1 t  (cost=0.29..8.31 rows=1 width=36)"
"        Index Cond: (id = tt.id)"

I should also note that there is some cost to inserting/deleting rows in the table with names (test1_altnames), so it's a bit more involved than just selecting rows. Personally, I like the JSON solution.

Answered Sep 28 '22 by Roman Pekar