Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Postgresql index on xpath expression gives no speed up

We are trying to create OEBS-analog functionality in Postgresql. Let's say we have a form constructor and need to store form results in database (e.g. email bodies). In Oracle you could use a table with 150~ columns (and some mapping stored elsewhere) to store each field in separate column. But in contrast to Oracle we would like to store all the form in postgresql xml field. The example of the tree is

 <row xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
   <object_id>2</object_id>  
   <pack_form_id>23</pack_form_id>  
   <prod_form_id>34</prod_form_id>
 </row>

We would like to search through this field. Test table contains 400k rows and the following select executes in 90 seconds:

select * 
from params 
where (xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int=34

So I created this index:

create index prod_form_idx 
ON params using btree(
   ((xpath('//prod_form_id/text()'::text, xmlvalue))[1]::text::int)
);

And it made no difference. Still 90 seconds execution. EXPLAIN plan show this:

Bitmap Heap Scan on params  (cost=40.29..6366.44 rows=2063 width=292)
  Recheck Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)
  ->  Bitmap Index Scan on prod_form_idx  (cost=0.00..39.78 rows=2063 width=0)
        Index Cond: ((((xpath('//prod_form_id/text()'::text, xmlvalue, '{}'::text[]))[1])::text)::integer = 34)

I am not the great plan interpreter so I suppose this means that index is being used. The question is: where's all the speed? And what can i do in order to optimize this kind of queries?

like image 655
gorodechnyj Avatar asked Jun 28 '11 09:06

gorodechnyj


1 Answers

Well, at least the index is used. You get a bitmap index scan instead of a normal index scan though, which means the xpath() function will be called lots of times.

Let's do a little check :

CREATE TABLE foo ( id serial primary key, x xml, h hstore );
insert into foo (x,h) select XMLPARSE( CONTENT '<row  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">  
   <object_id>2</object_id>  
   <pack_form_id>' || n || '</pack_form_id>  
   <prod_form_id>34</prod_form_id>
 </row>' ), 
('object_id=>2,prod_form_id=>34,pack_form_id=>'||n)::hstore 
FROM generate_series( 1,100000 ) n;

test=> EXPLAIN ANALYZE SELECT count(*) FROM foo;
                                                   QUERY PLAN                                                    
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=4821.00..4821.01 rows=1 width=0) (actual time=24.694..24.694 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..4571.00 rows=100000 width=0) (actual time=0.006..13.996 rows=100000 loops=1)
 Total runtime: 24.730 ms

test=> explain analyze select * from foo where (h->'pack_form_id')='123';
                                             QUERY PLAN                                             
----------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..5571.00 rows=500 width=68) (actual time=0.075..48.763 rows=1 loops=1)
   Filter: ((h -> 'pack_form_id'::text) = '123'::text)
 Total runtime: 36.808 ms

test=> explain analyze select * from foo where ((xpath('//pack_form_id/text()'::text, x))[1]::text) = '123';
                                              QUERY PLAN                                              
------------------------------------------------------------------------------------------------------
 Seq Scan on foo  (cost=0.00..5071.00 rows=500 width=68) (actual time=4.271..3368.838 rows=1 loops=1)
   Filter: (((xpath('//pack_form_id/text()'::text, x, '{}'::text[]))[1])::text = '123'::text)
 Total runtime: 3368.865 ms

As we can see,

  • scanning the whole table with count(*) takes 25 ms
  • extracting one key/value from a hstore adds a small extra cost, about 0.12 µs/row
  • extracting one key/value from a xml using xpath adds a huge cost, about 33 µs/row

Conclusions :

  • xml is slow (but everyone knows that)
  • if you want to put a flexible key/value store in a column, use hstore

Also since your xml data is pretty big it will be toasted (compressed and stored out of the main table). This makes the rows in the main table much smaller, hence more rows per page, which reduces the efficiency of bitmap scans since all rows on a page have to be rechecked.

You can fix this though. For some reason the xpath() function (which is very slow, since it handles xml) has the same cost (1 unit) as say, the integer operator "+"...

update pg_proc set procost=1000 where proname='xpath';

You may need to tweak the cost value. When given the right info, the planner knows xpath is slow and will avoid a bitmap index scan, using an index scan instead, which doesn't need rechecking the condition for all rows on a page.

Note that this does not solve the row estimates problem. Since you can't ANALYZE the inside of the xml (or hstore) you get default estimates for the number of rows (here, 500). So, the planner may be completely wrong and choose a catastrophic plan if some joins are involved. The only solution to this is to use proper columns.

like image 161
bobflux Avatar answered Oct 29 '22 10:10

bobflux