
Search in 300 million addresses with pg_trgm

I have 300 million addresses in a PostgreSQL 9.3 database and want to use pg_trgm to fuzzy-search the rows. The final goal is to implement a search function much like the one in Google Maps.

When I use pg_trgm to search these addresses, it takes about 30 seconds to get results. Many rows match the default similarity threshold of 0.3, but I only need about 5 or 10 results. I created a trigram GiST index:

CREATE INDEX addresses_trgm_index ON addresses USING gist (address gist_trgm_ops);
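For context, pg_trgm is installed as an extension and the % operator compares against an adjustable threshold; a minimal sketch of that setup, using the module's stock set_limit()/show_limit() functions (on 9.3 the threshold is set per session this way, since the pg_trgm.similarity_threshold setting only exists in later versions):

CREATE EXTENSION IF NOT EXISTS pg_trgm;  -- provides %, similarity(), gist_trgm_ops

SELECT show_limit();     -- current threshold for the % operator, 0.3 by default
SELECT set_limit(0.5);   -- raise it so % passes fewer, closer matches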

This is my query:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM addresses 
WHERE address % '981 maun st' 
ORDER BY sml DESC 
LIMIT 10;

The test table in the production environment has since been removed, so I show the EXPLAIN output from my test environment. That table holds about 7 million rows and needs about 1.6 seconds to return results; with 300 million rows it takes more than 30 seconds.

ebdb=> explain analyse select address, similarity(address, '781 maun st') as sml from addresses where address % '781 maun st' order by sml desc limit 10;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=7615.83..7615.86 rows=10 width=16) (actual time=1661.004..1661.010 rows=10 loops=1)
   ->  Sort  (cost=7615.83..7634.00 rows=7268 width=16) (actual time=1661.003..1661.005 rows=10 loops=1)
         Sort Key: (similarity((address)::text, '781 maun st'::text))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Scan using addresses_trgm_index on addresses  (cost=0.41..7458.78 rows=7268 width=16) (actual time=0.659..1656.386 rows=5241 loops=1)
               Index Cond: ((address)::text % '781 maun st'::text)
 Total runtime: 1661.066 ms
(7 rows)

Is there a good way to improve the performance, or would table partitioning be a good idea?

asked Jun 27 '17 by Gary Tao


1 Answer

PostgreSQL 9.3 ... Is there a good way to improve the performance, or would table partitioning be a good idea?

Table partitioning will not help at all.

But yes, there is a good way: upgrade to a current version of Postgres. There have been many improvements to GiST indexes, to the pg_trgm module in particular, and for big data in general. It should be substantially faster with Postgres 10.
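If it helps, both the server version and the installed pg_trgm version can be checked against the standard catalogs before planning the upgrade; a quick sketch:

SELECT version();             -- server version string

SELECT extversion
FROM   pg_extension
WHERE  extname = 'pg_trgm';   -- installed module version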

Your "nearest neighbor" search looks correct but for a small LIMIT use this equivalent query instead:

SELECT address, similarity(address, '981 maun st') AS sml 
FROM   addresses 
WHERE  address % '981 maun st' 
ORDER  BY address <-> '981 maun st'
LIMIT  10;

Quoting the manual:

It will usually beat the first formulation when only a small number of the closest matches is wanted.
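For the record, <-> is pg_trgm's trigram distance operator, defined as 1 - similarity(), which is what lets the GiST index return rows in best-match order and stop as soon as the LIMIT is satisfied. A quick sanity check of that equivalence (the cast pins the operator to the text variant):

SELECT '981 maun st'::text <-> '981 main st'        AS distance,
       1 - similarity('981 maun st', '981 main st') AS one_minus_sml;  -- same value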

answered Sep 20 '22 by Erwin Brandstetter