
Django Full Text Search Optimization - Postgres

I am trying to build a full-text search for an address autocomplete feature, using Django (v2.1) and Postgres (9.5), but the performance is currently not suitable for an autocomplete, and I don't understand the logic behind the performance results I'm getting. For info, the table is quite sizeable, with 14 million rows.

My model:

from django.db import models
from postgres_copy import CopyManager
from django.contrib.postgres.indexes import GinIndex

class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)

    objects = CopyManager()

    class Meta:
        indexes = [
            GinIndex(fields=['number', 'street', 'unit', 'city', 'region', 'postcode'], name='search_idx')
        ]

I created a little test to check performance based on the number of words in the search:

import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector

search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode')

searchtext1 = "north"
searchtext2 = "north bondi"
searchtext3 = "north bondi blair"
searchtext4 = "north bondi blair street 2026"

print('Test1: 1 word')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext1)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

#print(AddressesAustralia.objects.annotate(search=search_vector).explain(verbose=True))

print('Test2: 2 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext2)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

print('Test3: 3 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext3)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

print('Test4: 5 words')
start_time = time.time()
result = AddressesAustralia.objects.annotate(search=search_vector).filter(search=searchtext4)[:10]
#print(len(result))
time_exec = str(timedelta(seconds=time.time() - start_time))
print(time_exec)
print(' ')

I get the following results, which look quite reasonable:

Test1: 1 word
0:00:00.001841

Test2: 2 words
0:00:00.001422

Test3: 3 words
0:00:00.001574

Test4: 5 words
0:00:00.001360

However, if I uncomment the print(len(result)) lines, I get the following results:

Test1: 1 word
10
0:00:00.046392

Test2: 2 words
10
0:00:06.544732

Test3: 3 words
10
0:01:12.367157

Test4: 5 words
10
0:01:17.786596

This is obviously not suitable for an autocomplete feature.

Could someone explain why the execution takes longer when an operation is performed on the queryset result? The database retrieval always seems fast, but then going through the results takes time, which does not make sense to me: since I'm limiting the results to 10, the returned queryset is always the same size.

Also, although I have created a GIN index, it does not seem to be used. The index appears to have been created correctly:

=# \d public_data_au_addresses
                                Table "public.public_data_au_addresses"
   Column    |           Type           | Collation | Nullable |                       Default
-------------+--------------------------+-----------+----------+------------------------------------------------------
 id          | integer                  |           | not null | nextval('public_data_au_addresses_id_seq'::regclass)
 date_update | timestamp with time zone |           |          |
 longitude   | numeric(9,6)             |           |          |
 latitude    | numeric(9,6)             |           |          |
 number      | character varying(16)    |           |          |
 street      | character varying(60)    |           |          |
 unit        | character varying(50)    |           |          |
 city        | character varying(50)    |           |          |
 district    | character varying(10)    |           |          |
 region      | character varying(5)     |           |          |
 postcode    | character varying(5)     |           |          |
 addr_id     | character varying(20)    |           | not null |
 addr_hash   | character varying(20)    |           | not null |
Indexes:
    "public_data_au_addresses_pkey" PRIMARY KEY, btree (id)
    "public_data_au_addresses_addr_hash_key" UNIQUE CONSTRAINT, btree (addr_hash)
    "public_data_au_addresses_addr_id_key" UNIQUE CONSTRAINT, btree (addr_id)
    "public_data_au_addresses_addr_hash_e8c67a89_like" btree (addr_hash varchar_pattern_ops)
    "public_data_au_addresses_addr_id_9ee00c76_like" btree (addr_id varchar_pattern_ops)
    "search_idx" gin (number, street, unit, city, region, postcode)

When I run the explain() method on my query, I get this:

Test1: 1 word
Limit  (cost=0.00..1110.60 rows=10 width=140)
  ->  Seq Scan on public_data_au_addresses  (cost=0.00..8081472.41 rows=72767 width=140)
        Filter: (to_tsvector((((((((((((COALESCE(number, ''::character varying))::text || ' '::text) || (COALESCE(street, ''::character varying))::text) || ' '::text) || (COALESCE(unit, ''::character varying))::text) || ' '::text) || (COALESCE(city, ''::character varying))::text) || ' '::text) || (COALESCE(region, ''::character varying))::text) || ' '::text) || (COALESCE(postcode, ''::character varying))::text)) @@ plainto_tsquery('north'::text))

So it still shows a sequential scan instead of an index scan. Does anyone know how to fix that, or at least how to debug it?

Would a GIN index still be efficient with so many fields to search on anyway?

And lastly, does anyone know how I can change the code to improve performance further?

Thank you! Regards

Update

I have tried creating a search vector as suggested by Paolo below, but the search still seems to be sequential rather than leveraging the GIN index.

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode', config='english'
        ))


class AddressesAustralia(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6 , null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]

I then updated the search_vector field using the new queryset method:

AddressesAustralia.objects.update_search_vector()

Then I ran a query to test with the same search vector:

import time
from datetime import timedelta

from django.contrib.postgres.search import SearchVector
from django.core.management.base import BaseCommand


class Command(BaseCommand):

    def handle(self, *args, **options):

        search_vector = SearchVector('number', 'street', 'unit', 'city', 'region', 'postcode', config='english')

        searchtext1 = "north"

        print('Test1: 1 word')
        start_time = time.time()
        result = AddressesAustralia.objects.filter(search_vector=searchtext1)[:10].explain(verbose=True)
        # explain() returns the plan as a string, so len(result) below is
        # the length of that string, not a row count
        print(len(result))
        print(result)
        time_exec = str(timedelta(seconds=time.time() - start_time))
        print(time_exec)

And I get the following results, still showing a sequential scan:

Test1: 1 word
532
Limit  (cost=0.00..120.89 rows=10 width=235)
  Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
  ->  Seq Scan on public.public_data_au_addressesaustralia  (cost=0.00..5061078.91 rows=418651 width=235)
        Output: id, date_update, longitude, latitude, number, street, unit, city, district, region, postcode, addr_id, addr_hash, search_vector
        Filter: (public_data_au_addressesaustralia.search_vector @@ plainto_tsquery('north'::text))
0:00:00.075262

I have also tried:

  • With and without config='english' in the search vector (both in the update and in the query)

  • Deleting the GIN index, recreating it, and then re-running update_search_vector()

But I still get the same results. Any idea what I am doing wrong or how I can troubleshoot further?

Asked Dec 16 '18 by shitzuu

1 Answer

As @knbk already suggested, to improve performance you should read the Full-text search Performance section in the Django documentation:

"If this approach becomes too slow, you can add a SearchVectorField to your model."

Note also that your original GIN index was built directly on the varchar columns, while the generated query filters on to_tsvector() applied to their concatenation; Postgres can only use an index whose indexed expression matches the query, so that index could never serve this search. Instead, you can add a search vector field to your model, with a related GIN index, and a queryset with a new method to update the field:

from django.contrib.postgres.indexes import GinIndex
from django.contrib.postgres.search import SearchVector, SearchVectorField
from django.db import models
from postgres_copy import CopyQuerySet


class AddressesQuerySet(CopyQuerySet):

    def update_search_vector(self):
        return self.update(search_vector=SearchVector(
            'number', 'street', 'unit', 'city', 'region', 'postcode'
        ))


class Addresses(models.Model):
    date_update = models.DateTimeField(auto_now=True, null=True)
    longitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    latitude = models.DecimalField(max_digits=9, decimal_places=6, null=True)
    number = models.CharField(max_length=16, null=True, default='')
    street = models.CharField(max_length=60, null=True, default='')
    unit = models.CharField(max_length=50, null=True, default='')
    city = models.CharField(max_length=50, null=True, default='')
    district = models.CharField(max_length=10, null=True, default='')
    region = models.CharField(max_length=5, null=True, default='')
    postcode = models.CharField(max_length=5, null=True, default='')
    addr_id = models.CharField(max_length=20, unique=True)
    addr_hash = models.CharField(max_length=20, unique=True)
    search_vector = SearchVectorField(null=True, editable=False)

    objects = AddressesQuerySet.as_manager()

    class Meta:
        indexes = [
            GinIndex(fields=['search_vector'], name='search_vector_idx')
        ]

You can update your new search vector field using the new queryset method:

>>> Addresses.objects.update_search_vector()
UPDATE "addresses_addresses"
SET "search_vector" = to_tsvector(
  COALESCE("addresses_addresses"."number", '') || ' ' ||
  COALESCE("addresses_addresses"."street", '') || ' ' ||
  COALESCE("addresses_addresses"."unit", '') || ' ' ||
  COALESCE("addresses_addresses"."city", '') || ' ' ||
  COALESCE("addresses_addresses"."region", '') || ' ' ||
  COALESCE("addresses_addresses"."postcode", '')
)
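This refreshes the whole table; keeping search_vector in sync as individual rows change is left open. One possible approach (my addition, not part of the original answer, and assuming all writes go through the ORM) is to reuse the queryset method from save():

class Addresses(models.Model):
    # ... same fields, manager and Meta as above ...

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # Recompute the vector for this row only: a single
        # UPDATE ... WHERE id = <pk> via the queryset method.
        type(self).objects.filter(pk=self.pk).update_search_vector()

Keep in mind that bulk operations such as queryset.update() bypass save(), and that each save() costs an extra UPDATE; a database trigger, which the Django documentation mentions as a way to keep a SearchVectorField populated, avoids both issues.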

If you now execute a query and read the explain output, you can see your GIN index being used:

>>> print(Addresses.objects.filter(search_vector='north').values('id').explain(verbose=True))
EXPLAIN (VERBOSE true)
SELECT "addresses_addresses"."id"
FROM "addresses_addresses"
WHERE "addresses_addresses"."search_vector" @@ (plainto_tsquery('north')) = true [0.80ms]
Bitmap Heap Scan on public.addresses_addresses  (cost=12.25..16.52 rows=1 width=4)
  Output: id
  Recheck Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
  ->  Bitmap Index Scan on search_vector_idx  (cost=0.00..12.25 rows=1 width=0)
        Index Cond: (addresses_addresses.search_vector @@ plainto_tsquery('north'::text))
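
For an autocomplete you will usually also want the best matches first. A possible refinement (my addition, using Django's documented SearchQuery and SearchRank, not something from the original answer):

from django.contrib.postgres.search import SearchQuery, SearchRank
from django.db.models import F

query = SearchQuery('north bondi')
results = (
    Addresses.objects
    .annotate(rank=SearchRank(F('search_vector'), query))  # rank against the stored vector
    .filter(search_vector=query)  # still served by the GIN index
    .order_by('-rank')[:10]
)

Because the ranking reuses the precomputed search_vector column, no per-row to_tsvector() call is needed at query time.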

If you want to dig deeper, you can read an article I wrote on the subject:

"Full-Text Search in Django with PostgreSQL"

Update

I tried executing the SQL generated by the Django ORM: http://sqlfiddle.com/#!17/f9aa9/1
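
If you need to inspect the plan on your own data, QuerySet.explain() also accepts backend-specific options (my addition; note that analyze=True actually executes the query, so keep the slice small):

# Request the actual execution plan with run-time measurements.
plan = (
    Addresses.objects
    .filter(search_vector='north')
    .values('id')[:10]
    .explain(verbose=True, analyze=True)
)
print(plan)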

Answered Nov 15 '22 by Paolo Melchiorre