
Improving query speed: simple SELECT with LIKE

I have inherited a large legacy codebase which runs on Django 1.5, and my current task is to speed up a section of the site which takes ~1 min to load.

I did a profile of the app and got this:

[profiler screenshot]

The culprit in particular is the following query (stripped for brevity):

SELECT COUNT(*) FROM "entities_entity" WHERE (
  "entities_entity"."date_filed" <= '2016-01-21' AND (
    UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
    UPPER("entities_entity"."entity_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
    -- 34 more of these
    UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Atherton%') OR
    UPPER("entities_entity"."agent_city_state_zip"::text) LIKE UPPER('%Berkeley%') OR
    -- 34 more of these
  )
)

It basically consists of one big LIKE query on two fields, entity_city_state_zip and agent_city_state_zip, both of which are character varying(200) not null.

That query is performed twice (!), taking 18814.02ms each time, and once more with the COUNT replaced by a SELECT, taking an extra 20216.49ms (I'm going to cache the result of the COUNT).

The explain looks like this:

Aggregate  (cost=175867.33..175867.34 rows=1 width=0) (actual time=17841.502..17841.502 rows=1 loops=1)
  ->  Seq Scan on entities_entity  (cost=0.00..175858.95 rows=3351 width=0) (actual time=0.849..17818.551 rows=145075 loops=1)
        Filter: ((date_filed <= '2016-01-21'::date) AND ((upper((entity_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((entity_city_state_zip)::text) ~~ '%BERKELEY%'::text) (..skipped..) OR (upper((agent_city_state_zip)::text) ~~ '%ATHERTON%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BERKELEY%'::text) OR (upper((agent_city_state_zip)::text) ~~ '%BURLINGAME%'::text) ))
        Rows Removed by Filter: 310249
Planning time: 2.110 ms
Execution time: 17841.944 ms

I've tried using an index on entity_city_state_zip and agent_city_state_zip using various combinations like:

CREATE INDEX ON entities_entity (upper(entity_city_state_zip));
CREATE INDEX ON entities_entity (upper(agent_city_state_zip));

or using varchar_pattern_ops, with no luck.

The server is using something like this:

qs = queryset.filter(Q(entity_city_state_zip__icontains = all_city_list) |
                     Q(agent_city_state_zip__icontains = all_city_list))

to generate that query.

I don't know what else to try.

Thanks!

nicosantangelo asked Oct 30 '22 11:10


2 Answers

I think the problem is in the multiple LIKEs and in the UPPER("entities_entity ... calls.

You can use:

WHERE entities_entity.entity_city_state_zip SIMILAR TO '%Atherton%|%Berkeley%'

Or something like this:

WHERE entities_entity.entity_city_state_zip LIKE ANY(ARRAY['%Atherton%', '%Berkeley%'])
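
As a rough sketch of how the ANY form could be generated from a list of cities instead of 70-odd hand-written ORs, something like the following might work. The helper names (escape_like, build_city_count_query) are made up for illustration, and it assumes a PostgreSQL driver such as psycopg2 that adapts a Python list to an array. It uses ILIKE rather than LIKE so the match stays case-insensitive like the original UPPER(...) query. Note that this only shortens the SQL; a pattern with a leading % is not expected to use a plain btree index either way.

```python
def escape_like(term):
    """Escape LIKE wildcards so a city name is matched literally."""
    return term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")


def build_city_count_query(cities, date_filed):
    """Return (sql, params) for a case-insensitive substring match on both columns."""
    # One pattern per city, e.g. "Atherton" -> "%Atherton%".
    patterns = ["%" + escape_like(city) + "%" for city in cities]
    sql = (
        'SELECT COUNT(*) FROM "entities_entity" '
        'WHERE "date_filed" <= %s '
        'AND ("entity_city_state_zip" ILIKE ANY(%s) '
        'OR "agent_city_state_zip" ILIKE ANY(%s))'
    )
    # The pattern list is passed as a parameter, never pasted into the SQL text.
    return sql, [date_filed, patterns, patterns]


sql, params = build_city_count_query(["Atherton", "Berkeley"], "2016-01-21")
print(sql)
print(params)
```

The (sql, params) pair is meant to be handed to cursor.execute(sql, params) as-is.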


Edited

About Raw SQL query in Django:

  1. https://docs.djangoproject.com/es/1.9/topics/db/sql/
  2. How do I execute raw SQL in a django migration
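
For the raw SQL route, a minimal sketch of running a hand-written count through a DB-API cursor could look like this. The function name count_entities is hypothetical. In Django the cursor would come from django.db.connection.cursor(), and the list parameter again assumes a driver such as psycopg2 that turns a Python list into a PostgreSQL array.

```python
COUNT_SQL = (
    'SELECT COUNT(*) FROM "entities_entity" '
    'WHERE "entity_city_state_zip" LIKE ANY(%s)'
)


def count_entities(cursor, patterns):
    """Run the count on any DB-API cursor and return the integer result."""
    # The patterns travel as a single array parameter, not as SQL text.
    cursor.execute(COUNT_SQL, [list(patterns)])
    return cursor.fetchone()[0]


# Assumed call site inside a Django view (not executed here):
#   from django.db import connection
#   cursor = connection.cursor()
#   total = count_entities(cursor, ['%Atherton%', '%Berkeley%'])
```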

Regards

Volodymyr Matvienko answered Nov 15 '22 08:11


I watched a course on Pluralsight that addressed a very similar issue. The course was "Postgres for .NET Developers" and this was in the section "Fun With Simple SQL", "Full Text Search."

To summarize their solution, using your example:

Create a new column in your table that will represent your entity_city_state_zip as a tsvector:

create table entities_entity (
  date_filed date,
  entity_city_state_zip text,
  csz_search tsvector not null   -- add this column
);

Initially you might have to make it nullable, then populate the data and make it non-nullable.

update entities_entity
set csz_search = to_tsvector (entity_city_state_zip);
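
On an existing table, the nullable-first sequence described above might be scripted roughly as follows. This is only a sketch: the names BACKFILL_STEPS and run_backfill are made up, the cursor is any DB-API cursor, and I have pinned the 'pg_catalog.english' configuration explicitly (an assumption on my part) so the backfill uses the same configuration as the trigger below.

```python
BACKFILL_STEPS = [
    # 1. Add the column as nullable so existing rows do not violate NOT NULL.
    "ALTER TABLE entities_entity ADD COLUMN csz_search tsvector",
    # 2. Populate it for every existing row.
    "UPDATE entities_entity "
    "SET csz_search = to_tsvector('pg_catalog.english', entity_city_state_zip)",
    # 3. Only now enforce the constraint.
    "ALTER TABLE entities_entity ALTER COLUMN csz_search SET NOT NULL",
]


def run_backfill(cursor, steps=BACKFILL_STEPS):
    """Execute the steps in order on a DB-API cursor, ideally in one transaction."""
    for statement in steps:
        cursor.execute(statement)
```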

Next, create a trigger that will cause the new field to be populated any time a record is added or modified:

create trigger entities_insert_update
before insert or update on entities_entity
for each row execute procedure
tsvector_update_trigger(csz_search,'pg_catalog.english',entity_city_state_zip);

Your search queries can now query on the tsvector field rather than the city/state/zip field:

select * from entities_entity
where csz_search @@ to_tsquery('Atherton')

Some notes of interest on this:

  • to_tsquery, in case you haven't used it, is WAY more sophisticated than the example above. It allows AND/OR conditions, prefix matches, etc.
  • it is also case-insensitive, so there is no need for the UPPER calls you have in your query
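
To cover the whole city list in one go, the cities could be combined into a single tsquery string. Here is a hedged sketch: cities_to_tsquery and SEARCH_SQL are names I made up, and it simply drops anything that is not a letter or digit so that stray characters cannot break the tsquery syntax. Keep in mind that full text search matches whole words, so it is not an exact replacement for a '%Atherton%' substring match.

```python
import re


def cities_to_tsquery(cities):
    """Build a to_tsquery() string that matches any of the given city names."""
    alternatives = []
    for city in cities:
        # Keep only alphanumeric word tokens; operators are not allowed inside them.
        words = re.findall(r"[A-Za-z0-9]+", city)
        if words:
            # Every word of a city must be present: "Palo Alto" -> "Palo & Alto".
            alternatives.append("(" + " & ".join(words) + ")")
    # Any one city is enough, so the alternatives are joined with OR.
    return " | ".join(alternatives)


SEARCH_SQL = (
    "SELECT COUNT(*) FROM entities_entity "
    "WHERE csz_search @@ to_tsquery('pg_catalog.english', %s)"
)

print(cities_to_tsquery(["Atherton", "Palo Alto"]))
```

The printed string is (Atherton) | (Palo & Alto), which would be passed as the single parameter of SEARCH_SQL.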

As a final step, put a GIN index on the tsvector field:

create index entities_entity_ix1 on entities_entity
using gin(csz_search);

If I understand the course right, this should make your query fly, and it will overcome the inability of a btree index to help with a LIKE pattern that starts with '%'.

Here is the explain plan on such a query:

Bitmap Heap Scan on entities_entity  (cost=56.16..1204.78 rows=505 width=81)
  Recheck Cond: (csz_search @@ to_tsquery('Atherton'::text))
  ->  Bitmap Index Scan on entities_entity_ix1  (cost=0.00..56.04 rows=505 width=0)
        Index Cond: (csz_search @@ to_tsquery('Atherton'::text))

Hambone answered Nov 15 '22 10:11