
Issues optimizing postgres search query

I am having an issue with the following PostgreSQL query: it takes more than 10 seconds to run. Is there any way to speed this query up to a reasonable speed? I am simply looking for the most relevant search terms associated with videos in a very large database.

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

When the query is run with EXPLAIN ANALYZE, the resulting query plan is as follows (http://explain.depesz.com/s/yDJ):

  Limit  (cost=389625.50..389630.00 rows=1800 width=4) (actual time=11766.693..11770.001 rows=1800 loops=1)
     Output: (count(*)), videoid
     ->  Sort  (cost=389625.50..389692.68 rows=26873 width=4) (actual time=11766.689..11767.818 rows=1800 loops=1)
           Output: (count(*)), videoid
           Sort Key: (count(*))
           Sort Method: top-N heapsort  Memory: 181kB
           ->  HashAggregate  (cost=387769.41..388038.14 rows=26873 width=4) (actual time=9215.653..10641.993 rows=1632578 loops=1)
                 Output: count(*), videoid
                 ->  Bitmap Heap Scan on public.term_search  (cost=44915.83..378163.38 rows=1921207 width=4) (actual time=312.449..7026.036 rows=2047691 loops=1)
                       Output: id, videoid, word, termindex, weight
                       Recheck Cond: (((term_search.word)::text = 'tester'::text) OR ((term_search.word)::text = 'question'::text) OR ((term_search.word)::text = 'one'::text))
                       Rows Removed by Index Recheck: 25512434
                       ->  BitmapOr  (cost=44915.83..44915.83 rows=1950031 width=0) (actual time=288.937..288.937 rows=0 loops=1)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..8552.83 rows=383502 width=0) (actual time=89.266..89.266 rows=419750 loops=1)
                                   Index Cond: ((term_search.word)::text = 'tester'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..13171.84 rows=590836 width=0) (actual time=89.700..89.700 rows=604348 loops=1)
                                   Index Cond: ((term_search.word)::text = 'question'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..21750.26 rows=975693 width=0) (actual time=109.964..109.964 rows=1023593 loops=1)
                                   Index Cond: ((term_search.word)::text = 'one'::text)

The schema for the table is as follows:

    Column   |          Type          |                        Modifiers                         | Storage  | Description 
  -----------+------------------------+----------------------------------------------------------+----------+-------------
   id        | integer                | not null default nextval('term_search_id_seq'::regclass) | plain    | 
   videoid   | integer                |                                                          | plain    | 
   word      | character varying(100) |                                                          | extended | 
   termindex | character varying(15)  |                                                          | extended | 
   weight    | smallint               |                                                          | plain    | 
  Indexes:
      "term_search_pkey" PRIMARY KEY, btree (id)
      "search_term_exists_idx" btree (videoid, word)
      "terms_caverphone_idx" btree (termindex)
      "terms_video_idx" btree (videoid)
      "terms_word_idx" btree (word, videoid)
  Foreign-key constraints:
      "term_search_videoid_fkey" FOREIGN KEY (videoid) REFERENCES videos(id) ON DELETE CASCADE
  Has OIDs: no

I have managed to get it down to 7 seconds with index-only scans, but that is still not fast enough. I am running PostgreSQL 9.3 on Ubuntu 14.04 on an AWS r3.xlarge instance, with approximately 50 million rows in the table. Any advice is greatly appreciated!

EDIT:

Attached is the result of SELECT schemaname,tablename,attname,null_frac,avg_width,n_distinct FROM pg_stats WHERE schemaname='public' and tablename='term_search';

 schemaname |  tablename  |  attname  | null_frac | avg_width | n_distinct 
 ------------+-------------+-----------+-----------+-----------+------------
 public     | term_search | id        |         0 |         4 |         -1
 public     | term_search | videoid   |         0 |         4 |     568632
 public     | term_search | word      |         0 |         6 |       5054
 public     | term_search | termindex |         0 |        11 |       2485
 public     | term_search | weight    |         0 |         2 |          3
asked by user3354369


2 Answers

If I had a chance to disconnect users for a night, I would:

  • create a new table with the words from term_search,
  • add a reference from term_search to the new table,
  • drop the column word,

something like this:

create table words (
    word_id serial primary key,
    word text);

insert into words (word)
    select distinct word
    from term_search;

alter table term_search add column word_id integer;

update term_search t
    set word_id = w.word_id
    from words w
    where t.word = w.word;

alter table term_search add constraint term_search_word_fkey 
    foreign key (word_id) references words (word_id);
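
For the test query below to avoid sequential scans, the new word_id column would also need indexing; something along these lines (the index names and exact definitions are just an illustrative guess):

-- unique lookup of a word's surrogate key
create unique index words_word_idx on words (word);

-- mirrors the existing terms_word_idx, but on the surrogate key
create index term_search_word_id_idx on term_search (word_id, videoid);

analyze term_search;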

Test:

SELECT count(*), videoid 
FROM term_search t
JOIN words w on t.word_id = w.word_id
WHERE w.word = 'tester' 
   OR w.word = 'question' 
   OR w.word = 'one' 
GROUP BY videoid 
ORDER BY count(*) desc 
LIMIT 1800;    

-- if it was faster, then
    alter table term_search drop column word;
-- and on the fly...
    alter table term_search alter termindex type text;

After the revolution I'd have to take care of inserts and updates on term_search. I'd probably create a view with rules for insert and update.
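
A minimal sketch of what that could look like (names and details are purely illustrative; on 9.3 an INSTEAD OF trigger on the view would work just as well):

create view term_search_v as
    select t.id, t.videoid, w.word, t.termindex, t.weight
    from term_search t
    join words w using (word_id);

create rule term_search_v_insert as on insert to term_search_v
do instead (
    -- add the word to the dictionary if it is not known yet
    insert into words (word)
        select new.word
        where not exists (select 1 from words where word = new.word);
    -- then store the row with the surrogate key
    insert into term_search (videoid, word_id, termindex, weight)
        select new.videoid, w.word_id, new.termindex, new.weight
        from words w
        where w.word = new.word
);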

answered by klin


Let's start by rephrasing the query to explain what it's really trying to do.

The query:

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

seems to mean:

"In a table of search terms, find me videos with the search terms tester, question or one. Count the matches for each video and return the 1800 videos with the most matches".

or, more generally:

"Find me the videos that best match my search terms and show me the top n best matches".

Correct?

If so, why aren't you using PostgreSQL's built-in full-text search and full-text indexing? An indexed tsquery match against a tsvector per video is likely to be a win here. Full-text search has fuzzy matching, ranking, and pretty much everything else you're going to want - and unlike your current approach it won't require the whole data set to be materialized and sorted only to discard most of it.

You haven't provided sample data, so I can't really do a demo.
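
To at least sketch the shape of that approach with made-up names (a per-video document materialized from term_search; adjust to however the data is actually produced):

-- one searchable tsvector document per video (illustrative only)
create table video_terms as
    select videoid,
           to_tsvector('english', string_agg(word, ' ')) as terms
    from term_search
    group by videoid;

create index video_terms_terms_idx on video_terms using gin (terms);

-- "the 1800 videos best matching tester, question or one"
select v.videoid, ts_rank(v.terms, query) as rank
from video_terms v,
     to_tsquery('english', 'tester | question | one') query
where v.terms @@ query
order by rank desc
limit 1800;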


How PostgreSQL currently executes your query could be explained like this:

  • Create a map with one bit for every disk page (8kb) in the table, where true indicates that the page might contain one or more matching rows.

  • For each search term, scan the index terms_word_idx and update the bitmap, setting the bit wherever a match is found.

  • Scan the table, skipping over pages where the bitmap says there can't be any matches, looking for rows that have any of the words. This is like a fast, skip-over-blanks seqscan. It's actually not tons faster than a plain seqscan if the percentage of matches is high.

  • For each matching row, sort it into a series of "buckets" based on the video id. Then at the end, count how many rows are in each bucket and return the count + the video ID for that bucket. (It's not that simple, but close enough).

  • As you count each bucket, put the result in between the results with next-highest and next-lowest counts.

  • Take the top 1800 results, and throw away all the rest of your hard work.

That doesn't sound like much fun, but it doesn't have any choice. A b-tree index can't be descended to search simultaneously for multiple terms, so it has to do multiple index scans. The rest kind of follows from that.

So: to make this more efficient, you need to fundamentally change how you tackle the problem. Adding an index or tuning some parameters isn't going to suddenly make this take 0.5s.

answered by Craig Ringer