I am having an issue with the following PostgreSQL query: it takes more than 10 seconds to run. Is there any way to speed it up to a reasonable time? I am simply looking for the most relevant search terms associated with videos on a very large database.
SELECT count(*), videoid
FROM term_search
where word = 'tester'
OR word = 'question'
OR word = 'one'
group by videoid
order by count(*) desc
limit 1800;
When the query is run with EXPLAIN (ANALYZE), the resulting query plan is as follows (http://explain.depesz.com/s/yDJ):
Limit  (cost=389625.50..389630.00 rows=1800 width=4) (actual time=11766.693..11770.001 rows=1800 loops=1)
  Output: (count(*)), videoid
  ->  Sort  (cost=389625.50..389692.68 rows=26873 width=4) (actual time=11766.689..11767.818 rows=1800 loops=1)
        Output: (count(*)), videoid
        Sort Key: (count(*))
        Sort Method: top-N heapsort  Memory: 181kB
        ->  HashAggregate  (cost=387769.41..388038.14 rows=26873 width=4) (actual time=9215.653..10641.993 rows=1632578 loops=1)
              Output: count(*), videoid
              ->  Bitmap Heap Scan on public.term_search  (cost=44915.83..378163.38 rows=1921207 width=4) (actual time=312.449..7026.036 rows=2047691 loops=1)
                    Output: id, videoid, word, termindex, weight
                    Recheck Cond: (((term_search.word)::text = 'tester'::text) OR ((term_search.word)::text = 'question'::text) OR ((term_search.word)::text = 'one'::text))
                    Rows Removed by Index Recheck: 25512434
                    ->  BitmapOr  (cost=44915.83..44915.83 rows=1950031 width=0) (actual time=288.937..288.937 rows=0 loops=1)
                          ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..8552.83 rows=383502 width=0) (actual time=89.266..89.266 rows=419750 loops=1)
                                Index Cond: ((term_search.word)::text = 'tester'::text)
                          ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..13171.84 rows=590836 width=0) (actual time=89.700..89.700 rows=604348 loops=1)
                                Index Cond: ((term_search.word)::text = 'question'::text)
                          ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..21750.26 rows=975693 width=0) (actual time=109.964..109.964 rows=1023593 loops=1)
                                Index Cond: ((term_search.word)::text = 'one'::text)
The schema for the table is as follows:
Column | Type | Modifiers | Storage | Description
-----------+------------------------+----------------------------------------------------------+----------+-------------
id | integer | not null default nextval('term_search_id_seq'::regclass) | plain |
videoid | integer | | plain |
word | character varying(100) | | extended |
termindex | character varying(15) | | extended |
weight | smallint | | plain |
Indexes:
"term_search_pkey" PRIMARY KEY, btree (id)
"search_term_exists_idx" btree (videoid, word)
"terms_caverphone_idx" btree (termindex)
"terms_video_idx" btree (videoid)
"terms_word_idx" btree (word, videoid)
Foreign-key constraints:
"term_search_videoid_fkey" FOREIGN KEY (videoid) REFERENCES videos(id) ON DELETE CASCADE
Has OIDs: no
I have managed to get it down to 7 seconds with index-only scans, but that is still not fast enough. I am running PostgreSQL 9.3 on Ubuntu 14.04 on an AWS r3.xlarge instance, with approximately 50 million rows in the table. Any advice is greatly appreciated!
EDIT:
Attached is the result of SELECT schemaname,tablename,attname,null_frac,avg_width,n_distinct FROM pg_stats WHERE schemaname='public' and tablename='term_search';
schemaname | tablename | attname | null_frac | avg_width | n_distinct
------------+-------------+-----------+-----------+-----------+------------
public | term_search | id | 0 | 4 | -1
public | term_search | videoid | 0 | 4 | 568632
public | term_search | word | 0 | 6 | 5054
public | term_search | termindex | 0 | 11 | 2485
public | term_search | weight | 0 | 2 | 3
If I have a chance to disconnect users for a night I would normalize the words out of term_search: create a lookup table words, add a word_id column to term_search and fill it, something like this:
create table words (
word_id serial primary key,
word text);
insert into words (word)
select distinct word
from term_search;
alter table term_search add column word_id integer;
update term_search t
set word_id = w.word_id
from words w
where t.word = w.word;
alter table term_search add constraint term_search_word_fkey
foreign key (word_id) references words (word_id);
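The new word_id column will also need an index of its own to take over from terms_word_idx for lookups through words; a sketch (the index name is just my choice):

create index term_search_word_id_videoid_idx
    on term_search (word_id, videoid);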
Test:
SELECT count(*), videoid
FROM term_search t
JOIN words w on t.word_id = w.word_id
WHERE w.word = 'tester'
OR w.word = 'question'
OR w.word = 'one'
GROUP BY videoid
ORDER BY count(*) desc
LIMIT 1800;
-- if this was faster, then:
alter table term_search drop column word;
-- and on the fly...
alter table term_search alter termindex type text;
After the revolution I'd have to take care of inserts and updates on term_search. I'd probably create a view with rules for insert and update.
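A rough, untested sketch of what that view could look like (names are illustrative, and the insert rule silently drops rows whose word is not yet in words - an INSTEAD OF trigger on the view would let you insert missing words first):

-- expose the old (videoid, word, ...) shape on top of the normalized tables
create view term_search_v as
select t.id, t.videoid, w.word, t.termindex, t.weight
from term_search t
join words w on w.word_id = t.word_id;

-- rewrite inserts against the view into inserts on term_search,
-- looking up the word_id from words
create rule term_search_v_insert as on insert to term_search_v
do instead
insert into term_search (videoid, word_id, termindex, weight)
select new.videoid, w.word_id, new.termindex, new.weight
from words w
where w.word = new.word;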
Let's start by rephrasing the query to explain what it's really trying to do.
The query:
SELECT count(*), videoid
FROM term_search
where word = 'tester'
OR word = 'question'
OR word = 'one'
group by videoid
order by count(*) desc
limit 1800;
seems to mean:
"In a table of search terms, find me videos with the search terms tester, question or one. Count the matches for each video and return the 1800 videos with the most matches".
or, more generally:
"Find me the videos that best match my search terms and show me the top n best matches".
Correct?
If so, why aren't you using PostgreSQL's built-in full-text search and full-text indexing? An indexed tsquery match against a tsvector per video is likely to be a win here. Full-text search has fuzzy matching, ranking, and pretty much everything else you're going to want - and unlike your current approach it won't require the whole data set to be materialized and sorted only to discard most of it.
You haven't provided sample data, so I can't really do a demo.
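Still, purely as an illustration, here is a rough, untested sketch of one way it could look; the terms_tsv column on videos, the 'simple' text search configuration and the index name are all assumptions, and you would need a trigger or batch job to keep the column in sync with term_search:

-- build one tsvector per video from the existing search terms
alter table videos add column terms_tsv tsvector;

update videos v
set terms_tsv = s.tsv
from (
    select videoid, to_tsvector('simple', string_agg(word, ' ')) as tsv
    from term_search
    group by videoid
) s
where v.id = s.videoid;

create index videos_terms_tsv_idx on videos using gin (terms_tsv);

-- the search then becomes an indexed @@ match plus a rank sort
select v.id as videoid, ts_rank(v.terms_tsv, q) as rank
from videos v, to_tsquery('simple', 'tester | question | one') q
where v.terms_tsv @@ q
order by rank desc
limit 1800;

ts_rank then rewards videos that match more of the supplied terms, which is roughly what your current count(*) ordering is approximating.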
How PostgreSQL currently executes your query could be explained like this:
1. Create a map with one bit for every disk page (8 kB) in the table, where true indicates that the page might contain one or more matching rows.
2. For each search term, scan the index terms_word_idx and set the bit in the bitmap for every page where a match is found.
3. Scan the table, skipping over pages where the bitmap says there can't be any matches, looking for rows that have any of the words. This is like a fast, skip-over-the-blanks seqscan. It's actually not tons faster than a plain seqscan if the percentage of matches is high.
4. For each matching row, sort it into a series of "buckets" based on the video id. Then at the end, count how many rows are in each bucket and return the count plus the video ID for that bucket. (It's not that simple, but close enough.)
5. As each bucket is counted, insert the result between the results with the next-highest and next-lowest counts.
That doesn't sound like much fun, but it doesn't have any choice. A b-tree index can't be descended to search simultaneously for multiple terms, so it has to do multiple index scans. The rest kind of follows from that.
So: to make this more efficient, you need to fundamentally change how you tackle the problem. Adding an index or tuning some parameters isn't going to suddenly make this take 0.5s.