PostgreSQL: How do I ensure that indexes are in memory?

I have been trying out postgres 9.3 running on an Azure VM on Windows Server 2012. I was originally running it on a 7GB server... I am now running it on a 14GB Azure VM. I went up a size when trying to solve the problem described below.

I am quite new to PostgreSQL by the way, so I am only getting to know the configuration options bit by bit. Also, while I'd love to run it on Linux, my colleagues and I simply don't have the expertise to address issues when things go wrong in Linux, so Windows is our only option.

Problem description:

I have a table called test_table; it currently stores around 90 million rows. It will grow by around 3-4 million rows per month. There are 2 columns in test_table:

id (bigserial)
url (character varying(300))

I created indexes after importing the data from a few CSV files. Both columns are indexed.... the id is the primary key. The index on the url is a normal btree created using the defaults through pgAdmin.

When I ran:

SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;

... The total size is 5980MB

The individual sizes of the 2 indexes in question are as follows, and I got them by running:

 # SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE 
  reltype=0 ORDER BY relpages DESC LIMIT 10;


          relname          |  mb  | reltype
---------------------------+------+---------
 test_url_idx              | 3684 |       0
 test_pk                   | 2161 |       0

There are other indexes on other smaller tables, but they are tiny (< 5MB).... so I ignored them here
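
For reference, the built-in size helpers should give the same information without the manual relpages arithmetic; something along these lines should be equivalent:

 # SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
   FROM pg_class
   WHERE relkind = 'i'          -- indexes only
   ORDER BY pg_relation_size(oid) DESC
   LIMIT 10;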

The trouble is the speed (or lack of it) when querying test_table on the url column, particularly when using a wildcard in the search. e.g.

select * from test_table where url like 'orange%' limit 20;

...would take anything from 20-40 seconds to run.

Running explain analyze on the above gives the following:

# explain analyze select * from test_table where
   url like 'orange%' limit 20;

                              QUERY PLAN
-----------------------------------------------------------------
 Limit  (cost=0.00..4787.96 rows=20 width=57)
        (actual time=0.304..1898.583 rows=20 loops=1)
   ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57)
        (actual time=0.302..1898.542 rows=20 loops=1)
         Filter: ((url)::text ~~ 'orange%'::text)
         Rows Removed by Filter: 210286
 Total runtime: 1898.650 ms
(5 rows)

Taking another example... this time with the wildcard between american and .com....

# explain  select * from test_table where url 
   like 'american%.com' limit 50;

                              QUERY PLAN
-------------------------------------------------------
 Limit  (cost=0.00..11969.90 rows=50 width=57)
   ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57)
         Filter: ((url)::text ~~ 'american%.com'::text)
(3 rows)


# explain analyze select * from test_table where url 
    like 'american%.com' limit 50;

                              QUERY PLAN
-----------------------------------------------------
 Limit  (cost=0.00..11969.90 rows=50 width=57)
        (actual time=83.470..3035.696 rows=50 loops=1)
   ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57)
        (actual time=83.467..3035.614 rows=50 loops=1)
         Filter: ((url)::text ~~ 'american%.com'::text)
         Rows Removed by Filter: 276142
 Total runtime: 3035.774 ms
(5 rows)

I then went from a 7GB to a 14GB server. Query Speeds were no better.

Observations on the server

  • I can see that Memory usage never really goes beyond 2MB.
  • Disk reads go off the charts when running a query using a LIKE statement.
  • Query speed is perfectly fine when matching against the id (primary key)

The postgresql.conf file has had only a few changes from the defaults. Note that I took some of these suggestions from the following blog post: http://www.gabrielweinberg.com/blog/2011/05/postgresql.html.

Changes to conf:

shared_buffers = 512MB  

checkpoint_segments = 10 

(I changed checkpoint_segments as I got lots of warnings when loading in CSV files... although a production database will not be very write intensive so this can be changed back to 3 if necessary...)

cpu_index_tuple_cost = 0.0005       
effective_cache_size = 10GB    # recommendation in the blog post was 2GB...

On the server itself, in the Task Manager -> Performance tab, the following are probably the relevant bits for someone who can assist:

CPU: rarely over 2% (regardless of what queries are run... it hit 11% once when I was importing a 6GB CSV file)

Memory: 1.5/14.0GB (11%)

More details on Memory:

  • In use: 1.4GB
  • Available: 12.5GB
  • Committed: 1.9/16.1 GB
  • Cached: 835MB
  • Paged Pool: 95.2MB
  • Non-paged pool: 71.2 MB

Questions

  1. How can I ensure an index will sit in memory (providing it doesn't get too big for memory)? Is it just configuration tweaking I need here?
  2. Is implementing my own search index (e.g. Lucene) a better option here?
  3. Are the full-text indexing features in postgres going to improve performance dramatically, even if I can solve the index in memory issue?

Thanks for reading.

asked Jan 07 '14 by collumbo

1 Answer

Those seq scans make it look like you didn't run analyze on the table after importing your data.

http://www.postgresql.org/docs/current/static/sql-analyze.html

During normal operation, scheduling vacuum analyze runs isn't useful, because autovacuum kicks in periodically. But it is important after massive writes, such as bulk imports.
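
For example, using the table from the question, something like this after a bulk load should be enough:

    -- refresh planner statistics after a bulk import
    ANALYZE test_table;

    -- or reclaim dead space and refresh statistics in one pass
    VACUUM ANALYZE test_table;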

On a slightly related note, see this reversed index tip on Pavel's PostgreSQL Tricks site if you ever need to run queries anchored at the end of the string rather than at the beginning, e.g. like '%.com':

http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#section_20
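
A minimal sketch of that trick for your table might look like the following (the index name is made up here, and text_pattern_ops is only needed if the database does not use the C locale):

    -- expression index on the reversed url
    CREATE INDEX test_url_rev_idx
        ON test_table (reverse(url) text_pattern_ops);

    -- a suffix search becomes a prefix search on the reversed value
    SELECT * FROM test_table
    WHERE reverse(url) LIKE reverse('%.com')   -- i.e. 'moc.%'
    LIMIT 20;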


Regarding your actual questions, be wary that some of the suggestions in the post you linked to are dubious at best. Changing the cost of index use is frequently dubious, and disabling seq scans is downright silly. (Sometimes it is cheaper to seq scan a table than it is to use an index.)

With that being said:

  1. Postgres primarily caches indexes based on how often they're used, and it will not use an index if the stats suggest that it shouldn't -- hence the need to analyze after an import. Giving Postgres plenty of memory will, of course, increase the likelihood that it's in memory too, but keep the latter points in mind (the first sketch after this list shows one way to check whether index reads are hitting the cache).
  2. and 3. Full text search works fine; a minimal sketch is included below.
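
For point 1, one way to see whether index reads are actually being served from shared buffers is the pg_statio_user_indexes statistics view (the query below is only illustrative; run it after exercising your workload):

    -- rough buffer-cache hit ratio for the indexes on test_table
    SELECT indexrelname,
           idx_blks_read,   -- blocks that had to be read in
           idx_blks_hit,    -- blocks already found in shared_buffers
           round(idx_blks_hit::numeric
                 / nullif(idx_blks_hit + idx_blks_read, 0), 3) AS hit_ratio
    FROM pg_statio_user_indexes
    WHERE relname = 'test_table';

For points 2 and 3, a minimal full-text-search sketch might look like this (the index name and the 'simple' configuration are illustrative choices, not taken from your setup):

    -- GIN index over a tsvector derived from url
    CREATE INDEX test_url_fts_idx ON test_table
        USING gin (to_tsvector('simple', url));

    -- lexeme-based lookup that can use the GIN index
    SELECT * FROM test_table
    WHERE to_tsvector('simple', url) @@ to_tsquery('simple', 'orange')
    LIMIT 20;

Keep in mind that full-text search matches whole lexemes, so it isn't a drop-in replacement for arbitrary LIKE patterns.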

For further reading on fine-tuning, see the manual and:

http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server

Two last notes on your schema:

  1. Last I checked, bigint (bigserial in your case) was slower than plain int. (This was a while ago, so the difference might now be negligible on modern, 64-bit servers.) Unless you foresee that you'll actually need more than roughly 2.1 billion entries, int is plenty and takes less space.
  2. From an implementation standpoint, the only difference between a varchar(300) and a varchar without a specified length (or text, for that matter) is an extra check constraint on the length. If you don't actually need the data to fit within that size and are only specifying it out of habit, your db inserts and updates will run faster without that constraint.
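
If you do drop the limit, something along these lines should do it (if I recall correctly, removing a varchar length limit no longer forces a table rewrite on 9.2 and later):

    -- remove the length check on url; the stored data is unchanged
    ALTER TABLE test_table ALTER COLUMN url TYPE varchar;
    -- or, equivalently for this purpose:
    -- ALTER TABLE test_table ALTER COLUMN url TYPE text;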
answered by Denis de Bernardy