I have been trying out PostgreSQL 9.3 running on an Azure VM on Windows Server 2012. I was originally running it on a 7GB server and am now running it on a 14GB Azure VM; I went up a size while trying to solve the problem described below.
I am quite new to PostgreSQL, by the way, so I am only getting to know the configuration options bit by bit. Also, while I'd love to run it on Linux, my colleagues and I simply don't have the expertise to address issues when things go wrong in Linux, so Windows is our only option.
Problem description:
I have a table called test_table; it currently stores around 90 million rows. It will grow by around 3-4 million rows per month. There are 2 columns in test_table:
id (bigserial)
url (character varying(300))
I created the indexes after importing the data from a few CSV files. Both columns are indexed; the id is the primary key. The index on url is a plain btree created with the defaults through pgAdmin.
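For reference, the table and indexes described above correspond roughly to the following DDL (an approximation pieced together from the description; the index names are the ones reported by pg_class below):

-- Approximate schema; exact DDL may differ from what was created through pgAdmin
CREATE TABLE test_table (
    id  bigserial,
    url character varying(300),
    CONSTRAINT test_pk PRIMARY KEY (id)
);

CREATE INDEX test_url_idx ON test_table (url);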
When I ran:
SELECT sum(((relpages*8)/1024)) as MB FROM pg_class WHERE reltype=0;
... The total size is 5980MB
The individual sizes of the 2 indexes in question here are as follows, and I got them by running:
# SELECT relname, ((relpages*8)/1024) as MB, reltype FROM pg_class WHERE
reltype=0 ORDER BY relpages DESC LIMIT 10;
relname | mb | reltype
----------------------------------+------+--------
test_url_idx | 3684 | 0
test_pk | 2161 | 0
There are other indexes on other, smaller tables, but they are tiny (< 5 MB), so I ignored them here.
The trouble comes when querying test_table on the url column, particularly with a wildcard in the search: the queries are slow. For example:
select * from test_table where url like 'orange%' limit 20;
...would take anything from 20-40 seconds to run.
Running explain analyze on the above gives the following:
# explain analyze select * from test_table where
url like 'orange%' limit 20;
QUERY PLAN
-----------------------------------------------------------------
Limit  (cost=0.00..4787.96 rows=20 width=57) (actual time=0.304..1898.583 rows=20 loops=1)
  ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57) (actual time=0.302..1898.542 rows=20 loops=1)
Filter: ((url)::text ~~ 'orange%'::text)
Rows Removed by Filter: 210286
Total runtime: 1898.650 ms
(5 rows)
Taking another example, this time with the wildcard between 'american' and '.com':
# explain select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-------------------------------------------------------
Limit (cost=0.00..11969.90 rows=50 width=57)
-> Seq Scan on test_table (cost=0.00..2303247.60 rows=9621 width=57)
Filter: ((url)::text ~~ 'american%.com'::text)
(3 rows)
# explain analyze select * from test_table where url
like 'american%.com' limit 50;
QUERY PLAN
-----------------------------------------------------
Limit  (cost=0.00..11969.90 rows=50 width=57) (actual time=83.470..3035.696 rows=50 loops=1)
  ->  Seq Scan on test_table  (cost=0.00..2303247.60 rows=9621 width=57) (actual time=83.467..3035.614 rows=50 loops=1)
Filter: ((url)::text ~~ 'american%.com'::text)
Rows Removed by Filter: 276142
Total runtime: 3035.774 ms
(5 rows)
I then went from a 7GB to a 14GB server. Query Speeds were no better.
Observations on the server
The postgresql.conf file has had only a few changes from the defaults. Note that I took some of these suggestions from the following blog post: http://www.gabrielweinberg.com/blog/2011/05/postgresql.html.
Changes to conf:
shared_buffers = 512MB
checkpoint_segments = 10
(I changed checkpoint_segments as I got lots of warnings when loading in CSV files... although a production database will not be very write intensive so this can be changed back to 3 if necessary...)
cpu_index_tuple_cost = 0.0005
effective_cache_size = 10GB # recommendation in the blog post was 2GB...
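(As a sanity check, the values actually in effect on the running server can be confirmed from psql; shared_buffers only changes after a service restart.)

-- Values the running server is actually using
SHOW shared_buffers;
SHOW checkpoint_segments;
SHOW effective_cache_size;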
On the server itself, in the Task Manager -> Performance tab, the following are probably the relevant bits for someone who can assist:
CPU: rarely over 2% (regardless of what queries are run... it hit 11% once when I was importing a 6GB CSV file)
Memory: 1.5/14.0GB (11%)
More details on memory: (screenshot omitted)
Questions
Thanks for reading.
Those seq scans make it look like you didn't run analyze on the table after importing your data.
http://www.postgresql.org/docs/current/static/sql-analyze.html
During normal operation, scheduling vacuum analyze isn't useful, because autovacuum periodically kicks in. But it is important when doing massive writes, such as during imports.
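In this case, something along these lines after the CSV load should be enough (table name taken from the question):

-- Refresh planner statistics (and clean up after the bulk load)
VACUUM ANALYZE test_table;

-- Or, if only fresh statistics are needed:
ANALYZE test_table;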
On a slightly related note, see this reversed index tip on Pavel's PostgreSQL Tricks site if you ever need to run queries anchored at the end rather than at the beginning, e.g. like '%.com':
http://postgres.cz/wiki/PostgreSQL_SQL_Tricks_I#section_20
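A minimal sketch of that trick, assuming an expression index on reverse(url) (the index name here is made up; text_pattern_ops is included so the LIKE prefix match can use a btree under a non-C locale):

-- Index the reversed URL so suffix searches become prefix searches
CREATE INDEX test_url_rev_idx ON test_table (reverse(url) text_pattern_ops);

-- A query anchored at the end ('%.com') becomes a prefix match on the reversed string
SELECT *
FROM test_table
WHERE reverse(url) LIKE reverse('.com') || '%'
LIMIT 50;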
Regarding your actual questions, be wary that some of the suggestions in the post you linked to are dubious at best. Changing the cost of index use is frequently dubious, and disabling seq scans is downright silly. (Sometimes it is cheaper to seq scan a table than it is to use an index.)
With that being said:
Run analyze after an import. Giving Postgres plenty of memory will, of course, increase the likelihood it's in memory too, but keep the latter points in mind. For further reading on fine-tuning, see the manual and:
http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
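As a rough way to gauge how much memory the table and its indexes would need in order to stay cached (an alternative to the relpages arithmetic used above):

-- Table plus all of its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('test_table'));

-- Individual relations
SELECT relname, pg_size_pretty(pg_relation_size(oid))
FROM pg_class
WHERE relname IN ('test_table', 'test_pk', 'test_url_idx');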
Two last notes on your schema:
The only difference between a varchar(300) and a varchar without a specified length (or text, for that matter) is an extra check constraint on the length. If you don't actually need the data to fit that size and are merely doing so out of habit, your inserts and updates will run faster by getting rid of that constraint.
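If you do drop the limit, a minimal sketch (assuming nothing in your application actually depends on the 300-character cap):

-- varchar and text share the same on-disk format, so on 9.1+ this should be
-- a quick catalog change rather than a full table rewrite
ALTER TABLE test_table ALTER COLUMN url TYPE text;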