Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

PostgreSQL: How to optimize my database for storing and querying a huge graph

I'm running PostgreSQL 8.3 on a 1.83 GHz Intel Core Duo Mac Mini with 1GB of RAM and Mac OS X 10.5.8. I have a stored a huge graph in my PostgreSQL database. It consists of 1.6 million nodes and 30 million edges. My database schema is like:

CREATE TABLE nodes (id INTEGER PRIMARY KEY,title VARCHAR(256));
CREATE TABLE edges (id INTEGER,link INTEGER,PRIMARY KEY (id,link));
CREATE INDEX id_idx ON edges (id);
CREATE INDEX link_idx ON edges (link);

The data in the table edges looks like

id link 
1  234
1  88865
1  6
2  365
2  12
...

So it stores for each node with id x the outgoing link to id y.

The time for searching all the outgoing links is ok:

=# explain analyze select link from edges where id=4620;
                           QUERY PLAN                                                        
    ---------------------------------------------------------------------------------
     Index Scan using id_idx on edges  (cost=0.00..101.61 rows=3067 width=4) (actual time=135.507..157.982 rows=1052 loops=1)
       Index Cond: (id = 4620)
     Total runtime: 158.348 ms
    (3 rows)

However, if I search for the incoming links to a node, the database is more than 100 times slower (although the resulting number of incoming links is only 5-10 times higher than the number of outgoing links):

=# explain analyze select id from edges where link=4620;
                         QUERY PLAN                                                           
----------------------------------------------------------------------------------
     Bitmap Heap Scan on edges  (cost=846.31..100697.48 rows=51016 width=4) (actual time=322.584..48983.478 rows=26887 loops=1)
       Recheck Cond: (link = 4620)
       ->  Bitmap Index Scan on link_idx  (cost=0.00..833.56 rows=51016 width=0) (actual time=298.132..298.132 rows=26887 loops=1)
             Index Cond: (link = 4620)
     Total runtime: 49001.936 ms
    (5 rows)

I tried to force Postgres not to use a Bitmap Scan via

=# set enable_bitmapscan = false;

but the speed of the query for incoming links didn't improve:

=# explain analyze select id from edges where link=1588;
                      QUERY PLAN                                                           
-------------------------------------------------------------------------------------------
 Index Scan using link_idx on edges  (cost=0.00..4467.63 rows=1143 width=4) (actual time=110.302..51275.822 rows=43629 loops=1)
   Index Cond: (link = 1588)
 Total runtime: 51300.041 ms
(3 rows)

I also increased my shared buffers from 24MB to 512MB, but it didn't help. So I wonder why my queries for outgoing and incoming links show such an asymmetric behaviour? Is something wrong with my choice of indexes? Or should I better create a third table holding all the incoming links for a node with id x? But that would be quite a waste of disk space. But since I'm new into SQL databases maybe I'm missing something basic here?

like image 280
asmaier Avatar asked Nov 30 '09 22:11

asmaier


1 Answers

I guess it is because of a “density” of same-key-records on the disk. I think the records with same id are stored in dense (i.e., few number of blocks) and those with same link are stored in sparse (i.e., distributed to huge number of blocks). If you have inserted records in the order of id, this situation can be happen.

Assume that: 1. there are 10,000 records, 2. they're stored in the order such as (id, link) = (1, 1), (1, 2),..., (1, 100), (2, 1)..., and 3. 50 records can be stored in a block.

In the assumption above, block #1~#3 consists of the records (1, 1)~(1, 50), (1, 51)~(1, 100) and (2, 1)~(2, 50) respectively.

When you SELECT * FROM edges WHERE id=1, only 2 blocks (#1, #2) is to be loaded and scanned. On the other hand, SELECT * FROM edges WHERE link=1 requires 50 blocks (#1, #3, #5,...), even though the number of rows are same.

like image 62
habe Avatar answered Oct 21 '22 18:10

habe