I am trying to measure the load that various databases living on the same Postgres server are incurring, to determine how to best split them up across multiple servers. I devised this query:
select
    now() as now,
    datname as database,
    usename as user,
    count(*) as processes
from pg_stat_activity
where state = 'active'
    and waiting = 'f'
    and query not like '%from pg_stat_activity%'
group by
    datname,
    usename;
But there were surprisingly few active processes!
Digging deeper I ran a simple query that returns 20k rows and took 5 seconds to complete, according to the client I ran it from. When I queried pg_stat_activity during that time, the process was idle! I repeated this experiment several times.
The Postgres documentation says active means
The backend is executing a query.
and idle means
The backend is waiting for a new client command.
Is it really more nuanced than that? Why was the process running my query not active when I checked?
If this approach is flawed, what alternatives are there for measuring load at database granularity, other than periodically sampling the number of active processes?
Your expectations regarding active, idle and idle in transaction are correct. The only explanation I can think of is a large delay in displaying the data on the client side: the query has already finished on the server and the session is idle, but the client has not yet shown you the result.
Regarding the load measurement: I would not rely much on the number of active sessions. Catching a fast query in the active state is pure luck. For example, you might check pg_stat_activity every second and see one active session, while between two measurements one database was queried 10 times and another once. None of those executions would show up, because they were active only between your samples. And those 10 + 1 active states (although they mean that one database is queried 10 times more often) do not mean you should worry about load at all, because the cluster is so lightly loaded that you cannot even catch the executions. By the same token, catching many active sessions would not necessarily mean that the server is actually loaded.
So at the very least, add now() - query_start to your query to catch longer-running queries. Better still, record the execution time of some frequently run queries and check whether it degrades over time. Or select the pid and check the resources consumed by that process.
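As a rough sketch of the first suggestion, the sampling query could be extended with the elapsed time and the pid. The column aliases (running_for, query_text) and the 60-character truncation are illustrative choices, not part of the original query, and the state and pid columns assume PostgreSQL 9.2 or later:

```sql
-- Sketch: list active sessions with how long the current query has been running.
-- The aliases and the truncation length are illustrative assumptions.
select
    pid,
    datname as database,
    usename,
    now() - query_start as running_for,   -- elapsed time of the current query
    left(query, 60) as query_text         -- first 60 characters only
from pg_stat_activity
where state = 'active'
    and pid <> pg_backend_pid()           -- exclude this monitoring session
order by running_for desc;
```

The pid can then be passed to an operating system tool such as top or ps to see the CPU and memory used by that backend.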
By the way, for longer queries, look into pg_stat_statements. Watching how its numbers change over time can give you an idea of how the load is changing.
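A possible starting point for per-database numbers is to aggregate pg_stat_statements by database. This is only a sketch: it assumes the extension is listed in shared_preload_libraries and has been created with CREATE EXTENSION, and it uses the total_exec_time column, which is named total_time on versions before PostgreSQL 13:

```sql
-- Sketch: cumulative calls and execution time per database.
-- Assumes the pg_stat_statements extension is installed and loaded.
select
    d.datname as database,
    sum(s.calls) as calls,
    sum(s.total_exec_time) as total_ms    -- use total_time before PostgreSQL 13
from pg_stat_statements s
join pg_database d on d.oid = s.dbid
group by d.datname
order by total_ms desc;
```

The counters are cumulative since the last reset, so comparing two snapshots taken some time apart shows how much work each database generated in that interval.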