Been tearing my hair out for a couple of days over this one. We've been having trouble with exclusive database locks causing performance problems in our production system for a while. I was finally able to take a closer look, and I'm noticing that the queries holding the exclusive locks are SELECTs generated by Hibernate's lazy loading.
We're using Spring transaction management, with @Transactional(readOnly = true)
defined at the service entry point. We use a session-per-request model with entities mapped to transfer objects. The database's default isolation level is read committed, and the JDBC driver is configured for read committed as well. I've checked the isolation level of the actual transaction in question using:
select current_setting('transaction_isolation')
Which returns read committed. We're using JPA to configure the Hibernate mappings. Nowhere do we explicitly upgrade the transaction, and in this particular transaction we are running SELECT statements only. With Hibernate SQL logging turned on, I do not see any of these:
select ... for update
Only simple select statements are being logged.
It seems like one of two things is going on here: either my understanding of read committed is completely off and read committed SHOULD result in exclusive row-level locks being held for the duration of a transaction that only performs SELECTs, or something else is incorrectly upgrading the locks held by the transaction.
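For what it's worth, I can also check which locks the transaction itself is holding with something like the following (assuming a reasonably recent PostgreSQL; pg_backend_pid() identifies the current connection):

SELECT locktype, relation::regclass AS relation, mode, granted
FROM pg_locks
WHERE pid = pg_backend_pid()
ORDER BY locktype;

My understanding is that under read committed, plain SELECTs should show nothing stronger than AccessShareLock on the tables they read.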
Any help would be appreciated.
Edit 1:
OK, it's been a long, winding path on this one. It turns out this has nothing to do with locking at all. The query I was using to detect locks was outdated, and was showing a lock type of "virtualxid". Some digging tells us virtualxid is the lock every transaction takes out on its own virtual transaction ID, for reasons internal to PostgreSQL that aren't germane to this discussion. We cron'd another monitoring query testing for true exclusive locks (sketched below), and have not seen one yet.
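A sketch of that kind of exclusive-lock monitor looks roughly like this (the filters here are illustrative, not the exact query we cron'd):

SELECT a.datname, a.usename, a.pid, l.locktype, l.mode, l.granted,
       a.query, age(now(), a.query_start) AS "age"
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE l.locktype = 'relation'
  AND l.mode LIKE '%ExclusiveLock'
  AND a.datname <> 'postgres'
ORDER BY a.query_start;

Ordinary writes show up here as RowExclusiveLock, which doesn't block readers; the mode that actually blocks plain SELECTs is AccessExclusiveLock.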
Here is the query we're using to monitor for "virtualxid" locks, which is more like a long-running query monitor at this point:
SELECT pg_stat_activity.datname,
       pg_locks.mode,
       pg_locks.locktype,
       pg_locks.granted,
       pg_stat_activity.usename,
       pg_stat_activity.query,
       age(now(), pg_stat_activity.query_start) AS "age",
       pg_stat_activity.pid
FROM pg_stat_activity, pg_locks
LEFT OUTER JOIN pg_class ON (pg_locks.relation = pg_class.oid)
WHERE age(now(), pg_stat_activity.query_start) > interval '1 minute'
  AND pg_stat_activity.datname <> 'postgres'
  AND pg_locks.pid = pg_stat_activity.pid
  AND pg_stat_activity.query NOT LIKE '%autovacuum%'
  AND pg_stat_activity.query NOT LIKE '%COPY%stdout%'
ORDER BY query_start;
And here's some output we're getting:
<redacted> | ExclusiveLock | virtualxid | t | <redacted> | SELECT current_timestamp | 01:03:51.809594 | 22578
A simple select current_timestamp running for over an hour!!!
Anyway, for those interested: it started to look like these mysterious long-running queries were occasionally draining our database connection pool. We upped the connection pool limits and the live site is back to humming along. We've got app-side timeouts and retry logic in place on the critical processes to handle the occasional hiccup, but these days we've usually got at least one database connection stuck servicing one of these strangely executing queries. Definitely not ideal :(
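A related check worth running is to look for connections sitting in an open transaction without executing anything, since those hold pool connections just as effectively as a genuinely long query (this assumes PostgreSQL 9.2+, where pg_stat_activity exposes the state and xact_start columns):

SELECT pid, usename, state, age(now(), xact_start) AS xact_age, query AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND xact_start < now() - interval '1 minute'
ORDER BY xact_start;

Note that on 9.2+ the query column shows the most recent statement even when the backend is no longer running anything, so a session apparently running select current_timestamp for an hour may really be a transaction that has been sitting idle since that statement finished.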
We're going to try turning on cost-based autovacuum and see if it helps at all.
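For reference, the current cost-based vacuum settings can be inspected with a query like this (these are standard PostgreSQL settings; a value of -1 for the autovacuum_* variants means they fall back to the corresponding vacuum_* setting):

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('autovacuum',
               'autovacuum_vacuum_cost_delay',
               'autovacuum_vacuum_cost_limit',
               'vacuum_cost_delay',
               'vacuum_cost_limit');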
Edit 2:
This one turned out to be a very long journey, which might finally be at its end. In response to this behavior we shored up our batch-processing error reporting, in addition to the database query monitoring described above. Together with some intelligent timeouts, this let us correlate specific application use cases with the long-running database queries, and react to errors seen in production before a particular usage could hang a JVM node.
We were also able to solve the issue of why a long-running, read-only TX in one process would hang up other processes connecting to the same database. This is where things get a little strange. We were using hibernate-memcached to move Hibernate's second-level cache into a memcached server shared by all of the Java processes connecting to the same database. Whenever we got the strange hanging behavior, there would be a ton of memcached client threads in the affected JVM processes.
After removing the hibernate-memcached module and moving back to Ehcache for the second-level cache, the strange, debilitating multi-JVM hanging went away. We still get the occasional email telling us a little more is happening inside a TX than should be, and we still get the occasional single JVM process hanging because it has too many of these long TXs going on at scale. But we no longer see processes in one JVM somehow affecting other JVMs, whereas previously we would see additional nodes go unresponsive until we killed the initial node showing the bad TX behavior.
Which does not make sense. But then this issue never did :)
-- Tim
First, watch the excellent Scaling Hibernate Applications with Postgres talk presented by Jim Mlodgenski and Bruce Momjian at JBoss World 2009; it covers the most common problems with Hibernate and PostgreSQL (caching, replication, connection pooling, etc.). You can find it here:
Then, you can run direct queries in plain SQL if you are having trouble with lazy loading, something along these lines:
String SQL_QUERY = "SELECT insurance_name, id, invested_amount, avg(i...
+ "invested_amount - avg(invested_amount) OVER(PARTI...
+ "FROM insurance ";
Query query = session.createSQLQuery(SQL_QUERY)
.addScalar("insurance_name", Hibernate.STRING)
.addScalar("id", Hibernate.LONG)
.addScalar("invested_amount", Hibernate.LONG)
.addScalar("a", Hibernate.DOUBLE)
.addScalar("diff", Hibernate.DOUBLE);