I tested this with 2.14.0 and 2.13.3. I used the JDBC Appender in combination with the DynamicThresholdFilter and tried both a normal Logger and the AsyncLogger.
For the JDBC Appender I also tried both the PoolingDriver and the ConnectionFactory approach.
It turns out that the threads do not run in parallel because of Log4j2.
Using the AsyncLogger made it even worse: the output said that the appender is not started, and of 15,000 expected logs only 13,517 are in the DB.
To reproduce the issue I made a GitHub repo, see here: https://github.com/stefanwendelmann/Log4j_JDBC_Test
EDIT
I replaced the mssql-jdbc driver with an H2 database and the threads don't block. JMC's automatic analysis says that there are locking instances of JdbcDatabaseManager.
Is there a configuration problem in my PoolableConnectionFactory for mssql-jdbc, or is there a general problem with DBCP / JDBC driver pooling?
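For context, here is a minimal sketch of what the ConnectionFactory approach with DBCP2 pooling against mssql-jdbc could look like; the class name, method name, JDBC URL and credentials below are illustrative placeholders, not the actual code from the repo linked above:

```java
// Hypothetical pooled connection factory for the Log4j2 JDBC Appender
// (names and connection details are placeholders, not the repo's code).
package com.example.logging;

import java.sql.Connection;
import java.sql.SQLException;

import org.apache.commons.dbcp2.ConnectionFactory;
import org.apache.commons.dbcp2.DriverManagerConnectionFactory;
import org.apache.commons.dbcp2.PoolableConnection;
import org.apache.commons.dbcp2.PoolableConnectionFactory;
import org.apache.commons.dbcp2.PoolingDataSource;
import org.apache.commons.pool2.impl.GenericObjectPool;

public final class LoggingConnectionFactory {

    private static final PoolingDataSource<PoolableConnection> DATA_SOURCE = createDataSource();

    private LoggingConnectionFactory() {
    }

    private static PoolingDataSource<PoolableConnection> createDataSource() {
        // Plain DriverManager-backed factory for the SQL Server JDBC driver.
        ConnectionFactory connectionFactory = new DriverManagerConnectionFactory(
                "jdbc:sqlserver://localhost:1433;databaseName=logdb", "user", "password");

        // Wraps raw connections so they can be pooled and validated.
        PoolableConnectionFactory poolableConnectionFactory =
                new PoolableConnectionFactory(connectionFactory, null);
        poolableConnectionFactory.setValidationQuery("SELECT 1");

        GenericObjectPool<PoolableConnection> pool =
                new GenericObjectPool<>(poolableConnectionFactory);
        pool.setMaxTotal(10);
        poolableConnectionFactory.setPool(pool);

        return new PoolingDataSource<>(pool);
    }

    // Referenced from log4j2.xml as:
    // <ConnectionFactory class="com.example.logging.LoggingConnectionFactory" method="getConnection"/>
    public static Connection getConnection() throws SQLException {
        return DATA_SOURCE.getConnection();
    }
}
```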
Edit 2
Created a ticket on Apache's LOG4J2 Jira: https://issues.apache.org/jira/browse/LOG4J2-3022
Edit 3
Added longer flight recordings for MSSQL and H2 (file mode):
https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_docker_mssql_asynclogger_10000_runs.jfr
https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_h2_file_asynclogger_10000_runs.jfr
Thanks for getting the flight recordings up. This is a pretty interesting scenario, but I'm afraid I can't give conclusive answers, mostly because for some reason your recordings don't contain the thread dumps they should (more on that below).
Here's what I could tell from your recordings:
Both flight recordings show that your CPU was struggling for good chunks of the run.
JFR even warns in the MSSQL recording that:
An average CPU load of 42 % was caused by other processes during 1 min 17 s starting at 2/18/21 7:28:58 AM.
I noticed that your flight recordings are titled XXXX_10000. If this means "10k concurrent requests", it may simply mean that your machine can't deal with the load you're putting on it. You may also benefit from ensuring that your cores don't have a bunch of other things hogging their time before you kick off another test. At any rate, hitting 100% CPU utilization is bound to cause lock contention as a matter of course, due to context switching. Your flight recording shows that you're running on an 8-core machine, but you noted that you're running a dockerized MSSQL. How many cores did you allocate to Docker?
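As a quick sanity check (an assumption on my part, since the recordings don't show it), it can also help to confirm how many CPUs the load-generating JVM actually sees; on a container-aware JDK (10+), availableProcessors() reflects cgroup limits if the test itself runs inside a container:

```java
// Prints the CPU budget the JVM sees. On a container-aware JDK this reflects
// cgroup CPU limits (e.g. docker run --cpus=2) rather than the host's 8 cores.
public class CpuCheck {
    public static void main(String[] args) {
        System.out.println("Available processors: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```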
There's a tonne of blocking in your setup, and there are smoking guns everywhere. The thread identified by Log4j2-TF-1-AsyncLoggerConfig-1 was blocked a lot by the garbage collector, just as the CPU was thrashing. All but the last 3 ticks across that graph were blockings of the log4j2 thread; there was still significant blocking of the other pooled threads by GC (more on that further down).
The MSSQL flight recording had smoother GC, but both flight recordings featured blocking by GC and the consequent super high CPU utilization. One thing was clear from both the MSSQL and H2 recordings: every other pooled thread was blocked waiting for a lock on the same object ID (for MSSQL, lock ID 0x1304EA4F40; for H2, lock ID 0x21A1100D7D0). Every thread except the main thread and pool-1-thread-1 (which was blocked by garbage collection) exhibits this behavior.
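To make that picture concrete, here is a minimal, self-contained illustration (not Log4j2's actual code) of the pattern the recordings show: many pooled threads funnelling through one shared monitor, so that at any instant one thread runs and the rest park on the same lock ID:

```java
// Minimal illustration (not Log4j2 source) of the contention pattern in the
// recordings: pooled threads all synchronize on one shared manager object,
// so one thread runs at a time and the rest block on the same lock.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SharedMonitorContention {

    // Stand-in for the single shared manager instance (e.g. a database manager).
    private static final Object SHARED_MANAGER = new Object();

    private static void writeLogEvent() {
        synchronized (SHARED_MANAGER) {
            try {
                // Simulate a slow, synchronized write (e.g. a JDBC batch insert).
                Thread.sleep(5);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (int i = 0; i < 10_000; i++) {
            pool.submit(SharedMonitorContention::writeLogEvent);
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.MINUTES);
        // A JFR recording of this program shows the same picture: every
        // pool-N-thread blocked on the same object ID, one at a time running.
    }
}
```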
The thread pane at the bottom gives a timeline of each thread's blockage: red indicates blocked, green indicates running. The small specks of green corroborate the intermittent hand-off of the monitor lock between the various threads, confirming that they're effectively serialized on it. If you hover over the red slices, you'll see that they're all waiting to acquire the same lock, and that that lock is intermittently held by the various pooled threads.
In both flight recordings, pool-1-thread-1 is the sole thread that isn't blocked while trying to acquire a lock; that blank row for pool-1-thread-1 is solely due to garbage collection, which I covered earlier.
Ideally, your flight recordings should contain a bunch of thread dumps, especially the one that you ran for over 40 minutes (never mind the 20-second one). Unfortunately, both recordings contain just 2 thread dumps each, and only one of them even contains the stack trace for pool-1-thread-1. Singular thread dumps are worthless; you need multiple snapshots over a length of time to make use of them. With a thread dump (or a heap dump), one could identify what objects the IDs 0x1304EA4F40 and 0x21A1100D7D0 refer to. The most I could figure out from the dumps is that they're all waiting for an instance of "Object".
It could literally be anything. Your very first flight recording at least showed that the threads were locked on org.apache.logging.log4j.core.appender.db.jdbc.JdbcDatabaseManager. That very first recording shows the same pattern in the locks pane: all the threads were waiting for that single object. It also shows what pool-1-thread-1 was up to at that one instant.
From there, I would hazard a guess that the thread was in the middle of closing a database connection. Nothing conclusive can be said until multiple successive thread dumps show the thread activity over a span of time.
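One way to get such a series of snapshots, assuming you start the recording yourself rather than through JMC, is to enable the built-in jdk.ThreadDump event with a short period so that dumps accumulate throughout the run; the class below and its runLoadTest() placeholder are only a sketch:

```java
// Sketch: start a JFR recording programmatically and capture a thread dump
// every 10 seconds, so the resulting .jfr file holds many snapshots instead
// of one or two.
import java.io.IOException;
import java.nio.file.Path;
import java.text.ParseException;
import java.time.Duration;

import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class PeriodicThreadDumpRecording {
    public static void main(String[] args) throws IOException, ParseException, InterruptedException {
        // Start from the bundled "profile" settings and tighten the thread dump period.
        Configuration profile = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(profile)) {
            recording.enable("jdk.ThreadDump").withPeriod(Duration.ofSeconds(10));
            recording.start();

            runLoadTest(); // placeholder for the actual logging load test

            recording.stop();
            recording.dump(Path.of("load-test-with-thread-dumps.jfr"));
        }
    }

    private static void runLoadTest() throws InterruptedException {
        Thread.sleep(Duration.ofMinutes(1).toMillis()); // stand-in for real work
    }
}
```

Repeatedly running jcmd <pid> Thread.print against the test JVM while the load is active would give a comparable series of snapshots outside of JFR.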