Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Concurrency logging to sql DB - threads not running parallel

I tested this with 2.14.0 and 2.13.3 I used the JDBC Appender in combination with the DynamicThresholdFilter and tried out a normal Logger and also the AsyncLogger.

In the JDBC Appender i also tried out the PoolingDriver and the ConnectionFactory approach.

It turns out that the Threads are not started parallel because of Log4j2.

Using the AsyncLogger even made it worse since the Output said that the Appender is not started and of 15.000 expected logs only 13.517 are in the DB.

To reproduce the issue i made a github repo see here: https://github.com/stefanwendelmann/Log4j_JDBC_Test

Threads

EDIT

I replaced the mssql-jdbc with a h2db and the threads dont block. JMC auto analysis say that there are locking instances of JdbcDatabaseManager.

Is there any configuration problem in my PoolableConnectionFactory for mssql-jdbc or is there a general problem with dbcp / jdbc driver pooling?

Edit 2

Created Ticket on Apaches LOGJ2 Jira: https://issues.apache.org/jira/browse/LOG4J2-3022

Edit 3

Added a longer flight recording for mssql and h2:file

https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_docker_mssql_asynclogger_10000_runs.jfr

https://github.com/stefanwendelmann/Log4j_JDBC_Test/blob/main/recording_local_h2_file_asynclogger_10000_runs.jfr

like image 397
Stefan Höltker Avatar asked Nov 06 '22 02:11

Stefan Höltker


1 Answers

Thanks for getting the flight recordings up. This is a pretty interesting scenario, but I'm afraid I can't give conclusive answers, mostly because for some reason

  1. The information in your flight recordings is weirdly incomplete. I'll explain a little more shortly
  2. There seems to be other things going on in your system that may be muddying the diagnosis. You might benefit from killing any other running process in your machine.

So, what now/(TL;DR)

  1. You need to be sure that your connection source to the database is pooled
  2. Make sure you start your load test on a calm, clear-headed CPU
  3. Configure your next flight recording to take sufficient, intermittent thread dumps. This is probably the most important next step, if you're interested in figuring out what exactly all these threads are waiting for. Don't post up another flight recording until you're positive it contains multiple thread dumps that feature all the live threads in your JVM.
  4. Maybe 10k threads isn't reasonable for your local machine
  5. I also noticed from the flight recording that you have a heap size maxed at 7GB. If you're not on a 64bit OS, that could actually be harmful. A 32-bit OS can address a max of 4GB.
  6. Make sure there aren't any actual database failures causing the whole thing to thrash. Are you running out of connections? Are there any SQLExceptions blowing up somewhere? Any exceptions at all?

Here's what I could tell from your recordings:

CPU

Both flight recordings show that your CPU was struggling for good chunks of both of your recordings:

The MSSQL recording (46 mins total)

enter image description here

JFR even warns in the MSSQL recording that:

An average CPU load of 42 % was caused by other processes during 1 min 17 s starting at 2/18/21 7:28:58 AM.

The H2 recording (20.3s total)enter image description here

I noticed that your flight recordings are titled XXXX_10000. If this means "10k concurrent requests", it may simply mean that your machine simply can't deal with the load you're putting on it. You may also benefit from first ensuring that your cores don't have a bunch of other things hogging their time before you kick off another test. At any rate, hitting 100% CPU utilization is bound to cause lock contention as a matter of course, due to context switching. Your flight recording shows that you're running on an 8-core machine; but you noted that you're running a dockerized MSSQL. How many cores did you allocate to Docker?

Blocked Threads

There's a tonne of blocking in your setup, and there are smoking guns everywhere. The thread identified by Log4j2-TF-1-AsyncLoggerConfig-1 was blocked a lot by the garbage collector, just as the CPU was thrashing:

The H2 flight recording:

enter image description here

All but the last 3 ticks across that graph were blockings of the log4j2 thread. There was still significant blocking of the other pooled threads by GC (more on that further down)

The MSSQL flight recording had smoother GC, but both flight recordings featured blocking by GC and the consequent super high CPU utilization. One thing was clear from the MSSQL and H2 recording: every other pooled thread was blocked, waiting for a lock on the same object ID

For MSSQL, lock ID: 0x1304EA4F#40; for H2, lock ID: 0x21A1100D7D0

Every thread except the main thread and pool-1-thread-1 (which was blocked by garbage collection) exhibits this behavior.

These 7 threads are all waiting for the same object. There is definitely some blocking or even a deadlock somewhere in your setup.

The small specks of green also corroborate the intermittent transfer of monitor locks between the various threads, confirming that they're sort of deadlocked. The pane that shows the threads at the bottom gives a timeline of each thread's blockage. Red indicates blocked; green indicates running. If you hover over each thread's red portion, it shows you that

  1. The thread is blocked, waiting to acquire a lock (RED)
  2. The ID of the lock that the thread is trying to acquire and is currently unable
  3. The ID of the thread that last held the lock
  4. Green indicates a running, unblocked thread.

When you hover over the red slices in your flight recording, you'll see that they're all waiting to acquire the same lock. That lock is intermittently held between the various pooled threads.

MSSQL (threads blocked waiting for 0x1304EA4F40):

enter image description here

H2 (threads blocked waiting for 0x21A1100D7D0):

enter image description here

In both flight recordings, pool-1-thread-1 is the sole thread that isn't blocked while trying to acquire a lock. That blank row for pool-1-thread-1 is solely due to garbage collection, which I covered earlier.

Dumps

Ideally, your flight recordings should contain a bunch of thread dumps, especially the one that you ran for over 40 mins; never mind the 20s one. Unfortunately, both recordings contain just 2 recordings each; only one of each of them even contains the stacktrace for pool-1-thread-1. Singular thread dumps are worthless. You'll need multiple snapshots over a length of time to make use of them. With a thread dump (or a heap dump), one could identify what objects the IDs 0x1304EA4F40 and 0x21A1100D7D0 refer to. The most I could figure out from the dumps is that they're all waiting for an instance of "Object":

enter image description here

It literally could be anything. Your very first flight recording at least showed that the threads were locked on org.apache.logging.log4j.core.appender.db.jdbc.JdbcDatabaseManager:

enter image description here

That very first recording shows the same pattern in the locks pane, that all the threads were waiting for that single object:

enter image description here

That first recording also shows us what pool-1-thread-1 was up to at that one instant:

enter image description here

From there, I would hazard a guess that that thread was in the middle of closing a database connection? Nothing conclusive can be said until multiple successive thread dumps show the thread activity over a span of time.

like image 94
kolossus Avatar answered Nov 12 '22 11:11

kolossus