
Using Docker, what triggered PANIC: could not locate a valid checkpoint record

I am trying to understand Docker a little better, and in doing so, it appears I corrupted my PostgreSQL DB for my application.

I am using Docker Swarm to start my application and I'm getting the following error in a loop in the PostgreSQL Container:

    2021-02-10 15:38:51.304 UTC 120 LOG:  database system was shut down at 2021-02-10 14:49:14 UTC
    2021-02-10 15:38:51.304 UTC 120 LOG:  invalid primary checkpoint record
    2021-02-10 15:38:51.304 UTC 120 LOG:  invalid secondary checkpoint record
    2021-02-10 15:38:51.304 UTC 120 PANIC:  could not locate a valid checkpoint record
    2021-02-10 15:38:51.447 UTC 1 LOG:  startup process (PID 120) was terminated by signal 6
    2021-02-10 15:38:51.447 UTC 1 LOG:  aborting startup due to startup process failure
    2021-02-10 15:38:51.455 UTC 1 LOG:  database system is shut down

Initially, I was trying to modify the container's pg_hba.conf file by editing it directly in the volume's directory on the host filesystem, which is located at

 /var/lib/docker/volumes/postgres96-data-volume/_data

However, every time I restarted the container, my changes to pg_hba.conf were reverted. So this morning I added a dummy file called test to the mount folder and restarted the container. I expected the file to be deleted, which would confirm that restarting the container automatically restores everything in that mount to its original state. After that restart, I started getting those error messages, which prevent my application from starting.

I deleted the test file and restarted the container again, but the error message continues.

I have read many solutions on how to fix it, but my question is more about understanding why adding a file would cause this. Is my volume corrupted simply because I added a file to it?

Thanks

Awsmike asked Feb 10 '21 15:02


2 Answers

WARNING

For anyone about to apply the solution in the accepted answer, here is your WARNING:

The accepted answer asks you to remove the Docker volume, which means that all the data in the PostgreSQL instance will be lost!

Refer to my answer here if you wish to preserve the data of the database instance.

Context in which I faced the same error

I am also using Docker Swarm to deploy containers, and I recently ran into this issue when I scaled the Postgres service to 2 replicas, both pointing to the same physical volume (mounted through Docker, shared over NFS). I did this so that both replicas would see the same data, but it led me to the same error you have:

PANIC: could not locate a valid checkpoint record

My findings

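Firstly, the data in the database volume is not necessarily corrupted. The message means that PostgreSQL could not read a valid checkpoint record from the write-ahead log (WAL) at the location recorded in its control file, so the WAL is damaged or no longer agrees with the rest of the data directory. After a lot of digging, I found two scenarios in which this error may occur: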

  1. The database was in the middle of a transaction when it shut down unexpectedly. Normally the WAL tells the database what it was doing at that moment, so it can recover on the next start. However, if the shutdown interrupted a WAL write, the WAL may contain records that are incomplete or invalid. The WAL is then inconsistent with the data files, and the checkpoint record cannot be validated.

  2. You create multiple replicas of the db which point to the same volume. Consider the case of 2 replicas that I faced. PostgreSQL does not support two servers running on the same data directory: when both replicas write to it, each overwrites the other's WAL and checkpoint information, and neither can find a checkpoint record that matches what it expects. This can also happen if two containers (not necessarily replicas) point to the same mount path for PGDATA.
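To check for the second scenario, one option is to list the containers that mount the volume. The following is a minimal sketch, not a definitive check: the helper name `shared_volume_check` is made up for illustration, the volume name is the one from the question, and `docker ps` only sees containers on the local node.

```shell
#!/bin/sh
# Sketch only: warn when more than one container mounts the same named volume.
# Reads container names, one per line, on stdin.
shared_volume_check() {
  count=$(grep -c . || true)   # count non-empty lines; grep exits 1 on zero matches
  if [ "$count" -gt 1 ]; then
    echo "WARNING: $count containers mount this volume"
  else
    echo "ok: $count container(s) mount this volume"
  fi
}

# Volume name taken from the question; adjust to your own.
VOLUME="postgres96-data-volume"

# `docker ps` only lists containers on the local node, so on a multi-node
# swarm this has to be repeated on every node that can reach the storage.
if command -v docker >/dev/null 2>&1; then
  docker ps --filter "volume=$VOLUME" --format '{{.Names}}' | shared_volume_check
fi
```

If more than one running container shows up, scaling the service back to a single replica before attempting any repair seems the safer order.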

In both cases the db fails to start. Because the db process is what keeps the container running, the error stops the container as well, and the error repeats each time the container is restarted.

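You may reset the WAL to fix this issue, using pg_resetwal (called pg_resetxlog before PostgreSQL 10). When the WAL is reset, you lose any changes that were recorded only in the WAL and had not yet been written to the data files. Data that was already written and transactions that were already fully processed are generally preserved, but the database may still contain inconsistent data from partially committed transactions, so it is wise to dump and restore it once it starts again.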
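As a rough sketch of what the reset could look like with Docker, the script below only prints a command for review rather than running it. It assumes the volume name from the question, the official `postgres:9.6` image, and that image's default data path `/var/lib/postgresql/data`; the helper `reset_tool_for` is made up for illustration. All containers using the volume should be stopped, and the volume backed up, before the printed command is run.

```shell
#!/bin/sh
# Sketch only: choose the WAL reset utility from the cluster's major version.
# PostgreSQL 9.x ships pg_resetxlog; version 10 renamed it to pg_resetwal.
reset_tool_for() {
  case "$1" in
    9.*) echo "pg_resetxlog" ;;
    *)   echo "pg_resetwal" ;;
  esac
}

# Assumed values: adjust to your own volume, image tag, and data path.
VOLUME="postgres96-data-volume"
IMAGE="postgres:9.6"
PGDATA_DIR="/var/lib/postgresql/data"
TOOL="$(reset_tool_for 9.6)"

# Print the command instead of executing it, so it can be reviewed first.
# It runs the reset tool as the postgres user in a throwaway container.
echo "docker run --rm -it -u postgres -v $VOLUME:$PGDATA_DIR $IMAGE $TOOL $PGDATA_DIR"
```

The tool refuses to run if it finds a server lock file in the data directory, and it may ask for the `-f` (force) option when it cannot read valid control data. Both are worth treating as a signal to stop and double-check before proceeding.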

palc answered Nov 12 '22 00:11


This error means the data in the Postgres volume is corrupted. This can happen when two containers try to use the same volume at the same time. See this answer for slightly more info. I'm not sure how adding a file corrupted the volume. You'll need to delete and recreate the volume, though; note that deleting the volume discards all data stored in it. To do this you can:

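$ docker stop <your_container_name> # stops a running container
$ docker rm <your_container_name> # removes the stopped container so the volume is no longer in use
$ docker image prune # optional: removes dangling images (add -a to remove all unused images)
$ docker volume ls # lists the existing volumes
$ docker volume rm <volume_name> # removes the corrupted volume and all data in it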

I had to run the above code to stop a container, clean images that somehow weren't attached to any containers and then finally delete the offending volume where corrupted data was held.
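Since removing the volume discards its contents, it may be worth archiving it first. The sketch below prints a backup command for review instead of running it. The helper name `backup_cmd`, the archive file name, and the use of the `alpine` image are illustrative assumptions, not part of the original answer.

```shell
#!/bin/sh
# Sketch only: build a command that archives a named volume into the
# current directory using a throwaway container.
backup_cmd() {
  volume="$1"    # volume name as shown by `docker volume ls`
  archive="$2"   # hypothetical archive file name
  # The volume is mounted read-only at /data; the archive is written to /backup.
  echo "docker run --rm -v $volume:/data:ro -v $(pwd):/backup alpine tar czf /backup/$archive -C /data ."
}

# Example with the volume name from the question.
backup_cmd postgres96-data-volume pgdata-backup.tar.gz
```

Run the printed command while the Postgres container is stopped, so the files are not changing while they are being archived.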

Connor Leech answered Nov 12 '22 02:11