On production our database is a few hundred gigabytes in size. For development and testing, we need to create snapshots of this database that are functionally equivalent, but which are only 10 or 20 gigs in size.
The challenge is that the data for our business entities are scattered across many tables. We want to create some sort of filtered snapshot so that only some of the entities are included in the dump. That way we can get fresh snapshots every month or so for dev and testing.
For example, let's say we have Companies, Divisions, Employees, and Attendance records linked by many-to-many relationships. There are maybe 1,000 companies, 2,500 divisions, 175,000 employees, and tens of millions of attendance records. We want a replicable way to pull, say, the first 100 companies together with all of their constituent divisions, employees, and attendance records.
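For concreteness, here is a minimal sketch of what such a schema might look like; the table and column names are assumptions for illustration, not our actual production schema:

-- Hypothetical schema, for illustration only.
CREATE TABLE companies  (id bigint PRIMARY KEY, name text);
CREATE TABLE divisions  (id bigint PRIMARY KEY, name text);
CREATE TABLE employees  (id bigint PRIMARY KEY, name text);
CREATE TABLE attendance (id bigint PRIMARY KEY, employee_id bigint REFERENCES employees, recorded_at timestamptz);

-- Join tables carrying the many-to-many relationships.
CREATE TABLE company_divisions  (company_id bigint REFERENCES companies, division_id bigint REFERENCES divisions);
CREATE TABLE division_employees (division_id bigint REFERENCES divisions, employee_id bigint REFERENCES employees);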
We currently use pg_dump for the schema, and then run pg_dump with --disable-triggers and --data-only to get all the data out of the smaller tables (roughly as sketched below). We don't want to have to write custom scripts to pull out part of the data, because our development cycle is fast and we're concerned that such scripts would be fragile and quickly become outdated.
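Something like this (database and table names are placeholders):

pg_dump --schema-only proddb > schema.sql
pg_dump --data-only --disable-triggers --table=companies --table=divisions proddb > small_tables.sql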
How can we do this? Are there third-party tools that can help pull out logical partitions from the database? What are these tools called?
Any general advice also appreciated!
To back up a specific table, use the --table TABLENAME option with pg_dump. If the same table name exists in more than one schema, schema-qualify it (e.g. --table SCHEMANAME.TABLENAME). Here is an example of backing up a specific table from a Postgres database.
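The database and table names below are placeholders:

pg_dump --table=public.companies proddb > companies_table.sql

The plain-SQL output can then be restored into another database with psql.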
pg_dump is a utility for backing up a PostgreSQL database. It makes consistent backups even if the database is being used concurrently. pg_dump does not block other users accessing the database (readers or writers).
On your larger tables you can use the COPY command to pull out subsets...
COPY (SELECT * FROM mytable WHERE ...) TO '/tmp/myfile.tsv';
COPY mytable FROM '/tmp/myfile.tsv';
https://www.postgresql.org/docs/current/static/sql-copy.html
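For example, to pull only the attendance rows that belong to the first 100 companies, you could drive the COPY query off the selected company ids. A sketch, assuming a schema like the hypothetical one above (table and column names will differ in your database):

-- Export attendance rows for the first 100 companies.
-- DISTINCT guards against duplicates introduced by the many-to-many joins.
COPY (
    SELECT DISTINCT a.*
    FROM attendance a
    JOIN division_employees de ON de.employee_id = a.employee_id
    JOIN company_divisions  cd ON cd.division_id  = de.division_id
    WHERE cd.company_id IN (SELECT id FROM companies ORDER BY id LIMIT 100)
) TO '/tmp/attendance_subset.tsv';

-- Then, on the dev database:
COPY attendance FROM '/tmp/attendance_subset.tsv';

Note that COPY ... TO '/path' writes a file on the database server and needs the corresponding server-side privileges; from a client you can run the same query with psql's \copy instead.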
You should consider maintaining a curated set of development data rather than just pulling a subset of production. If you're writing unit tests, you could maintain exactly the data those tests require, aiming to cover all of the relevant use cases.