Django: TextField (string) data compression on database level or code level

I created my Django models, and after inserting a test/dummy record into my PostgreSQL database I realized that each record is quite large: the sum of the data across all fields will be around 700 KB per record. I estimate I will have around five million records, so the total will reach roughly 3,350 GB (700 KB × 5,000,000 ≈ 3.3 TB). Most of my data is big JSON dumps (around 70+ KB per field).

I am unsure whether PostgreSQL will automatically compress my data when it is written through the Django framework, so I am wondering whether I should compress my data before inserting it into the database.

Questions: Does PostgreSQL automatically compress string fields with some compression algorithm when using the Django model field type TextField?

Should I not rely on PostgreSQL and instead compress my data beforehand, then insert it into the DB? If so, which compression library should I use? I have already tried zlib in Python and it seems great, but I have read that there is a gzip library as well, and I am confused about which would be most effective (in terms of compression and decompression speed as well as compression ratio).
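From what I can tell, Python's gzip module is layered on top of zlib: both use the same DEFLATE algorithm, so the ratios should be nearly identical, with gzip just adding a small header and checksum. Here is a quick comparison sketch I put together (the payload is a dummy stand-in for one of my JSON dumps):

```python
import gzip
import json
import time
import zlib

# Dummy stand-in for one of the big JSON dumps.
payload = json.dumps({"values": list(range(10000))}).encode("utf-8")

for name, compress in [("zlib", lambda data: zlib.compress(data, 6)),
                       ("gzip", lambda data: gzip.compress(data, 6))]:
    start = time.perf_counter()
    compressed = compress(payload)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(payload)} -> {len(compressed)} bytes in {elapsed:.4f}s")
```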

EDIT: I was reading this Django snippet for a CompressedTextField, which sparked my confusion about which compression library to use. I saw a few people use zlib while others used gzip.
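For reference, here is a minimal sketch of the idea behind such a field as I understand it, assuming zlib and a bytea column (my simplified reconstruction, not the snippet itself; the Record model and payload field are purely illustrative):

```python
import zlib

from django.db import models


class CompressedTextField(models.BinaryField):
    """Transparently stores text zlib-compressed in a bytea column."""

    def get_prep_value(self, value):
        # Compress on the way into the database.
        if value is None:
            return value
        return zlib.compress(value.encode("utf-8"))

    def from_db_value(self, value, expression, connection):
        # Decompress on the way out of the database.
        if value is None:
            return value
        return zlib.decompress(bytes(value)).decode("utf-8")


class Record(models.Model):  # illustrative model
    payload = CompressedTextField()
```

One obvious trade-off: with a field like this, the database can no longer filter, index, or inspect the contents; everything has to be decompressed in Python first.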

EDIT 2: This Stack Overflow question says that PostgreSQL compresses string data automatically.

EDIT 3: PostgreSQL uses pg_lzcompress.c for compression, which implements an algorithm from the LZ family. Is it safe to assume that we don't need to apply some other form of compression (zlib or gzip) to the TextField ourselves, since it will be stored as the variable-length text datatype in the DB anyway?

asked Nov 01 '22 by user1757703

1 Answer

Yes, PostgreSQL will compress large text fields, completely independently of any framework you are using it with.

Large field values are stored using a mechanism called TOAST. Such attributes may be compressed, and if they are too large to fit inline in the row, they are stored out of line in separate tables known as TOAST tables.
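If you want to see how much of a table actually lives in its TOAST table, you can query the system catalogs. A small sketch using Django's database connection (the table name myapp_record is hypothetical):

```python
from django.db import connection


def toast_size(table_name):
    """Return (main_bytes, toast_bytes) for the given table."""
    with connection.cursor() as cursor:
        cursor.execute(
            """
            SELECT pg_relation_size(c.oid),
                   COALESCE(pg_relation_size(NULLIF(c.reltoastrelid, 0)), 0)
            FROM pg_class c
            WHERE c.relname = %s
            """,
            [table_name],
        )
        return cursor.fetchone()


print(toast_size("myapp_record"))  # hypothetical Django table name
```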

As you have already identified, LZ compression is used. It does not give as high a compression ratio as some other algorithms, but given the modest extra gain you might get, I doubt it would be worthwhile to compress the data in your application before sending it to the database if disk space is your main concern.

You can influence the storage of attributes by setting the storage mode for the column. See SET STORAGE on the manual page for ALTER TABLE.

There are four storage modes:

- PLAIN: inline, uncompressed; must be used for fixed-length values such as integer.
- MAIN: inline, compressed.
- EXTERNAL: out of line, uncompressed.
- EXTENDED: out of line, compressed; the default for most data types that support non-PLAIN storage.

The default for TEXT is EXTENDED.
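If you wanted to change the storage mode from within a Django project, a RunSQL migration is one way to do it. A sketch with hypothetical app, table, and column names:

```python
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [("myapp", "0001_initial")]  # hypothetical

    operations = [
        migrations.RunSQL(
            # MAIN keeps the column compressed but prefers inline storage;
            # EXTENDED (the default for text) also allows out-of-line storage.
            "ALTER TABLE myapp_record ALTER COLUMN payload SET STORAGE MAIN;",
            reverse_sql="ALTER TABLE myapp_record "
                        "ALTER COLUMN payload SET STORAGE EXTENDED;",
        ),
    ]
```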

You should give some thought to how your data will be used, though. What types of queries will access it? What filtering criteria will be used? If PostgreSQL has to read through all these large TOAST attributes to evaluate values used in WHERE clauses, performance is likely to be poor.

answered Nov 15 '22 by harmic