
Compressing text before storing it in the database

I need to store a very large amount of text in a MySQL database. There will be millions of records with a LONGTEXT field, and the database will be huge.

So I want to ask: is there a safe way to compress text before storing it in a TEXT field to save space, with the ability to extract it again when needed?

Something like:

$archived_text = compress_text($huge_text);
// saving $archived_text to database here
// ...

// ...
// getting compressed text from database
$archived_text = get_text_from_db();
$huge_text = uncompress_text($archived_text);

Is there a way to do this with PHP or MySQL? All the texts are UTF-8 encoded.

UPDATE

My application is a large literature website where users can add their texts. Here is the table I have:

CREATE TABLE `book_parts` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `book_id` int(11) NOT NULL,
  `title` varchar(200) DEFAULT NULL,
  `content` longtext,
  `order_num` int(11) DEFAULT NULL,
  `views` int(10) unsigned DEFAULT '0',
  `add_date` datetime DEFAULT NULL,
  `is_public` tinyint(3) unsigned NOT NULL DEFAULT '1',
  `published_as_draft` tinyint(3) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `key_order_num` (`order_num`),
  KEY `add_date` (`add_date`),
  KEY `key_book_id` (`book_id`,`is_public`,`order_num`),
  CONSTRAINT FOREIGN KEY (`book_id`) REFERENCES `books` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 

Currently it has about 800k records and weighs 4 GB; 99% of queries are SELECTs. I have every reason to think these numbers will increase dramatically. I would rather not store the texts in files, because there is quite heavy logic around them and my website gets quite a few hits.

Silver Light asked Nov 22 '11 15:11




2 Answers

Are you going to index these texts? How big is the read load on them? The insert load?

You can use InnoDB data compression, which is a transparent and modern way to do this. See docs for more info.

If you have really huge texts (say, each text is above 10 MB), then it is a good idea not to store them in MySQL. Store the texts gzip-compressed in the file system and keep only pointers and metadata in MySQL. You can easily expand your storage in the future and move it to, e.g., a DFS.

Update: another plus of storing texts outside MySQL: the DB stays small and fast. Minus: a high probability of data inconsistency.

Update 2: if you have plenty of programming resources, please take a look at projects like this one: http://code.google.com/p/mysql-filesystem-engine/.
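
A minimal sketch of the file-based variant, assuming PHP with the zlib extension. The helper names, the target directory and the `.txt.gz` file naming are hypothetical; only the returned path (plus metadata) would go into MySQL.

```php
<?php
// Hypothetical helper: gzip one text to disk and return the path to keep in MySQL.
function store_text_file(string $dir, int $partId, string $text): string
{
    $packed = gzencode($text, 9); // gzip format, level 9 = smallest output
    if ($packed === false) {
        throw new RuntimeException('gzip compression failed');
    }
    $path = $dir . '/' . $partId . '.txt.gz';
    if (file_put_contents($path, $packed) === false) {
        throw new RuntimeException("cannot write $path");
    }
    return $path;
}

// Hypothetical helper: read the file back and return the original text.
function load_text_file(string $path): string
{
    $raw = file_get_contents($path);
    $text = ($raw === false) ? false : gzdecode($raw);
    if ($text === false) {
        throw new RuntimeException("cannot read or decode $path");
    }
    return $text;
}

// Round trip in the system temp directory, used here only for illustration.
$original = str_repeat('Chapter one. It was a dark and stormy night. ', 200);
$path = store_text_file(sys_get_temp_dir(), 42, $original);
$loaded = load_text_file($path);
unlink($path); // clean up the demo file
```

The inconsistency risk mentioned above shows up here: the file write and the database row are not covered by one transaction, so the application has to handle a failure between the two.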

Final Update: according to your info, you can just use InnoDB compression. It uses zlib, the same kind of compression as ZIP. You can start with these params:

CREATE TABLE book_parts
 (...) 
 ENGINE=InnoDB
 ROW_FORMAT=COMPRESSED 
 KEY_BLOCK_SIZE=8;

Later you will need to tune KEY_BLOCK_SIZE. Compare the compress_ops_ok and compress_ops columns of INFORMATION_SCHEMA.INNODB_CMP; the ratio of compress_ops_ok to compress_ops should be close to 1.0: Docs.

Oroboros102 answered Sep 28 '22 07:09



If you're compressing (e.g. with gzip), then don't use TEXT fields of any sort. They're not binary-safe. Data going into or coming out of text fields is subject to character set translation, which will probably (though not necessarily) mangle the compressed data and give you a corrupted result when you retrieve and uncompress the text.

Use BLOB fields instead, which are binary-transparent and do not do any translation of the data.
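
A minimal sketch of the PHP side, assuming the zlib extension is available. `compress_text()` and `uncompress_text()` are the hypothetical names from the question, implemented here with `gzcompress()` and `gzuncompress()`; the database read and write are left out.

```php
<?php
// Hypothetical helpers named after the placeholders in the question.
function compress_text(string $text): string
{
    // gzcompress() returns binary zlib data; level 9 trades CPU time for size.
    $packed = gzcompress($text, 9);
    if ($packed === false) {
        throw new RuntimeException('compression failed');
    }
    return $packed;
}

function uncompress_text(string $packed): string
{
    // gzuncompress() returns false if the bytes are not valid zlib data,
    // e.g. after a character set conversion has altered them.
    $text = gzuncompress($packed);
    if ($text === false) {
        throw new RuntimeException('stored data is not valid zlib data');
    }
    return $text;
}

// UTF-8 sample text; the compressed result is what would go into a BLOB column.
$huge_text = str_repeat('Привет, мир! Hello, world! ', 1000);
$archived_text = compress_text($huge_text);
$restored_text = uncompress_text($archived_text);
```

The compressed string is arbitrary binary, so it only survives the round trip if the column and the connection pass the bytes through untouched, which is what a BLOB column is for.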

Marc B answered Sep 28 '22 07:09
