
Piecemeal bzcompression for large files in PHP

Creating bzip2-compressed data in PHP is very easy thanks to bzcompress. In my present application, however, I cannot reasonably read the whole input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data yield the same result as compressing the whole file in one fell swoop, by which I mean something along the lines of

$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);

I tried out a piecemeal bzcompression using the routines shown below

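function makeBZFile($infile,$outfile)
{
 // Compress $infile into $outfile, feeding bzip2 at most 10 KB per call.
 $fp = fopen($infile,'rb');
 $bz = bzopen($outfile,'w');
 if ($fp === false || $bz === false) return false;
 while (!feof($fp))
 {
  $bytes = fread($fp,10240);
  if ($bytes === false) break;   // read error
  bzwrite($bz,$bytes);
 }
 bzclose($bz);                   // flushes the last, partly filled block
 fclose($fp);
 return true;
}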

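function unmakeBZFile($infile,$outfile)
{
 // Decompress $infile into $outfile; bzread returns UNCOMPRESSED bytes.
 $bz = bzopen($infile,'r');
 $fp = fopen($outfile,'wb');     // truncate instead of appending to an old file
 if ($bz === false || $fp === false) return false;
 while (!feof($bz))
 {
  $str = bzread($bz,10240);
  if ($str === false) break;     // read error
  fwrite($fp,$str);
 }
 fclose($fp);
 bzclose($bz);
 return true;
}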

set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd'); 

To test this code I did two things

  • I used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database - which is what I need to do eventually.
  • I created a 50 MB file filled with random data: dd if=/dev/urandom of=/tmp/test.rnd bs=50M count=1

In both cases I performed a diff original.file decompressed.file and found that the two were identical.

All very nice, but it is not clear to me why this is working. The PHP docs state that bzread(bzpointer,length) reads a maximum of length bytes of UNCOMPRESSED data. If my code above is working, it is presumably because I am forcing the bzwrite and bzread size to 10240 bytes.

What I cannot see is just how bzread knows how to fetch length bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see that there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.

I suspect there is a gap in my understanding of how this works, or else the fact that my code above appears to perform a correct piecemeal compression is purely accidental.

I'd much appreciate a few explanations here.

DroidOS asked Dec 10 '15 09:12



1 Answer

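To understand how decompression knows how many bytes to hand back, you first have to understand the compression, so here is some background on the algorithm. The short answer is that the chunk size you pass to bzwrite or bzread is never stored in the file: the library buffers your writes until it has a full block to compress, and when reading it decompresses a block and hands the result out in pieces of at most length bytes.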
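The question's worry about chunk sizes can be checked directly. The following is only a minimal sketch, assuming the bz2 extension is loaded; bzRoundTrip is a hypothetical helper written for this illustration, and the sample data and chunk sizes are arbitrary. It writes with one chunk size, reads back with a different one, and compares the result with the original.

```php
<?php
// Hypothetical helper: compress $data through bzwrite in pieces of
// $writeChunk bytes, then read it back in pieces of $readChunk bytes.
function bzRoundTrip(string $data, int $writeChunk, int $readChunk): string
{
    $tmp = tempnam(sys_get_temp_dir(), 'bz');

    $bz = bzopen($tmp, 'w');
    foreach (str_split($data, $writeChunk) as $piece) {
        bzwrite($bz, $piece);            // buffered by the library, not stored as a unit
    }
    bzclose($bz);                        // flushes the final block

    $out = '';
    $bz = bzopen($tmp, 'r');
    while (!feof($bz)) {
        $chunk = bzread($bz, $readChunk); // at most $readChunk uncompressed bytes
        if ($chunk === false) {
            break;                        // read error
        }
        $out .= $chunk;
    }
    bzclose($bz);
    unlink($tmp);

    return $out;
}

// Arbitrary sample data for the sketch.
$data = str_repeat('piecemeal bzip2 test ', 20000);
var_dump(bzRoundTrip($data, 10240, 4096) === $data);
var_dump(bzRoundTrip($data, 7, 100) === $data);
```

If the write and read sizes had to match, the second call (7-byte writes, 100-byte reads) would be the one to fail.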

BZIP2

The core of bzip2 is the Burrows-Wheeler transform (BWT), which rearranges the original data into a form better suited to the coding stages that follow. The current version finishes with Huffman coding. The compression algorithm processes the data in blocks, each completely independent of the others. The block size can be set with a level from 1 to 9, corresponding to 100,000 to 900,000 bytes.
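In PHP the block size is exposed as the second argument of bzcompress. The sketch below, which assumes the bz2 extension is loaded and uses sample data invented for the illustration, compresses the same input at the smallest and largest block size and confirms that both decompress to the original bytes.

```php
<?php
// Sample data invented for this sketch: about 1.7 MB, so it spans
// several blocks at block size 1 (100,000 bytes per block).
$data = '';
for ($i = 0; $i < 40000; $i++) {
    $data .= "row $i: " . md5((string) $i) . "\n";
}

$results = [];
foreach ([1, 9] as $blockSize) {
    // Second argument: block size in units of 100,000 bytes (1 to 9).
    $packed = bzcompress($data, $blockSize);
    if (!is_string($packed)) {
        // bzcompress returns an integer error code on failure.
        throw new RuntimeException("bzcompress failed with error code $packed");
    }
    $results[$blockSize] = [
        'compressed_bytes' => strlen($packed),
        'header'           => substr($packed, 0, 4),
        'round_trip_ok'    => bzdecompress($packed) === $data,
    ];
}
print_r($results);
```

The compressed sizes will differ between the two settings, but the decompressor needs no hint about which one was used because the level is recorded in the stream header.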

BZIP2 Data Structure

The compressed data starts with the two characters 'BZ', followed by one byte identifying the algorithm used ('h' for Huffman coding). The block size follows immediately as a single digit from 1 to 9 and is valid for the entire file, so the header reads BZh1, BZh2, BZh3 and so on up to BZh9. The digit gives the block size in units of 100,000 bytes (100,000 - 900,000 bytes).

The actual data are stored in blocks of the selected size, and each block is protected individually by a CRC32 checksum. In addition, a 48-bit identifier introduces each block. This block structure allows partial reconstruction of damaged files.
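The header layout and the block identifier can be inspected from PHP. This is a rough sketch, assuming the bz2 extension is loaded; describeBzip2Header is a hypothetical helper, and it only looks at the first block, which begins right after the 4-byte header in a stream produced from non-empty input.

```php
<?php
// Hypothetical helper: pick apart the first 10 bytes of a bzip2 stream.
function describeBzip2Header(string $packed): array
{
    if (strlen($packed) < 10 || substr($packed, 0, 2) !== 'BZ') {
        throw new InvalidArgumentException('not a bzip2 stream');
    }
    return [
        'algorithm'   => $packed[2],                      // 'h' = Huffman coding
        'level'       => (int) $packed[3],                // digit 1 to 9
        'block_bytes' => (int) $packed[3] * 100000,       // block size in bytes
        'first_magic' => bin2hex(substr($packed, 4, 6)),  // 48-bit block identifier
    ];
}

$info = describeBzip2Header(bzcompress(str_repeat('abc', 1000), 5));
print_r($info);
```

Nothing in these fields records how large the individual bzwrite calls were; only the block size setting and the per-block markers are stored.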

GZIP/BZIP

Gzip and bzip2 are functionally equivalent. One advantage of gzip is that it can compress a stream, that is, a sequence in which you cannot look back at earlier data. This makes it the standard compressor for HTTP streams. Both of its formats are published documents: the DEFLATE Compressed Data Format Specification (RFC 1951) and the GZIP File Format Specification (RFC 1952).
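For comparison, stream-style gzip compression can be sketched in PHP with the incremental zlib functions. This assumes PHP 7 or later with the zlib extension; gzipChunks is a hypothetical helper and the sample data is made up.

```php
<?php
// Hypothetical helper: gzip a sequence of chunks as they arrive,
// without ever holding the whole input in one string.
function gzipChunks(iterable $chunks): string
{
    $ctx = deflate_init(ZLIB_ENCODING_GZIP);
    $out = '';
    foreach ($chunks as $chunk) {
        // ZLIB_NO_FLUSH lets zlib decide when to emit compressed output.
        $out .= deflate_add($ctx, $chunk, ZLIB_NO_FLUSH);
    }
    // An empty final call with ZLIB_FINISH writes the remaining data and trailer.
    return $out . deflate_add($ctx, '', ZLIB_FINISH);
}

$data = str_repeat("streamed line\n", 5000);
$gz = gzipChunks(str_split($data, 1000));
var_dump(gzdecode($gz) === $data);
```

As with bzwrite, the size of the chunks fed in is a convenience for the caller and is not recorded in the output.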


Vineet1982 answered Oct 17 '22 19:10
