Creating bzip2-compressed data in PHP is very easy thanks to bzcompress. In my present application I cannot reasonably read the whole input file into a string and then call bzcompress or bzwrite. The PHP documentation does not make it clear whether successive calls to bzwrite with relatively small amounts of data will yield the same result as compressing the whole file in one go. By "in one go" I mean something along the lines of
$data = file_get_contents('/path/to/bigfile');
$cdata = bzcompress($data);
I tried out piecemeal bzip2 compression using the routines shown below:
// Compress $infile into $outfile, feeding bzwrite one small chunk at a time.
function makeBZFile($infile,$outfile)
{
    $fp = fopen($infile,'rb');
    $bz = bzopen($outfile,'w');
    while (!feof($fp))
    {
        $bytes = fread($fp,10240);
        bzwrite($bz,$bytes);
    }
    bzclose($bz);
    fclose($fp);
}
// Decompress $infile into $outfile, appending each chunk returned by bzread.
function unmakeBZFile($infile,$outfile)
{
    $bz = bzopen($infile,'r');
    while (!feof($bz))
    {
        $str = bzread($bz,10240);
        file_put_contents($outfile,$str,FILE_APPEND);
    }
    bzclose($bz);
}
set_time_limit(1200);
makeBZFile('/tmp/test.rnd','/tmp/test.bz');
unmakeBZFile('/tmp/test.bz','/tmp/btest.rnd');
To test this code I did two things:
1. Used makeBZFile and unmakeBZFile to compress and then decompress a SQLite database, which is what I need to do eventually.
2. Did the same with a 50 MB file of random data created with dd if=/dev/urandom of=/tmp/test.rnd bs=50M count=1
In both cases I performed a diff original.file decompressed.file and found that the two were identical.
All very nice, but it is not clear to me why this is working. The PHP docs state that bzread(bzpointer,length) reads a maximum of length bytes of UNCOMPRESSED data. If my code above is working, it is because I am forcing the bzwrite and bzread size to 10240 bytes.
What I cannot see is just how bzread knows how to fetch length bytes of UNCOMPRESSED data. I checked out the format of a bzip2 file. I cannot see that there is anything there which helps easily establish the uncompressed data length for a chunk of the .bz file.
I suspect there is a gap in my understanding of how this works, or else the fact that my code above appears to perform a correct piecemeal compression is purely accidental.
I'd much appreciate a few explanations here.
To understand how decompression knows how many bytes to return, you first have to understand the compression. It seems that you are not familiar with the compression algorithm.
The crucial algorithm in bzip2 is the Burrows-Wheeler transform (BWT), which converts the original data into a form better suited to the coding stage that follows. The current version applies a Huffman code at that stage. The compression algorithm processes the data in blocks, each completely independent of the others. The block size can be set on a scale from 1 to 9 (100,000 to 900,000 bytes).
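Since the compressor gathers input into blocks of its own, the size of the pieces handed to bzwrite should not matter, and neither should the size requested from bzread. Here is a minimal sketch to check that, assuming the bz2 extension is loaded. roundTrip is a hypothetical helper name, and the chunk sizes 777 and 4096 are arbitrary.

```php
<?php
// Hypothetical helper: compress $data through bzwrite in pieces of
// $writeChunk bytes, then read it back in pieces of up to $readChunk bytes.
function roundTrip(string $data, int $writeChunk, int $readChunk): string
{
    $path = tempnam(sys_get_temp_dir(), 'bz');

    $bz = bzopen($path, 'w');
    foreach (str_split($data, $writeChunk) as $piece) {
        bzwrite($bz, $piece);
    }
    bzclose($bz);

    $out = '';
    $bz = bzopen($path, 'r');
    while (!feof($bz)) {
        $chunk = bzread($bz, $readChunk);
        if ($chunk === false || $chunk === '') {
            break; // read error or nothing left to decompress
        }
        $out .= $chunk;
    }
    bzclose($bz);
    unlink($path);

    return $out;
}

$data = random_bytes(200000);
// Deliberately mismatched write and read sizes.
var_dump(roundTrip($data, 777, 4096) === $data);
```

If the comparison holds with mismatched sizes, the 10240 used on both sides in the question is not what makes the round trip work.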
A compressed stream starts with the two characters 'BZ', followed by one byte identifying the algorithm used ('h' for Huffman coding). The block-size identifier follows immediately and is valid for the entire file, so after 'BZ' the header reads h1, h2, h3 and so on up to h9. The digit gives the block size in units of 100,000 bytes, from 1 to 9 (100,000 to 900,000 bytes).
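The header layout described above can be checked on real output. This is only a sketch, assuming the bz2 extension is available. bzBlockSizeLevel is a hypothetical helper, not part of PHP. The second argument to bzcompress is its block size parameter.

```php
<?php
// Hypothetical helper: return the block-size digit (1-9) from a bzip2
// header, or null if the data does not start with 'BZh' plus a digit.
function bzBlockSizeLevel(string $compressed): ?int
{
    return preg_match('/\ABZh([1-9])/', $compressed, $m) === 1
        ? (int) $m[1]
        : null;
}

// Ask for block size 9, then read it back out of the header.
$level = bzBlockSizeLevel(bzcompress('hello world', 9));
echo $level, ' => ', $level * 100000, " bytes per block\n";
```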
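The original data are stored in blocks of the selected size, and each block is protected individually with a CRC32 checksum. In addition, a 48-bit identifier introduces each block. This block structure allows partial reconstruction of damaged files. It also means the decompressor needs no stored length for your chunks: it decodes the blocks into a buffer, and bzread simply hands back at most length bytes of that decoded data per call.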
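To see the 48-bit block identifier and the block checksum, you can look at the bytes right after the 4-byte header. A rough sketch, assuming the bz2 extension and non-empty input. firstBlockInfo is a hypothetical helper. Only the first block starts on a byte boundary, so this does not find later blocks. The stored value is bzip2's own CRC of the block, so do not expect it to match PHP's crc32().

```php
<?php
// The 48-bit identifier that introduces every bzip2 block, as hex.
const BZ_BLOCK_MAGIC = '314159265359';

// Hypothetical helper: return the identifier and stored CRC of the
// first block, or null if the data does not look like a bzip2 stream
// that contains at least one block.
function firstBlockInfo(string $compressed): ?array
{
    if (strlen($compressed) < 14) {
        return null;
    }
    $magic = bin2hex(substr($compressed, 4, 6));
    if (substr($compressed, 0, 3) !== 'BZh' || $magic !== BZ_BLOCK_MAGIC) {
        return null;
    }
    // The 32-bit block checksum follows the identifier, big-endian.
    $crc = unpack('N', substr($compressed, 10, 4))[1];

    return ['magic' => $magic, 'crc' => sprintf('%08x', $crc)];
}

print_r(firstBlockInfo(bzcompress('hello world')));
```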
Gzip and bzip2 are functionally equivalent. One advantage of gzip is that it can compress a stream, a sequence where you can't look back. This makes it the standard compressor for HTTP streams. The DEFLATE Compressed Data Format Specification (RFC 1951) and the GZIP File Format Specification (RFC 1952) are published documents.