
Using bz2.BZ2Decompressor

I am running Python 3.6.4 on Windows 10 with the Fall Creators Update. I am attempting to decompress a Wikimedia data dump file, specifically https://dumps.wikimedia.org/enwiktionary/latest/enwiktionary-latest-pages-meta-current.xml.bz2.

This file decompresses without problems using 7z on the command line, but with the Python decompressor it fails on the first block of data, producing zero-length output. The code follows:

import bz2

def decompression(qin,                 # Iterable supplying input bytes data
                  qout):               # Pipe to next process - needs bytes data
    decomp = bz2.BZ2Decompressor()     # Create a decompressor
    for chunk in qin:                  # Loop obtaining data from source iterable
        lc = len(chunk)                # = 16384
        dc = decomp.decompress(chunk)  # Do the decompression
        ldc = len(dc)                  # = 0
        qout.put(dc)                   # Pass the decompressed chunk to the next process

I have verified that the bz2 header is valid, and since the file decompresses without problems using command-line utilities, the problem seems to be related to the Python implementation of BZ2. The following attribute values from the decompressor look fine and match what the documentation describes:

eof = False
unused_data = b''
needs_input = True

Any suggestions on how to troubleshoot this problem?

Jonathan asked Oct 20 '25 05:10



1 Answer

Beats me. I can't find anything wrong with your function. The following works on the linked .bz2 file with no issue, where the output exactly matches the result of a command-line decompression of that .bz2 file:

import sys
import bz2

def decompression(qin,                 # Iterable supplying input bytes data
                  qout):               # Pipe to next process - needs bytes data
    decomp = bz2.BZ2Decompressor()     # Create a decompressor
    for chunk in qin:                  # Loop obtaining data from source iterable
        lc = len(chunk)                # = 16384
        dc = decomp.decompress(chunk)  # Do the decompression
        # qout.put(dc)                   # Pass the decompressed chunk to the next process
        qout.write(dc)

with open('enwiktionary-latest-pages-meta-current.xml.bz2', 'rb') as f:
    it = iter(lambda: f.read(16384), b'')
    decompression(it, sys.stdout.buffer)

I only made one trivial change to your function in order to write the result to stdout. I am using Python 3.6.4. I also tried it with Python 2.7.10 (removing the .buffer), and it again worked flawlessly.

Are you actually just letting your function run? What do you mean by "fails on the first block"? The first few calls (seven in this case) will in fact return no decompressed data, because you have not yet provided a complete block for it to work on. But there are no errors reported.
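To see that behaviour in isolation, here is a minimal sketch that does not need the Wikimedia dump. It assumes random bytes as a stand-in payload (random data barely compresses, so a single bzip2 block spans many 16384-byte input chunks); the helper name feed_in_chunks is made up for this illustration.

```python
import bz2
import os

def feed_in_chunks(compressed, chunk_size=16384):
    """Feed compressed data to a BZ2Decompressor chunk by chunk.

    Returns the output length of each decompress() call and the
    reassembled output.
    """
    decomp = bz2.BZ2Decompressor()
    sizes, parts = [], []
    for start in range(0, len(compressed), chunk_size):
        out = decomp.decompress(compressed[start:start + chunk_size])
        sizes.append(len(out))   # 0 while the current block is still incomplete
        parts.append(out)
    return sizes, b"".join(parts)

# Stand-in payload (an assumption for this sketch): incompressible random
# bytes that fit in one bzip2 block at the default compression level.
original = os.urandom(200000)
compressed = bz2.compress(original)

sizes, restored = feed_in_chunks(compressed)
print(sizes)                  # leading zeros, then the block's data arrives
print(restored == original)
```

The early calls return b'' without raising anything, and eof stays False and needs_input stays True throughout, which matches the values reported in the question.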

Note: to do this right for .bz2 files that contain concatenated bzip2 streams, you would need to check for eof becoming true after each call. When it does, create a new decompressor object and feed it the unused_data from the previous decompressor object, followed by more data read from the compressed file. The linked file isn't one of those.
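The multi-stream handling described in that note could be sketched roughly as follows. This is one possible shape, not the only one: the generator name iter_decompress is hypothetical, and the sketch assumes the input contains nothing but bzip2 streams (trailing non-bzip2 bytes would make the fresh decompressor raise an error).

```python
import bz2

def iter_decompress(chunks):
    """Yield decompressed data from an iterable of compressed byte chunks,
    starting a new decompressor whenever one bzip2 stream ends."""
    decomp = bz2.BZ2Decompressor()
    for chunk in chunks:
        while chunk:
            out = decomp.decompress(chunk)
            if out:
                yield out
            if not decomp.eof:
                break                        # chunk consumed, stream continues
            # End of one stream: whatever follows belongs to the next stream.
            chunk = decomp.unused_data
            decomp = bz2.BZ2Decompressor()

# Two concatenated streams, fed in deliberately small pieces.
data = bz2.compress(b"first ") + bz2.compress(b"second")
pieces = [data[i:i + 7] for i in range(0, len(data), 7)]
result = b"".join(iter_decompress(pieces))
print(result)
```

If incremental feeding from a queue is not a requirement, bz2.open (or bz2.BZ2File) reads files with multiple concatenated streams directly in Python 3.3 and later.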

Mark Adler answered Oct 22 '25 05:10
