 

Python gzip CRC check failed

I have a folder with huge text files. Each one is gzipped and weighs several gigabytes. I wrote a piece of code to split the content of each gzip file: each gzip file is opened with gzip, then a specified number of lines is read at a time and written to a new gzip file.

Here is the code, in file file_compression.py:

import sys, os, file_manipulation as fm
import gzip


def splitGzipFile(fileName, dest=None, chunkPerSplit=100, linePerChunk=4,
                  file_field_separator="_", zfill=3, verbose=False,
                  file_permission=None, execute=True):
    """
    Splits a gz file into chunk files.
    :param fileName:
    :param chunkPerSplit:
    :param linePerChunk:
    :return:
    """
    absPath = os.path.abspath(fileName)
    baseName = os.path.basename(absPath)
    dirName = os.path.dirname(absPath)
    destFolder = dirName if dest is None else dest


    ## Compute file fields
    rawBaseName, extensions = baseName.split(os.extsep, 1)

    if not extensions.startswith("."):
        extensions = "." + extensions

    file_fields = rawBaseName.split(file_field_separator)
    first_fields = file_fields[:-1] if len(file_fields) > 1 else file_fields
    first_file_part = file_field_separator.join(first_fields)
    last_file_field = file_fields[-1] if len(file_fields) > 1 else ""
    current_chunk = getCurrentChunkNumber(last_file_field)
    if current_chunk is None or current_chunk < 0:
        first_file_part = rawBaseName

    ## Initialize chunk variables
    linePerSplit = chunkPerSplit * linePerChunk
    chunkCounter = 0 if current_chunk is None else current_chunk - 1

    for chunk in getFileChunks(fileName, linePerSplit):
        print "writing " + str(len(chunk)) + " ..."
        chunkCounter += 1
        oFile = fm.buildPath(destFolder) + first_file_part + file_field_separator + str(chunkCounter).zfill(zfill) + extensions

        if execute:
            writeGzipFile(oFile, chunk, file_permission)
        if verbose:
            print "Splitting: created file ", oFile



def getCurrentChunkNumber(chunk_field):
    """
    Tries to guess an integer from a string.
    :param chunk_field:
    :return: an integer, None if failure.
    """
    try:
        return int(chunk_field)
    except ValueError:
        return None


def getFileChunks(fileName, linePerSplit):
    with gzip.open(fileName, 'rb') as f:
        print "gzip open"
        lineCounter = 0
        currentChunk = ""
        for line in f:
            currentChunk += line
            lineCounter += 1
            if lineCounter >= linePerSplit:
                yield currentChunk
                currentChunk = ""
                lineCounter = 0
        if currentChunk:
            yield currentChunk


def writeGzipFile(file_name, content, file_permission=None):
    with gzip.open(file_name, 'wb') as f:
        if content:
            f.write(content)

    if file_permission is not None and isinstance(file_permission, int):
        os.chmod(file_name, file_permission)
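For reference, the chunking logic above can be sketched in a self-contained, Python 3-compatible form. It collects lines in a list and joins once per chunk, which avoids the quadratic cost of repeated string concatenation on multi-gigabyte files (file_chunks is a hypothetical stand-in for getFileChunks, not part of the original module):

```python
import gzip

def file_chunks(path, lines_per_split):
    # Yield blocks of up to lines_per_split lines from a gzip file.
    # Joining a list once per chunk is far cheaper than building the
    # chunk with repeated string concatenation.
    chunk, count = [], 0
    with gzip.open(path, "rt") as f:
        for line in f:
            chunk.append(line)
            count += 1
            if count >= lines_per_split:
                yield "".join(chunk)
                chunk, count = [], 0
    if chunk:
        yield "".join(chunk)
```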

This task is multiprocessed: a process is created for each file before it is split. Each file is opened and split only once before being erased; I made sure of that by recording processed files in a list:

import os, time
from tools.file_utils import file_compression as fc, file_manipulation as fm
from multiprocessing import Process, Manager

manager = Manager()
split_seen = manager.list()

files = [...] # list is full of gzip files.
processList = []
sampleDir = "sample/dir/"

for file in files:
    filePath = sampleDir + str(file)
    p = Process(target=processFile, args=(filePath, sampleDir, True))
    p.start()
    processList.append(p)

## Join the processes
for p in processList:
    p.join()

def processFile(filePath, destFolder, verbose=True):
    global split_seen
    if filePath in split_seen:
        print "Duplicate file processed: " + str(filePath)
        time.sleep(3)
    print "adding", filePath, len(split_seen)
    split_seen.append(filePath)
    fc.splitGzipFile(filePath, dest=destFolder,
                     chunkPerSplit=4000000,
                     linePerChunk=4,
                     verbose=True,
                     file_permission=0770,
                     zfill=3)

    os.remove(filePath)
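One thing to note about the split_seen bookkeeping: the membership test and the append are two separate operations, so two processes could in principle both pass the `in` check before either appends. A minimal sketch of an atomic check-and-claim using a manager lock (try_claim is a hypothetical helper, not part of the code above):

```python
from multiprocessing import Manager

def try_claim(path, seen, lock):
    # Check membership and record the path inside one critical
    # section; returns True only for the first caller to claim it.
    with lock:
        if path in seen:
            return False
        seen.append(path)
        return True
```

Each worker would call try_claim(filePath, split_seen, lock) and skip the file when it returns False.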

So far the code has always run fine. But today I had an issue with gzip files' CRC corruption:

Process Process-3:72:
Traceback (most recent call last):
  ...
  File "/.../tools/file_utils/file_compression.py", line 43, in splitGzipFile
    for chunk in getFileChunks(fileName, linePerSplit):
  File "/.../tools/file_utils/file_compression.py", line 70, in getFileChunks
    for line in f:
  File "/.../python2.7/lib/python2.7/gzip.py", line 450, in readline
    c = self.read(readsize)
  File "/.../python2.7/lib/python2.7/gzip.py", line 256, in read
    self._read(readsize)
  File "/.../python2.7/lib/python2.7/gzip.py", line 320, in _read
    self._read_eof()
  File "/.../python2.7/lib/python2.7/gzip.py", line 342, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0xddbb6045 != 0x34fd5580L

What could be the origin of this issue? I must stress that it has always worked so far; folders and files always have the same structure. The difference this time is perhaps that my script is processing more gzip files than usual, maybe twice as many.

Could it be a matter of the same files being accessed at the same time? I seriously doubt it; I made sure this is not the case by registering each accessed file in my split_seen list.

I would take any hint, as I have no more clues about where to look.


EDIT 1

Maybe some open files were accessed by someone else or by another program? I cannot ask for and rely on testimony. So as a start: if I were to use a multiprocessing.Lock, would it prevent any other thread, process, program, user, etc. from modifying the file? Or is it limited to Python? I cannot find any documentation on that.
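On the Lock question: a multiprocessing.Lock only constrains processes that share that same lock object, i.e. your own Python workers; it cannot stop an unrelated program or user from touching the file. The closest OS-level tool on Linux is an advisory flock, which itself only blocks programs that also take the lock. A minimal sketch (locked_open is a hypothetical helper):

```python
import fcntl

def locked_open(path, mode="rb"):
    # Take an exclusive advisory lock on the whole file. This blocks
    # other processes that also call flock on the same file, but does
    # nothing against programs that never take the lock.
    f = open(path, mode)
    fcntl.flock(f.fileno(), fcntl.LOCK_EX)
    return f  # closing the file releases the lock
```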

Asked Oct 31 '22 by kaligne


2 Answers

I got the exact same error with code that had been running for months. It turned out the source file itself was corrupted for that particular file. I went back to an old file and it worked fine, and I used a newer file and it also worked fine.
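A quick way to check whether a source file itself is corrupt is to stream through it once before processing: gzip only verifies the trailing CRC when it reaches the end of the stream, so a full read is required. A minimal sketch in Python 3 syntax (is_gzip_intact is a hypothetical helper):

```python
import gzip

def is_gzip_intact(path, chunk_size=1 << 20):
    # Decompress the whole file and discard the output; the CRC and
    # length stored in the gzip trailer are only checked at EOF.
    try:
        with gzip.open(path, "rb") as f:
            while f.read(chunk_size):
                pass
        return True
    except (IOError, OSError, EOFError):
        return False
```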

Answered Nov 03 '22 by Carl


I had the same issue. I just deleted the old file and re-ran the code.

rm -rf /tmp/imagenet/

HTH

Answered Nov 02 '22 by Swathi