
gzip fails when writing a large amount of data to a file

I have big gzip-compressed files. I have written a piece of code to split those files into smaller ones, and I can specify the number of lines per file. The thing is that I recently increased the number of lines per split to 16,000,000, and when I process bigger files the split won't happen. Sometimes a smaller file is produced successfully; sometimes one is produced but weighs only 40 B or 50 B, which is a failure. I tried to catch an exception by looking at those raised in the gzip code. So my code looks like this:

def writeGzipFile(file_name, content):
    import gzip
    with gzip.open(file_name, 'wb') as f:
        if not content == '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except:
                print "UNEXPECTED ERROR wb"

The thing is that when the content is too large (relative to the number of lines), I often get the "UNEXPECTED ERROR" message, so I have no idea which kind of error is thrown here.

I finally found that the number of lines was the problem: it appears Python's gzip fails to write that much data to one file at once. Lowering the number of lines per split to 4,000,000 works. However, I would like to split the content and write it sequentially to a file, to make sure that even large content gets written.

So I would like to know how to find out the maximum number of characters that can reliably be written to a file in one go using gzip, without any failure.
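
Something along these lines is what I have in mind (a rough sketch only; the 64 MiB slice size and the helper name write_gzip_file_chunked are placeholders, not part of my actual code):

def write_gzip_file_chunked(file_name, content):
    import gzip
    # Write the content in fixed-size slices instead of one giant
    # f.write() call; 64 MiB per slice is an arbitrary choice.
    chunk_size = 64 * 1024 * 1024
    with gzip.open(file_name, 'wb') as f:
        for start in xrange(0, len(content), chunk_size):
            f.write(content[start:start + chunk_size])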


EDIT 1

So I caught all remaining exceptions (I did not know that it was possible to simply catch Exception, sorry):

def writeGzipFile(file_name, content, file_permission=None):
    import gzip, sys, traceback  # sys is needed for the traceback call below
    with gzip.open(file_name, 'wb') as f:
        if not content == '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except Exception, err:
                print "EXCEPTION:", err.message
                print "TRACEBACK_1:", traceback.print_exc(file=sys.stdout)
            except:
                print "UNEXPECTED ERROR wb"

The error is about int size; I never thought I would exceed the int size one day:

EXCEPTION: size does not fit in an int
TRACEBACK_1:Traceback (most recent call last):
  File "/home/anadin/dev/illumina-project-restructor_mass-splitting/illumina-project-restructor/tools/file_utils/file_compression.py", line 131, in writeGzipFile
    f.write(content)
  File "/usr/local/cluster/python2.7/lib/python2.7/gzip.py", line 230, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int
None

OK, so the max size of an int being 2,147,483,647, my chunk of data is about 3,854,674,090 characters long according to my log (the chunk is a string, and I got that figure from its __len__() method).
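
Incidentally, the zlib.crc32 call in the traceback accepts a running CRC as its second argument, so the checksum can be built piecewise; as far as I understand, that is why writing smaller chunks side-steps the limit. A quick illustration:

import zlib

# Building the CRC from small pieces gives the same result as one huge
# call, without ever passing a buffer whose length must fit in a C int.
crc = 0
for piece in ('some ', 'large ', 'content'):
    crc = zlib.crc32(piece, crc)
assert crc == zlib.crc32('some large content')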

So, as I planned to do, and as Antti Haapala suggested, I am about to read smaller chunks at a time and write them sequentially to smaller files.

asked Mar 15 '16 by kaligne



1 Answer

In any case, I suspect the reason is some kind of out-of-memory error. It is quite unclear to me why you would not write this data a smaller amount at a time; here, using the chunks method from this answer:

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

...
with gzip.open(file_name, 'wb') as f:
    for chunk in chunks(content, 65536):
        f.write(chunk)

That is, you do it like you'd eat an elephant: take one bite at a time.
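
Applied to the writeGzipFile function from the question, the whole thing would look roughly like this (a sketch; the 65536-byte chunk size is just a reasonable default, not a hard requirement):

import gzip

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

def writeGzipFile(file_name, content):
    # Each chunk is tiny compared to the C int limit that made the
    # single huge write overflow in zlib.crc32.
    with gzip.open(file_name, 'wb') as f:
        for chunk in chunks(content, 65536):
            f.write(chunk)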