
gzip fails when writing a large amount of data to a file

I have big gzip-compressed files. I have written a piece of code to split those files into smaller ones, and I can specify the number of lines per file. The thing is that I recently increased the number of lines per split to 16,000,000, and when I process bigger files the split won't happen. Sometimes a smaller file is produced successfully; sometimes one is produced but weighs only 40 B or 50 B, which is a failure. I tried to catch an exception by looking at those raised in the gzip code. So my code looks like this:

def writeGzipFile(file_name, content):
    import gzip
    with gzip.open(file_name, 'wb') as f:
        if not content == '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except:
                print "UNEXPECTED ERROR wb"

The thing is that when the content is too large (relative to the number of lines), I often get the "UNEXPECTED ERROR" message, so I have no idea which kind of error is thrown here.

I finally found that the number of lines was the problem: it appears Python's gzip fails to write that much data to one file at once. Lowering the number of lines per split to 4,000,000 works. However, I would like to split the content and write it sequentially to a file, to make sure that even large content gets written.

So I would like to know how to find out the maximum number of characters that can reliably be written to a file in one go using gzip, without any failure.
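
Something along these lines is what I have in mind (a rough sketch only; the 64 MiB slice size and the helper name write_gzip_file_chunked are placeholders, not part of my actual code):

def write_gzip_file_chunked(file_name, content):
    import gzip
    # Write the content in fixed-size slices instead of one giant
    # f.write() call; 64 MiB per slice is an arbitrary choice.
    chunk_size = 64 * 1024 * 1024
    with gzip.open(file_name, 'wb') as f:
        for start in xrange(0, len(content), chunk_size):
            f.write(content[start:start + chunk_size])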


EDIT 1

So I caught all remaining exceptions (I did not know that it was possible to simply catch Exception, sorry):

def writeGzipFile(file_name, content, file_permission=None):
    import gzip, sys, traceback  # sys is needed for the traceback call below
    with gzip.open(file_name, 'wb') as f:
        if not content == '':
            try:
                f.write(content)
            except IOError as ioe:
                print "I/O ERROR wb", ioe.message
            except ValueError as ve:
                print "VALUE ERROR wb: ", ve.message
            except EOFError as eofe:
                print "EOF ERROR wb: ", eofe.message
            except Exception, err:
                print "EXCEPTION:", err.message
                print "TRACEBACK_1:", traceback.print_exc(file=sys.stdout)
            except:
                print "UNEXPECTED ERROR wb"

The error is about int size; I never thought I would exceed the int size one day:

EXCEPTION: size does not fit in an int
TRACEBACK_1:Traceback (most recent call last):
  File "/home/anadin/dev/illumina-project-restructor_mass-splitting/illumina-project-restructor/tools/file_utils/file_compression.py", line 131, in writeGzipFile
    f.write(content)
  File "/usr/local/cluster/python2.7/lib/python2.7/gzip.py", line 230, in write
    self.crc = zlib.crc32(data, self.crc) & 0xffffffffL
OverflowError: size does not fit in an int
None

OK, so the max size of an int being 2,147,483,647, my chunk of data is about 3,854,674,090 characters long according to my log (the chunk is a string, and I got that figure from its __len__() method).
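
Incidentally, the zlib.crc32 call in the traceback accepts a running CRC as its second argument, so the checksum can be built piecewise; as far as I understand, that is why writing smaller chunks side-steps the limit. A quick illustration:

import zlib

# Building the CRC from small pieces gives the same result as one huge
# call, without ever passing a buffer whose length must fit in a C int.
crc = 0
for piece in ('some ', 'large ', 'content'):
    crc = zlib.crc32(piece, crc)
assert crc == zlib.crc32('some large content')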

So, as I planned to do, and as Antti Haapala suggested, I am about to read smaller chunks at a time and write them sequentially to smaller files.

asked Mar 15 '16 by kaligne



1 Answer

In any case, I suspect the reason is some kind of out-of-memory error. It is quite unclear to me why you would not write this data a smaller amount at a time; here, using the chunks method from this answer:

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

...
with gzip.open(file_name, 'wb') as f:
    for chunk in chunks(content, 65536):
        f.write(chunk)

That is, you do it like you'd eat an elephant: take one bite at a time.
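
Applied to the writeGzipFile function from the question, the whole thing would look roughly like this (a sketch; the 65536-byte chunk size is just a reasonable default, not a hard requirement):

import gzip

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

def writeGzipFile(file_name, content):
    # Each chunk is tiny compared to the C int limit that made the
    # single huge write overflow in zlib.crc32.
    with gzip.open(file_name, 'wb') as f:
        for chunk in chunks(content, 65536):
            f.write(chunk)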