I am trying to read a gzipped file (.gz) in Python and am having some trouble.
I used the gzip module to read it, but the file is encoded as UTF-8 text, so eventually it reads an invalid character and crashes.
Does anyone know how to read gzip files that contain UTF-8 text? I know there's a codecs module that can help, but I can't figure out how to use it.
Thanks!
import string
import gzip
import codecs

f = gzip.open('file.gz','r')
engines = {}
line = f.readline()
while line:
    parsed = string.split(line, u'\u0001')
    #do some things...
    line = f.readline()
for en in engines:
    print(en)
gzip.open(): This function opens a gzip-compressed file in binary or text mode and returns a file object. The filename argument may be an actual file name (a str or bytes object) or an existing file object to read from or write to. By default the file is opened in 'rb' mode, i.e. for reading binary data; the mode parameter also accepts 'r', 'a', 'ab', 'w', 'wb', 'x' and 'xb' for binary mode, and 'rt', 'at', 'wt' and 'xt' for text mode.
Always close the file once you have finished writing, either by calling close() or by opening the file in a with statement. Use the write() and writelines() methods to write text. Pass encoding='utf-8' to open() to write UTF-8 characters to a file.
The gzip module provides high-level functions such as open(), compress() and decompress() for working with gzip files. In practice, using it is much like opening an ordinary file. There is no need to pip install the module, since it is part of the standard library.
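As a rough illustration of the modes described above, the sketch below writes a small file in text mode and reads it back in both binary and text mode. The file name demo.gz and the sample string are made up for the example.

```python
import gzip
import os
import tempfile

# Hypothetical scratch file used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.gz")

# 'wt' = write text; the encoding argument only applies in text mode.
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("naïve café\n")

# The default mode is 'rb', so read() returns the decompressed bytes.
with gzip.open(path) as f:
    raw = f.read()

# 'rt' decodes while reading, so read() returns a str.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text = f.read()

# compress() and decompress() do the same round trip on in-memory bytes.
round_trip = gzip.decompress(gzip.compress(raw))

print(type(raw).__name__, type(text).__name__)
```

The with statement closes each file automatically, so no explicit close() call is needed.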
This is possible since Python 3.3:
import gzip
gzip.open('file.gz', 'rt', encoding='utf-8')
Notice that gzip.open() requires you to explicitly specify text mode ('t').
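Applied to the loop in the question, that might look something like the sketch below. The helper name read_fields and the sample file are assumptions made for illustration; only gzip.open() with 'rt' and encoding comes from the answer itself.

```python
import gzip
import os
import tempfile


def read_fields(path, sep="\u0001"):
    """Yield each line of a UTF-8 gzip file as a list of fields split on sep."""
    # 'rt' asks gzip.open() for text mode, so lines arrive as str, not bytes.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split(sep)


# Build a small sample file so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("göogle\u0001search\n")
    f.write("日本語\u0001engine\n")

rows = list(read_fields(sample_path))
print(rows)
```

Iterating over the file object replaces the readline() loop, and str.split() replaces the Python 2 only string.split().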
I don't see why this should be so hard.
What are you doing exactly? Please explain "eventually it reads an invalid character".
It should be as simple as:
import gzip

fp = gzip.open('foo.gz')
contents = fp.read()  # contents now has the uncompressed bytes of foo.gz
fp.close()
u_str = contents.decode('utf-8')  # u_str is now a unicode string
This answer works for Python 2. In Python 3, please see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open).
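If decoding the bytes really does fail on an "invalid character", one way to find out where is to catch the UnicodeDecodeError. This is only a sketch for Python 3: the helper name decode_gzip_bytes and the sample data are made up, and falling back to errors='replace' is one possible policy, not necessarily the right one for your data.

```python
import gzip


def decode_gzip_bytes(raw, encoding="utf-8"):
    """Decompress gzip bytes and decode them, reporting where decoding fails."""
    data = gzip.decompress(raw)
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start is the byte offset of the first undecodable byte.
        print("invalid byte at offset", exc.start, "->", data[exc.start:exc.start + 1])
        # Fall back to U+FFFD replacement characters instead of crashing.
        return data.decode(encoding, errors="replace")


# Valid UTF-8 decodes cleanly.
good = decode_gzip_bytes(gzip.compress("caf\u00e9\u0001bar".encode("utf-8")))
# Latin-1 bytes are not valid UTF-8, so the fallback path is taken.
bad = decode_gzip_bytes(gzip.compress(b"caf\xe9\x01bar"))
```

If the offending byte turns out to be something like 0xE9, the file is probably not UTF-8 at all (Latin-1 is a common culprit), and passing the correct encoding is better than replacing characters.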