
Reading utf-8 characters from a gzip file in python

I am trying to read a gzipped file (.gz) in Python and am having some trouble.

I used the gzip module to read it, but the file is encoded as UTF-8 text, so eventually it reads an invalid character and crashes.

Does anyone know how to read gzip files encoded as UTF-8? I know that there's a codecs module that can help, but I can't understand how to use it.

Thanks!

    import string
    import gzip
    import codecs

    f = gzip.open('file.gz','r')

    engines = {}
    line = f.readline()
    while line:
        parsed = string.split(line, u'\u0001')

        #do some things...

        line = f.readline()
    for en in engines:
      print(en)
Juan Besa asked Dec 10 '09 20:12



2 Answers

This is possible since Python 3.3:

    import gzip
    gzip.open('file.gz', 'rt', encoding='utf-8')

Notice that gzip.open() requires you to explicitly specify text mode ('t').
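As a rough sketch of how text mode could fit the loop from the question: the helper name `read_records` and the file `sample.gz` are made up for illustration, and the `\u0001` separator is taken from the question's code. The demo writes its own small file first so that it runs on its own.

```python
import gzip
import os
import tempfile


def read_records(path, sep="\u0001"):
    """Yield each line of a UTF-8 gzip file as a list of fields split on sep."""
    # 'rt' selects text mode, so gzip decodes the bytes to str using `encoding`.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n").split(sep)


# Demo: write a small UTF-8 sample file (hypothetical name), then read it back.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write("café\u0001naïve\n日本語\u0001テスト\n")

records = list(read_records(sample_path))
print(records)
```

Iterating over the file object yields one decoded line at a time, so the whole file never has to be held in memory.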

Seppo Enarvi answered Sep 29 '22 05:09


I don't see why this should be so hard.

What are you doing exactly? Please explain "eventually it reads an invalid character".

It should be as simple as:

    import gzip
    fp = gzip.open('foo.gz')
    contents = fp.read() # contents now has the uncompressed bytes of foo.gz
    fp.close()
    u_str = contents.decode('utf-8') # u_str is now a unicode string
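If the file is too large to read in one call, the same decode step can probably be applied incrementally. Since the question mentions the codecs module, here is one possible sketch that wraps the binary gzip stream with `codecs.getreader`. The helper name `iter_decoded_lines` and the file `demo.gz` are assumptions for illustration, not part of the original answer.

```python
import codecs
import gzip
import os
import tempfile


def iter_decoded_lines(path, encoding="utf-8"):
    """Yield decoded text lines from a gzip file opened in binary mode."""
    with gzip.open(path, "rb") as raw:
        # getreader returns a StreamReader class; calling it wraps the byte stream.
        reader = codecs.getreader(encoding)(raw)
        for line in reader:
            yield line


# Demo: create a small gzip file of UTF-8 bytes (hypothetical name), then read it.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.gz")
with gzip.open(demo_path, "wb") as f:
    f.write("première ligne\nzweite Zeile\n".encode("utf-8"))

lines = list(iter_decoded_lines(demo_path))
print(lines)
```

The stream reader decodes as it goes, so multi-byte characters that straddle a read boundary are handled without decoding the whole file at once.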

EDITED

This answer works for Python 2. In Python 3, please see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open).

sjbrown answered Sep 29 '22 04:09