Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Loading UTF-8 file in Python 3 using numpy.genfromtxt

I have a CSV file that I downloaded from WHO site (http://apps.who.int/gho/data/view.main.52160 , Downloads, "multipurpose table in CSV format"). I try to load the file into a numpy array. Here's my code:

import numpy
#U75 - unicode string of max. length 75
world_alcohol = numpy.genfromtxt("xmart.csv", dtype="U75", skip_header=2, delimiter=",")
print(world_alcohol)

And I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

I guess that numpy has a problem reading the string "Côte d'Ivoire". The file is properly encoded UTF-8 (according to my text editor). I am using Python 3.4.3 and numpy 1.9.2.

What am I doing wrong? How can I read the file into numpy?

like image 609
JustAC0der Avatar asked Oct 07 '15 19:10

JustAC0der


People also ask

How does NumPy Genfromtxt work?

genfromtxt. Load data from a text file, with missing values handled as specified. Each line past the first skip_header lines is split at the delimiter character, and characters following the comments character are discarded.

What does Loadtxt () do in NumPy?

Load data from a text file. Each row in the text file must have the same number of values. File, filename, list, or generator to read.


1 Answers

Note the original 2015 date. Since then genfromtxt has gotten an encoding parameter.


In Python3 I can do:

In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]: 
array(["Côte d'Ivoire", ''],   dtype='<U20')

Which means I probably could open a 'UTF-8' file (regular, not byte mode), and readlines, and assign them to elements of an array like x.

But genfromtxt insists on operating with byte strings (ascii) which can't handle the larger UTF-8 set (7 bytes v 8). So I need to apply decode at some point to get an U array.

I can load it into a 'S' array with genfromtxt:

In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]: 
array(b"C\xc3\xb4te d'Ivoire",  dtype='|S20')

and apply decode to individual elements:

In [261]: print(a.item().decode())
Côte d'Ivoire

In [325]: print _
Côte d'Ivoire

Or use np.char.decode to apply it to each element of an array:

In [263]: np.char.decode(a)
Out[263]: 
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire

genfromtxt lets me specify converters:

In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
    converters={0:lambda x: x.decode()})
Out[297]: 
array("Côte d'Ivoire", dtype='<U20')

If the csv has a mix of strings and numbers, this converters approach will be easier to use than the np.char.decode. Just specify the converter for each string column.

(See my earlier edits for Python2 tries).

like image 63
hpaulj Avatar answered Oct 25 '22 01:10

hpaulj