I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160 , Downloads, "multipurpose table in CSV format"). I am trying to load the file into a numpy array. Here's my code:
import numpy
#U75 - unicode string of max. length 75
world_alcohol = numpy.genfromtxt("xmart.csv", dtype="U75", skip_header=2, delimiter=",")
print(world_alcohol)
And I get
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).
I guess that numpy has a problem reading the string "Côte d'Ivoire". The file is properly encoded as UTF-8 (according to my text editor). I am using Python 3.4.3 and numpy 1.9.2.
What am I doing wrong? How can I read the file into numpy?
numpy.genfromtxt: Load data from a text file, with missing values handled as specified. Each line past the first skip_header lines is split at the delimiter character, and characters following the comments character are discarded.
numpy.loadtxt: Load data from a text file. Each row in the text file must have the same number of values. Its first argument is the file, filename, list, or generator to read.
Note the original 2015 date. Since then, genfromtxt has gained an encoding parameter.
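For readers on a NumPy release that has the encoding parameter (1.14 or later, as far as I know), a minimal sketch of its use might look like the following. The sample rows and the temporary file are made up for illustration; they only stand in for the real WHO download.

```python
import os
import tempfile

import numpy as np

# Hypothetical sample data standing in for the WHO file.
sample = "Country,Value\nCôte d'Ivoire,1.5\nFrance,2.5\n"

# Write the sample to a temporary UTF-8 encoded CSV file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="utf-8", delete=False
) as f:
    f.write(sample)
    path = f.name

try:
    # encoding="utf-8" tells genfromtxt how to decode the file,
    # so the fields can be stored directly in a unicode ('U') array.
    rows = np.genfromtxt(
        path, dtype="U75", delimiter=",", skip_header=1, encoding="utf-8"
    )
finally:
    os.remove(path)

print(rows)
```

With this, no separate decode step should be needed.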
In Python3 I can do:
In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]:
array(["Côte d'Ivoire", ''], dtype='<U20')
Which means I could probably open a UTF-8 file in regular text mode (not byte mode), call readlines, and assign the lines to elements of an array like x.
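A rough sketch of that idea, using an in-memory io.StringIO as a stand-in for open("xmart.csv", encoding="utf-8") so that it runs without the real file (the two sample lines are made up):

```python
import io

import numpy as np

# Stand-in for: handle = open("xmart.csv", encoding="utf-8")
handle = io.StringIO("Côte d'Ivoire\nFrance\n")

# Text mode yields str objects, already decoded from UTF-8.
lines = [line.rstrip("\n") for line in handle.readlines()]

# Preallocate a unicode array and assign the lines one by one.
names = np.zeros((len(lines),), dtype="U20")
for i, line in enumerate(lines):
    names[i] = line

print(names)
```

This only handles one string per line; splitting on the delimiter would still have to be done by hand or with the csv module.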
But genfromtxt insists on operating with byte strings (ASCII), which can't handle the larger UTF-8 set (7 bits v 8). So I need to apply decode at some point to get a U array.
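To see where the error in the question comes from, here is a small plain-Python sketch (no numpy involved) that encodes the string as UTF-8 and then tries to decode it with the ASCII codec:

```python
txt = "Côte d'Ivoire"

# UTF-8 encodes 'ô' as two bytes, 0xc3 0xb4, both above 127.
raw = txt.encode("utf-8")

error = None
try:
    # The ASCII codec only accepts byte values 0..127.
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc

print(raw)
print(error)

# Decoding with the right codec round-trips cleanly.
print(raw.decode("utf-8"))
```

The 0xc3 byte named in the traceback is the first byte of the two-byte sequence for 'ô'.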
I can load it into an 'S' array with genfromtxt:
In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]:
array(b"C\xc3\xb4te d'Ivoire", dtype='|S20')
and apply decode to individual elements:
In [261]: print(a.item().decode())
Côte d'Ivoire
Or use np.char.decode to apply it to each element of an array:
In [263]: np.char.decode(a)
Out[263]:
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire
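The same call also works on an array with several elements. In this sketch the byte strings are typed in by hand rather than read from a file, and the encoding is passed explicitly instead of relying on the default:

```python
import numpy as np

# Hand-written byte strings, as genfromtxt would return with dtype='S20'.
raw = np.array([b"C\xc3\xb4te d'Ivoire", b"France"], dtype="S20")

# Decode every element from UTF-8 bytes to a unicode string.
decoded = np.char.decode(raw, "utf-8")

print(decoded)
```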
genfromtxt lets me specify converters:
In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
                        converters={0:lambda x: x.decode()})
Out[297]:
array("Côte d'Ivoire", dtype='<U20')
If the csv has a mix of strings and numbers, this converters approach will be easier to use than np.char.decode. Just specify the converter for each string column.
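On a NumPy release that has the encoding parameter, a mixed file can probably be handled without any converters by letting genfromtxt infer a type per column. This is a sketch of that alternative, not the converters approach itself; the column names and values are invented, and a list of strings is used in place of a file:

```python
import numpy as np

# Hypothetical mixed data: one string column, one numeric column.
lines = [
    "Country,Value",
    "Côte d'Ivoire,1.5",
    "France,2.5",
]

# dtype=None infers a type for each column; names=True takes the
# field names from the first line; encoding="utf-8" keeps strings as 'U'.
table = np.genfromtxt(
    lines, delimiter=",", dtype=None, names=True, encoding="utf-8"
)

print(table["Country"])
print(table["Value"])
```

The result is a structured array, so columns are accessed by field name rather than by index.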
(See my earlier edits for Python2 tries).