Loading UTF-8 file in Python 3 using numpy.genfromtxt

Tags: python, csv, utf-8, numpy

I have a CSV file that I downloaded from the WHO site (http://apps.who.int/gho/data/view.main.52160 , Downloads, "multipurpose table in CSV format"). I am trying to load the file into a numpy array. Here's my code:

import numpy
#U75 - unicode string of max. length 75
world_alcohol = numpy.genfromtxt("xmart.csv", dtype="U75", skip_header=2, delimiter=",")
print(world_alcohol)

And I get

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 2: ordinal not in range(128).

I guess that numpy has a problem reading the string "Côte d'Ivoire". The file is properly encoded UTF-8 (according to my text editor). I am using Python 3.4.3 and numpy 1.9.2.

What am I doing wrong? How can I read the file into numpy?

asked Oct 07 '15 19:10

JustAC0der

1 Answer

Note the original 2015 date. Since then genfromtxt has gotten an encoding parameter.
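As a rough sketch of what that looks like on a newer numpy (the encoding argument should be available from numpy 1.14 on, as far as I know), the call from the question only needs one extra keyword. The sample rows and the temporary file below are made up for illustration and stand in for xmart.csv:

```python
import os
import tempfile

import numpy as np

# Made-up sample standing in for the WHO file (not the real data).
sample = "Country,Beverage\nCôte d'Ivoire,Wine\nFrance,Beer\n"

# Write the sample as UTF-8 so genfromtxt has a real file to read.
with tempfile.NamedTemporaryFile("w", suffix=".csv", encoding="utf-8",
                                 delete=False) as fh:
    fh.write(sample)
    path = fh.name

try:
    # encoding="utf-8" tells genfromtxt how to decode the file's bytes.
    world_alcohol = np.genfromtxt(path, dtype="U75", skip_header=1,
                                  delimiter=",", encoding="utf-8")
finally:
    os.remove(path)

print(world_alcohol)
```

With the real file, the same keyword would be added to the original call (keeping skip_header=2).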

In Python 3 I can do:

In [224]: txt = "Côte d'Ivoire"
In [225]: x = np.zeros((2,),dtype='U20')
In [226]: x[0] = txt
In [227]: x
Out[227]: 
array(["Côte d'Ivoire", ''],   dtype='<U20')

This means I could probably open a UTF-8 file in regular text mode (not byte mode), read its lines with readlines, and assign them to elements of an array like x.
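A minimal sketch of that idea, bypassing genfromtxt entirely. It uses the standard csv module to split the lines instead of a bare readlines, which is my substitution; the sample rows and temporary file are invented for illustration:

```python
import csv
import os
import tempfile

import numpy as np

# Invented sample data; the real file would be xmart.csv.
sample = "Country,Beverage\nCôte d'Ivoire,Wine\nFrance,Beer\n"

with tempfile.NamedTemporaryFile("w", suffix=".csv", encoding="utf-8",
                                 delete=False) as fh:
    fh.write(sample)
    path = fh.name

try:
    # Open in text mode so Python decodes UTF-8 before numpy sees anything.
    with open(path, encoding="utf-8", newline="") as fh:
        reader = csv.reader(fh)
        next(reader)          # skip the header row
        rows = list(reader)   # list of lists of str
finally:
    os.remove(path)

# All rows have the same length, so this gives a 2-D unicode array.
arr = np.array(rows, dtype="U75")
print(arr)
```

This only works directly when every row has the same number of fields; ragged rows would need padding or filtering first.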

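But genfromtxt (in this numpy version) insists on operating with byte strings and decodes them as ASCII, which can't represent the larger UTF-8 set (ASCII is a 7-bit code, while UTF-8 encodes characters such as "ô" as multi-byte sequences). So I need to apply decode at some point to get a U array.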

I can load it into a 'S' array with genfromtxt:

In [258]: txt="Côte d'Ivoire"
In [259]: a=np.genfromtxt([txt.encode()],delimiter=',',dtype='S20')
In [260]: a
Out[260]: 
array(b"C\xc3\xb4te d'Ivoire",  dtype='|S20')

and apply decode to individual elements:

In [261]: print(a.item().decode())
Côte d'Ivoire

Or use np.char.decode to apply it to each element of an array:

In [263]: np.char.decode(a)
Out[263]: 
array("Côte d'Ivoire", dtype='<U13')
In [264]: print(_)
Côte d'Ivoire
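The example above decodes a 0-d array; the same call works elementwise on larger arrays. A small sketch with a hand-built bytes array (the values are just for illustration):

```python
import numpy as np

# A bytes ('S') array like the one genfromtxt would produce for a string column.
names_bytes = np.array(["Côte d'Ivoire".encode("utf-8"), b"France"])

# Decode every element from UTF-8 bytes into a unicode ('U') array.
names_text = np.char.decode(names_bytes, "utf-8")
print(names_text)
```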

genfromtxt lets me specify converters:

In [297]: np.genfromtxt([txt.encode()],delimiter=',',dtype='U20',
    converters={0:lambda x: x.decode()})
Out[297]: 
array("Côte d'Ivoire", dtype='<U20')

If the CSV has a mix of strings and numbers, the converters approach is easier to use than np.char.decode: just specify a converter for each string column.
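On a newer numpy there may be no need for per-column converters at all. The sketch below is not the converters approach itself but an alternative, assuming numpy 1.14 or later: it combines encoding="utf-8" with dtype=None so that each column's type is inferred. The column names and values are invented for illustration:

```python
import os
import tempfile

import numpy as np

# Invented mixed-type sample: one text column, one numeric column.
sample = "Country,Litres\nCôte d'Ivoire,6.5\nFrance,12.2\n"

with tempfile.NamedTemporaryFile("w", suffix=".csv", encoding="utf-8",
                                 delete=False) as fh:
    fh.write(sample)
    path = fh.name

try:
    # dtype=None infers a type per column; names=True takes field names
    # from the header row, giving a structured array.
    data = np.genfromtxt(path, delimiter=",", dtype=None, names=True,
                         encoding="utf-8")
finally:
    os.remove(path)

print(data["Country"])
print(data["Litres"])
```

The result is a 1-D structured array, so columns are selected by field name rather than by a second index.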

(See my earlier edits for Python 2 attempts.)

answered Oct 25 '22 01:10

hpaulj
