Reading unicode elements into numpy array

Question

Consider a text file called "new.txt" containing the following elements:

μm
∂r
∆λ

In Python 2.7, I can read the file by typing:

>>> import codecs
>>> f = codecs.open('new.txt', encoding='utf-8')
>>> lines = [line.strip() for line in f.readlines()]
>>> lines
[u'\u03bcm', u'\u2202r', u'\u2206\u03bb']
>>> print lines[0]
μm

So far so good. I can easily convert this list to a numpy array via:

>>> import numpy as np
>>> arr = np.array(lines)
>>> arr
array([u'\u03bcm', u'\u2202r', u'\u2206\u03bb'], 
      dtype='<U2')

The issue is, I can't read this file directly via numpy's loadtxt function:

>>> np.loadtxt('new.txt', dtype=np.unicode_)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib64/python2.7/site-packages/numpy/lib/npyio.py", line 805, in loadtxt
    X = np.array(X, dtype)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xce in position 0: ordinal not in range(128)

What is the correct way to read this file into numpy directly?

Thanks.

Sven Marnach · Accepted Answer

In memory, unicode strings are represented as UCS-2 or UCS-4, depending on how your Python interpreter was compiled. Your file is encoded in UTF-8, so you need to recode it before you can map it to the NumPy array. loadtxt() can't do the recoding for you -- after all NumPy is mainly targeted at numerical arrays.
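The recoding step described above can be spelled out by reading the file as text and stripping the line endings before building the array. This is only a minimal sketch, not code from the answer: it assumes `io.open` (which behaves the same on Python 2.7 and 3), and the temporary copy of new.txt is created here purely so the example runs on its own.

```python
import io
import os
import tempfile

import numpy as np

# Hypothetical sample standing in for the contents of new.txt.
sample = u"\u03bcm\n\u2202r\n\u2206\u03bb\n"

path = os.path.join(tempfile.mkdtemp(), "new.txt")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Decode from UTF-8 while reading and drop the trailing newlines.
with io.open(path, encoding="utf-8") as f:
    lines = [line.rstrip(u"\n") for line in f]

# NumPy picks a unicode dtype wide enough for the longest element.
arr = np.array(lines)
```

The intermediate list costs some memory, which is what the variants that follow try to avoid.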

Assuming every line has the same number of characters, you could also use the more efficient variant

import codecs
import numpy
s = codecs.open("new.txt", encoding="utf-8").read()
arr = numpy.frombuffer(s, dtype="<U3")

This will include the newline characters in the strings. To not include them, use

arr = numpy.frombuffer(s.replace("\n", ""), dtype="<U2")
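As far as I can tell, handing a Python 2 unicode object to frombuffer exposes the interpreter's internal buffer, so the snippets above only line up with NumPy's 4-bytes-per-character U layout on UCS-4 builds. A hedged sketch that sidesteps this by encoding to UTF-32 explicitly (an assumption added here, not part of the original answer), still requiring every line to have exactly two characters:

```python
import numpy as np

# Hypothetical stand-in for the decoded contents of new.txt.
s = u"\u03bcm\n\u2202r\n\u2206\u03bb\n"

# NumPy's "U" dtype stores 4 bytes per character, so encode to
# little-endian UTF-32 (no BOM) to match dtype "<U2" exactly.
raw = s.replace(u"\n", u"").encode("utf-32-le")

# Reinterpret the bytes as fixed-width, two-character strings.
arr = np.frombuffer(raw, dtype="<U2")
```

The result is a read-only view of the bytes; call arr.copy() if the array needs to be modified.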

Edit: If the lines of your file have different lengths and you would like to avoid the intermediate list, you can use

arr = numpy.fromiter(codecs.open("new.txt", encoding="utf-8"), dtype="<U2")

I'm not sure if this will internally create some temporary list, though.

pv. · Answer

If you want to use loadtxt, you can either first load the raw byte array and then decode:

data = np.loadtxt('foo.txt', dtype='S8')
unicode_data = data.view(np.chararray).decode('utf-8')

or specify a converter for decoding:

data = np.loadtxt('foo.txt', converters={0: lambda x: unicode(x, 'utf-8')}, dtype='U2')

However, using fromiter as in Sven's answer is probably going to be more efficient than loadtxt.
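One detail of the first loadtxt variant: the width in 'S8' counts UTF-8 bytes rather than characters, so it has to be large enough for the longest encoded line. A small sketch of just the decode step, using np.char.decode on a hand-built byte array (the array literal is illustrative and stands in for what loadtxt would return):

```python
import numpy as np

expected = [u"\u03bcm", u"\u2202r", u"\u2206\u03bb"]

# Byte strings as they would appear in a UTF-8 encoded file.
data = np.array([t.encode("utf-8") for t in expected], dtype="S8")

# Element-wise decode from UTF-8 bytes to a unicode array.
unicode_data = np.char.decode(data, "utf-8")

# Length in bytes of the longest encoded line; the "S" width must cover it.
longest = max(len(b) for b in data.tolist())
```

Here the longest line has two characters but needs five bytes once encoded, so a width such as 'S2' would silently truncate it.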

Tags: python, unicode, numpy

Gökhan Sever
