
ASCII string as dtype for numpy array of strings in Python 3

NumPy's string dtype seems to correspond to Python's str and thus to change between Python 2.x and 3.x:

In Python 2.7:

In [1]: import numpy as np

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 1

In [3]: np.dtype((np.unicode_, 1)).itemsize
Out[3]: 4

In Python 3.3:

In [2]: np.dtype((np.str_, 1)).itemsize
Out[2]: 4

The version of NumPy is 1.7.0 in both cases.

I'm writing some code that I want to work on both Python versions, and I want an array of ASCII strings (4x memory overhead is not acceptable). So the questions are:

  • How do I define a dtype for an ASCII string of certain length (with 1 byte per char) in Python 3?
  • How do I do it in a way that also works in Python 2?
  • Bonus question: Can I limit the alphabet even further, e.g. to ascii_uppercase, and save a bit or two per char?

One potential answer to the first question is character arrays (i.e. an array of character arrays instead of an array of strings). It seems I can specify the item size when constructing one:

chararray(shape, itemsize=1, unicode=False, buffer=None, offset=0,
          strides=None, order=None)

Update: nah, the itemsize is actually the number of characters. But there's still unicode=False.

Is that the way to go?

Will it answer the last question, too?

And how do I actually use it as dtype?

Lev Levitsky asked Mar 03 '13 08:03


1 Answer

You can use the 'S' typestr:

>>> np.array(['Hello', 'World'], dtype='S')
array([b'Hello', b'World'], 
      dtype='|S5')
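
To make the memory difference concrete, here is a minimal sketch comparing the 'S' (bytes) and 'U' (unicode) typestrs on the same data. The variable names are illustrative only, and the input is assumed to be pure ASCII.

```python
import numpy as np

words = ['Hello', 'World']

# 'S' with no length lets NumPy infer it from the longest item (S5 here).
as_bytes = np.array(words, dtype='S')
# 'U' stores 4 bytes per character (UCS-4).
as_text = np.array(words, dtype='U')

print(as_bytes.dtype, as_bytes.dtype.itemsize, as_bytes.nbytes)
print(as_text.dtype, as_text.dtype.itemsize, as_text.nbytes)
```

The 'S' array uses one byte per character, so the item size equals the string length, while the 'U' array uses four times as much.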

Also, in Python 2.6/2.7 bytes is an alias for str, so spelling the type as bytes (or np.bytes_) gives the same dtype on both versions:

>>> np.dtype((bytes, 1)) # 2.7
dtype('|S1')
>>> np.dtype((bytes, 1)) # 3.2
dtype('|S1')
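
As a sketch of how such a dtype might be used for a fixed-length field, the following builds the same 8-byte string dtype three ways and allocates an array with it. The length 8 and the names `n`, `dt_a`, `dt_b`, `dt_c` and `codes` are arbitrary choices for illustration.

```python
import numpy as np

n = 8  # hypothetical fixed field width, in bytes

# Three equivalent spellings of "n-byte string".
dt_a = np.dtype((bytes, n))
dt_b = np.dtype('S%d' % n)
dt_c = np.dtype((np.bytes_, n))

# Allocate an array of empty fixed-width byte strings and fill one slot.
codes = np.zeros(3, dtype=dt_a)
codes[0] = b'ABC'

print(dt_a, dt_a.itemsize)
print(codes)
```

Shorter values are padded with null bytes in storage, and the padding is stripped again when an element is read back.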

And b'' literals are supported:

>>> np.array([b'Hello', b'World']) # 2.7
array(['Hello', 'World'], 
      dtype='|S5')
>>> np.array([b'Hello', b'World']) # 3.2
array([b'Hello', b'World'], 
      dtype='|S5')
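
Because the elements come back as bytes on Python 3, code that needs text at its boundaries may have to convert. One possible approach, assuming the data really is ASCII, is `np.char.decode` and `np.char.encode`; the names `raw`, `text` and `back` are illustrative.

```python
import numpy as np

raw = np.array([b'Hello', b'World'])    # compact storage, 1 byte per char

# Decode to a unicode array only where text is actually needed.
text = np.char.decode(raw, 'ascii')

# Encode back to the compact byte-string form for storage.
back = np.char.encode(text, 'ascii')

print(text.dtype.kind, text.tolist())
print(back.dtype, back.tolist())
```

Keeping the array in the 'S' form and decoding on demand preserves the memory saving while still giving str values where they are needed.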
Eryk Sun answered Sep 25 '22 04:09