
Numpy String Encoding

NumPy is an excellent tool for memory-efficient storage of Python objects, strings among them. For ANSI (single-byte) strings in NumPy arrays, only 1 byte per character is used.

However, there is one inconvenience. The stored objects are no longer of type str but bytes, which means they usually have to be decoded before further use, and that in turn makes for quite bulky code:

>>> import numpy
>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an b'apple' and a b'pear'
>>> print("Mary has an {} and a {}".format(my_array[0].decode('utf-8'),
... my_array[1].decode('utf-8')))
Mary has an apple and a pear

This inconvenience can be eliminated by using another data type, e.g.:

>>> my_array = numpy.array(['apple', 'pear'], dtype = 'U5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear

However, this comes at the cost of a 4-fold increase in memory usage:

>>> numpy.info(my_array)
class:  ndarray
shape:  (2,)
strides:  (20,)
itemsize:  20
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x1a5b020
byteorder:  little
byteswap:  False
type: <U5
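
The 4-fold figure can be checked directly by comparing the two dtypes side by side. This is a minimal sketch using the same two words as above; the names `byte_array` and `text_array` are introduced here only for illustration.

```python
import numpy as np

words = ['apple', 'pear']
byte_array = np.array(words, dtype='S5')   # fixed-width bytes: 1 byte per character
text_array = np.array(words, dtype='U5')   # fixed-width unicode: 4 bytes per character

print(byte_array.itemsize, byte_array.nbytes)   # 5 10
print(text_array.itemsize, text_array.nbytes)   # 20 40
```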

Is there a solution that combines the advantages of both, efficient memory allocation and convenient usage, for ANSI strings?

Roman asked Aug 25 '15 14:08



2 Answers

It's not a big improvement over decode, but astype works (and can be applied to the whole array rather than to each string). The longer unicode copy, however, stays in memory for as long as it is needed.

In [538]: x=my_array.astype('U');"Mary has an {} and a {}".format(x[0],x[1])
Out[538]: 'Mary has an apple and a pear'
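
To keep that longer copy small and short-lived, one option is to convert only the elements a given message needs. The following is a sketch under that assumption, not a benchmarked solution; the helper name `decoded` is hypothetical. As far as I know, astype('U') decodes the bytes as ASCII, so this suits plain ASCII data.

```python
import numpy as np

my_array = np.array(['apple', 'pear'], dtype='S5')

def decoded(arr, indices):
    """Return a unicode copy of only the selected elements of a bytes array."""
    # Fancy indexing copies the selected items; astype('U') then widens them.
    return arr[list(indices)].astype('U')

x = decoded(my_array, [0, 1])        # temporary unicode copy of two elements
message = "Mary has an {} and a {}".format(*x)
print(message)
print(my_array.nbytes, x.nbytes)     # the original bytes array is left unchanged
```

Once `x` goes out of scope, only the compact bytes array remains.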

I can't find anything in the format syntax that would force formatting without the b prefix.

https://stackoverflow.com/a/19864787/901925 - shows how to customize the Formatter class, changing the format_field method. I tried something similar with the convert_field method. But the calling syntax is still messy.

In [561]: import string

In [562]: def makeU(astr):
    return astr.decode('utf-8')
   .....: 

In [563]: class MyFormatter(string.Formatter):
    def convert_field(self, value, conversion):
        if 'q'== conversion:
            return makeU(value)
        else:
            return super(MyFormatter, self).convert_field(value, conversion)
   .....:         

In [564]: MyFormatter().format("Mary has an {!q} and a {!q}",my_array[0],my_array[1])
Out[564]: 'Mary has an apple and a pear'

A couple of other ways of doing this formatting:

In [642]: "Mary has an {1} and a {0} or {1}".format(*my_array.astype('U'))
Out[642]: 'Mary has an pear and a apple or pear'

This converts the array (on the fly) and unpacks it into positional arguments for format. It also works if the array is already unicode (uarray below being the unicode version of the array):

In [643]: "Mary has an {1} and a {0} or {1}".format(*uarray.astype('U'))
Out[643]: 'Mary has an pear and a apple or pear'

np.char provides functions that apply string methods to each element of a character array. With it, decode can be applied to the whole array:

In [644]: "Mary has a {1} and an {0}".format(*np.char.decode(my_array))
Out[644]: 'Mary has a pear and an apple'

(this doesn't work if the array is already unicode).
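
Because np.char.decode only applies to bytes arrays, a small wrapper can check the dtype first and pass unicode arrays through untouched. This is a sketch; the helper name `as_text` and its `encoding` default are my own assumptions, not part of NumPy.

```python
import numpy as np

def as_text(arr, encoding='utf-8'):
    """Return a unicode array: decode bytes arrays, pass other arrays through."""
    arr = np.asarray(arr)
    if arr.dtype.kind == 'S':            # fixed-width bytes, e.g. 'S5'
        return np.char.decode(arr, encoding)
    return arr                           # already unicode (kind 'U') or another dtype

my_array = np.array(['apple', 'pear'], dtype='S5')
uarray = np.array(['apple', 'pear'], dtype='U5')

from_bytes = "Mary has a {1} and an {0}".format(*as_text(my_array))
from_text = "Mary has a {1} and an {0}".format(*as_text(uarray))
print(from_bytes)
print(from_text)
```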

If you do much with string arrays, np.char is worth studying.

hpaulj answered Oct 27 '22 23:10


Given:

>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')

You can decode on the fly:

>>> print("Mary has an {} and a {}".format(*map(lambda b: b.decode('utf-8'), my_array)))
Mary has an apple and a pear

Or you can create a specific formatter:

import string
class ByteFormatter(string.Formatter):
    def __init__(self, decoder='utf-8'):
        self.decoder=decoder

    def format_field(self, value, spec):
        if isinstance(value, bytes):
            return value.decode(self.decoder)
        return super(ByteFormatter, self).format_field(value, spec)   

>>> print(ByteFormatter().format("Mary has an {} and a {}", *my_array))
Mary has an apple and a pear
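
One limitation worth noting: as written, format_field returns the decoded value directly, so a format spec such as {:>7} has no effect on bytes values. Below is a sketch of a variant that decodes first and then lets the base class apply the spec; the class name `SpecAwareByteFormatter` is hypothetical.

```python
import string
import numpy as np

class SpecAwareByteFormatter(string.Formatter):
    """Like ByteFormatter, but applies the format spec after decoding."""

    def __init__(self, decoder='utf-8'):
        super().__init__()
        self.decoder = decoder

    def format_field(self, value, spec):
        if isinstance(value, bytes):      # numpy.bytes_ is a subclass of bytes
            value = value.decode(self.decoder)
        # The base class calls format(value, spec), so width/alignment work.
        return super().format_field(value, spec)

my_array = np.array(['apple', 'pear'], dtype='S5')
line = SpecAwareByteFormatter().format("[{:>7}] [{:<7}]", *my_array)
print(line)   # [  apple] [pear   ]
```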
dawg answered Oct 28 '22 01:10