The numpy module is an excellent tool for memory-efficient storage of Python objects, among them strings. For ANSI strings in numpy arrays, only 1 byte per character is used.
However, there is one inconvenience. The type of the stored objects is no longer string but bytes, which means that they have to be decoded for further use in most cases, which in turn means quite bulky code:
>>> import numpy
>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an b'apple' and a b'pear'
>>> print("Mary has an {} and a {}".format(my_array[0].decode('utf-8'),
... my_array[1].decode('utf-8')))
Mary has an apple and a pear
This inconvenience can be eliminated by using another data type, e.g.:
>>> my_array = numpy.array(['apple', 'pear'], dtype = 'U5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear
However, this is achieved only at the cost of a 4-fold increase in memory usage:
>>> numpy.info(my_array)
class: ndarray
shape: (2,)
strides: (20,)
itemsize: 20
aligned: True
contiguous: True
fortran: True
data pointer: 0x1a5b020
byteorder: little
byteswap: False
type: <U5
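To put numbers on the difference, here is a minimal sketch that compares the two dtypes directly. The variable names `byte_arr` and `uni_arr` are illustrative and not part of the original session:

```python
import numpy as np

# Same two words stored with a bytes dtype and with a unicode dtype.
byte_arr = np.array(['apple', 'pear'], dtype='S5')
uni_arr = np.array(['apple', 'pear'], dtype='U5')

# itemsize is the number of bytes per element, nbytes the total for the array.
print(byte_arr.itemsize, byte_arr.nbytes)  # 5 10
print(uni_arr.itemsize, uni_arr.nbytes)    # 20 40
```

The 'U5' array uses 4 bytes per character, which is where the factor of four comes from.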
Is there a solution that combines advantages of both efficient memory allocation and convenient usage for ANSI strings?
String encoding: since Python 3.0, strings are stored as Unicode, i.e. each character in the string is represented by a code point. So each string is just a sequence of Unicode code points. For efficient storage of these strings, the sequence of code points is converted into a sequence of bytes.
The elements of a NumPy array, or simply an array, are usually numbers, but can also be booleans, strings, or other objects.
To encode string array values, use the numpy.char.encode() function. Its first argument, arr, is the input array to be encoded.
It's not a big difference over the decode, but astype works (and can be applied to the whole array rather than to each string). But the longer array will remain around as long as it is needed.
In [538]: x=my_array.astype('U');"Mary has an {} and a {}".format(x[0],x[1])
Out[538]: 'Mary has an apple and a pear'
I can't find anything in the format syntax that would force 'b'-less formatting.
https://stackoverflow.com/a/19864787/901925 shows how to customize the Formatter class, changing the format_field method. I tried something similar with the convert_field method. But the calling syntax is still messy.
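In [561]: import string

In [562]: def makeU(astr):
   .....:     # decode a bytes value to str
   .....:     return astr.decode('utf-8')
   .....:

In [563]: class MyFormatter(string.Formatter):
   .....:     def convert_field(self, value, conversion):
   .....:         # custom '!q' conversion: decode bytes before formatting
   .....:         if 'q' == conversion:
   .....:             return makeU(value)
   .....:         else:
   .....:             return super(MyFormatter, self).convert_field(value, conversion)
   .....: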
In [564]: MyFormatter().format("Mary has an {!q} and a {!q}",my_array[0],my_array[1])
Out[564]: 'Mary has an apple and a pear'
A couple of other ways of doing this formatting:
In [642]: "Mary has an {1} and a {0} or {1}".format(*my_array.astype('U'))
Out[642]: 'Mary has an pear and a apple or pear'
This converts the array (on the fly) and unpacks it into positional arguments for format. It also works if the array is already unicode:
In [643]: "Mary has an {1} and a {0} or {1}".format(*uarray.astype('U'))
Out[643]: 'Mary has an pear and a apple or pear'
np.char has functions that apply string methods to the elements of a character array. With this, decode can be applied to the whole array:
In [644]: "Mary has a {1} and an {0}".format(*np.char.decode(my_array))
Out[644]: 'Mary has a pear and an apple'
(this doesn't work if the array is already unicode).
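Since np.char.decode only accepts bytes arrays while astype('U') also accepts unicode ones, a small helper can branch on the dtype. The following is a sketch only; the function name `as_text` and its `encoding` parameter are hypothetical and do not come from NumPy:

```python
import numpy as np

def as_text(arr, encoding='utf-8'):
    """Return a unicode view of the data: decode bytes arrays, pass others through."""
    arr = np.asarray(arr)
    if arr.dtype.kind == 'S':          # 'S' marks a fixed-width bytes dtype
        return np.char.decode(arr, encoding)
    return arr                         # already unicode (kind 'U') or something else

my_array = np.array(['apple', 'pear'], dtype='S5')
uarray = np.array(['apple', 'pear'], dtype='U5')

msg_bytes = "Mary has a {1} and an {0}".format(*as_text(my_array))
msg_text = "Mary has a {1} and an {0}".format(*as_text(uarray))
print(msg_bytes)
print(msg_text)
```

Both calls produce the same sentence, so the calling code no longer has to know which dtype the array was stored with.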
If you do much with string arrays, np.char is worth a study.
Given:
>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')
You can decode on the fly:
>>> print("Mary has an {} and a {}".format(*map(lambda b: b.decode('utf-8'), my_array)))
Mary has an apple and a pear
Or you can create a specific formatter:
import string
class ByteFormatter(string.Formatter):
    def __init__(self, decoder='utf-8'):
        self.decoder = decoder
    def format_field(self, value, spec):
        # decode bytes values; everything else is formatted as usual
        if isinstance(value, bytes):
            return value.decode(self.decoder)
        return super(ByteFormatter, self).format_field(value, spec)
>>> print(ByteFormatter().format("Mary has an {} and a {}", *my_array))
Mary has an apple and a pear
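One limitation of returning the decoded value directly is that any format spec (width, alignment) is ignored for bytes values. A possible variant, sketched here under the hypothetical name `DecodingFormatter`, decodes first and then delegates to the base class so the spec still applies:

```python
import string
import numpy as np

class DecodingFormatter(string.Formatter):
    """Hypothetical variant: decode bytes, then let the base class apply the spec."""

    def __init__(self, encoding='utf-8'):
        super().__init__()
        self.encoding = encoding

    def format_field(self, value, format_spec):
        if isinstance(value, bytes):   # numpy bytes scalars are bytes subclasses
            value = value.decode(self.encoding)
        return super().format_field(value, format_spec)

my_array = np.array(['apple', 'pear'], dtype='S5')
line = DecodingFormatter().format("[{:>7}] [{:<6}]", *my_array)
print(line)  # [  apple] [pear  ]
```

The array itself stays in the compact 'S5' form; only the individual values passed to the formatter are decoded.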