Numpy String Encoding

Tags:

python

string

python-3.x

numpy

NumPy is an excellent tool for memory-efficient storage of Python objects, among them strings. For ANSI strings in NumPy arrays, only 1 byte per character is used.

However, there is one inconvenience. The stored objects are no longer of type str but bytes, which means that in most cases they have to be decoded before further use, which in turn means quite bulky code:

>>> import numpy
>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an b'apple' and a b'pear'
>>> print("Mary has an {} and a {}".format(my_array[0].decode('utf-8'),
... my_array[1].decode('utf-8')))
Mary has an apple and a pear

This inconvenience can be eliminated by using another data type, e.g.:

>>> my_array = numpy.array(['apple', 'pear'], dtype = 'U5')
>>> print("Mary has an {} and a {}".format(my_array[0], my_array[1]))
Mary has an apple and a pear

However, this comes at the cost of a 4-fold increase in memory usage (20 bytes per item instead of 5):

>>> numpy.info(my_array)
class:  ndarray
shape:  (2,)
strides:  (20,)
itemsize:  20
aligned:  True
contiguous:  True
fortran:  True
data pointer: 0x1a5b020
byteorder:  little
byteswap:  False
type: <U5
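The 4-fold figure can be checked directly from the itemsize and nbytes attributes. This is a minimal sketch; the variable names as_bytes and as_text are chosen here for illustration and are not from the question.

```python
import numpy as np

words = ['apple', 'pear']

# 'S5' stores 1 byte per character, 'U5' stores 4 bytes per character (UCS-4).
as_bytes = np.array(words, dtype='S5')
as_text = np.array(words, dtype='U5')

print(as_bytes.itemsize, as_bytes.nbytes)  # 5 10
print(as_text.itemsize, as_text.nbytes)    # 20 40
```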

Is there a solution that combines the advantages of both, efficient memory allocation and convenient usage, for ANSI strings?

asked Aug 25 '15 14:08

Roman

2 Answers

It's not much of an improvement over decode, but astype works, and it can be applied to the whole array rather than to each string. The larger unicode array stays in memory for as long as it is needed, though.

In [538]: x=my_array.astype('U');"Mary has an {} and a {}".format(x[0],x[1])
Out[538]: 'Mary has an apple and a pear'
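To make the memory trade-off of astype concrete, here is a small sketch. It assumes the same my_array as in the question; converting only a slice is one possible way to keep the temporary small, not something the question requires.

```python
import numpy as np

my_array = np.array(['apple', 'pear'], dtype='S5')

# astype returns a new array; the original byte array is left untouched.
x = my_array.astype('U')
print(x.dtype.kind, x.nbytes)                # U 40
print(my_array.dtype.kind, my_array.nbytes)  # S 10

# Converting only the elements that are needed keeps the temporary small.
first = my_array[:1].astype('U')
sentence = "Mary has an {}".format(first[0])
print(sentence)  # Mary has an apple
```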

I can't find anything in the format syntax that would force formatting without the b'...' prefix.

https://stackoverflow.com/a/19864787/901925 - shows how to customize the Formatter class, changing the format_field method. I tried something similar with the convert_field method. But the calling syntax is still messy.

# requires `import string` earlier in the session
In [562]: def makeU(astr):
   .....:     return astr.decode('utf-8')
   .....: 

In [563]: class MyFormatter(string.Formatter):
   .....:     def convert_field(self, value, conversion):
   .....:         if conversion == 'q':
   .....:             return makeU(value)
   .....:         else:
   .....:             return super(MyFormatter, self).convert_field(value, conversion)
   .....: 

In [564]: MyFormatter().format("Mary has an {!q} and a {!q}",my_array[0],my_array[1])
Out[564]: 'Mary has an apple and a pear'

A couple of other ways of doing this formatting:

In [642]: "Mary has an {1} and a {0} or {1}".format(*my_array.astype('U'))
Out[642]: 'Mary has an pear and a apple or pear'

This converts the array on the fly and unpacks it into positional arguments for format. It also works if the array is already unicode (uarray here is presumably the dtype 'U5' array from the question):

In [643]: "Mary has an {1} and a {0} or {1}".format(*uarray.astype('U'))
Out[643]: 'Mary has an pear and a apple or pear'

np.char (with numpy imported as np) has functions that apply string methods to each element of a character array. With it, decode can be applied to the whole array:

In [644]: "Mary has a {1} and an {0}".format(*np.char.decode(my_array))
Out[644]: 'Mary has a pear and an apple'

(this doesn't work if the array is already unicode).

If you do much with string arrays, np.char is worth a study.
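As a brief illustration of what np.char offers, the sketch below decodes a whole byte array and applies another string function without decoding. The choice of 'ascii' as the encoding and the names decoded and shouted are assumptions made for this example.

```python
import numpy as np

my_array = np.array(['apple', 'pear'], dtype='S5')

# Decode every element at once; the result is a unicode ('U') array.
decoded = np.char.decode(my_array, encoding='ascii')
print(decoded.dtype.kind, decoded.tolist())  # U ['apple', 'pear']

# Other np.char functions work on the byte array directly and keep it as bytes.
shouted = np.char.upper(my_array)
print(shouted.dtype.kind, shouted.tolist())  # S [b'APPLE', b'PEAR']
```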

answered Oct 27 '22 23:10

hpaulj

Given:

>>> my_array = numpy.array(['apple', 'pear'], dtype = 'S5')

You can decode on the fly:

>>> print("Mary has an {} and a {}".format(*map(lambda b: b.decode('utf-8'), my_array)))
Mary has an apple and a pear

Or you can create a specific formatter:

import string
class ByteFormatter(string.Formatter):
    def __init__(self, decoder='utf-8'):
        self.decoder=decoder

    def format_field(self, value, spec):
        if isinstance(value, bytes):
            return value.decode(self.decoder)
        return super(ByteFormatter, self).format_field(value, spec)   

>>> print(ByteFormatter().format("Mary has an {} and a {}", *my_array))
Mary has an apple and a pear
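Note that the formatter above returns the decoded value directly, so a format spec such as {:>7} is ignored for bytes. A possible variant, sketched here under the hypothetical name DecodingFormatter, decodes first and then delegates to the base class so that width and alignment still apply:

```python
import string

import numpy as np


class DecodingFormatter(string.Formatter):
    """Hypothetical variant: decode bytes, then apply the format spec as usual."""

    def __init__(self, encoding='utf-8'):
        super().__init__()
        self.encoding = encoding

    def format_field(self, value, format_spec):
        # numpy.bytes_ is a subclass of bytes, so array elements match here.
        if isinstance(value, bytes):
            value = value.decode(self.encoding)
        return super().format_field(value, format_spec)


my_array = np.array(['apple', 'pear'], dtype='S5')
line = DecodingFormatter().format("[{:>7}] [{:<6}]", *my_array)
print(line)  # [  apple] [pear  ]
```

The array itself stays in the compact 'S5' form; only the individual elements being printed are decoded.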

answered Oct 28 '22 01:10

dawg
