Numpy Convert String Representation of Boolean Array To Boolean Array

Tags:

python

numpy

Is there a native numpy way to convert an array of string representations of booleans, e.g.:

['True','False','True','False']

into an actual boolean array I can use for masking/indexing? I could loop through and rebuild the array, but for large arrays this is slow.

921

asked Jun 05 '13 16:06

Newmu

2 Answers

You should be able to do a boolean comparison, IIUC, whether the dtype is a string or object:

>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'], 
      dtype='|S5')
>>> a == "True"
array([ True, False,  True, False], dtype=bool)

or

>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False,  True, False], dtype=bool)
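
As a minimal sketch of how the comparison result could then be used for masking, the snippet below applies it to a second array. The names `flags` and `values` and the sample numbers are made up for illustration and are not part of the original answer.

```python
import numpy as np

# Hypothetical data: `flags` holds the string booleans,
# `values` is the array we want to filter with them.
flags = np.array(['True', 'False', 'True', 'False'])
values = np.array([10, 20, 30, 40])

mask = flags == 'True'     # elementwise comparison gives a bool array
selected = values[mask]    # boolean indexing keeps entries where mask is True

print(mask.dtype)          # bool
print(selected)            # [10 30]
```

Note that any string other than exactly 'True' (for example 'true' or ' True') compares as False here, so inputs with inconsistent casing or whitespace would need normalising first.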

140

answered Oct 07 '22 14:10

DSM

I've found a method that's even faster than DSM's, taking inspiration from Eric, though the improvement is best seen with smaller lists of values; for very large lists, the cost of iterating in Python starts to outweigh the advantage of performing the truth test while the numpy array is being created rather than afterwards.

I tested with both is and ==. The is version only works when the strings are interned, because is compares object identity rather than value. 'True' will usually be a literal in the script and therefore interned, but strings read from a file or built at runtime may not be, so == is the safe choice. While my version with == was slower than with is, it was still much faster than DSM's version.
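
To illustrate the interning caveat, here is a small sketch that builds a 'True' string at runtime and compares it by value and by identity. The variable names are hypothetical, and the result of the identity test depends on the Python implementation, so no output is claimed for it.

```python
import numpy as np

literal = 'True'
# Built at runtime: equal in value to the literal, but usually a
# separate object, since such strings are generally not interned.
runtime_true = ''.join(['Tr', 'ue'])

print(runtime_true == literal)   # True: same value
print(runtime_true is literal)   # implementation-dependent identity test

x = [runtime_true, 'False', runtime_true]

# The == version gives the right answer regardless of interning.
by_eq = np.fromiter((e == 'True' for e in x), dtype=bool, count=len(x))
print(by_eq)                     # [ True False  True]
```

Passing `count` is optional, but it lets `np.fromiter` allocate the output once when the length is known in advance.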

Test setup:

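import timeit

def timer(statement, count):
    # The setup builds a list x of `count` random 'True'/'False' strings.
    # timeit.repeat's defaults apply, so each figure below is the total
    # time in seconds for 1,000,000 executions of the statement.
    setup = ("from random import choice;import numpy as np;"
             "x = [choice(['True', 'False']) for i in range(%i)]" % count)
    return timeit.repeat(statement, setup)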

>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"

With 1000 items, the fromiter statements take roughly 65% to 72% of the time of DSM's:

>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]

For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:

>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]

A bit over 25% of DSM's when done with 50 items per list:

>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]

For 5 items, about 11% of DSM's:

>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]

answered Oct 07 '22 13:10

JAB
