Is there a native numpy way to convert an array of string representations of booleans eg:
['True','False','True','False']
To an actual boolean array I can use for masking/indexing? I could do a for loop going through and rebuilding the array but for large arrays this is slow.
Use the bool() Function to Convert String to Boolean in Python. We can pass a string as the argument of the function to convert the string to a boolean value. This function returns true for every non-empty argument and false for empty arguments.
To convert a Boolean array a to an integer array, use the a. astype(int) method call. The single argument int specifies the desired data type of each array item.
We can also index NumPy arrays using a NumPy array of boolean values on one axis to specify the indices that we want to access. This will create a NumPy array of size 3x4 (3 rows and 4 columns) with values from 0 to 11 (value 12 not included).
You should be able to do a boolean comparison, IIUC, whether the dtype
is a string or object
:
>>> a = np.array(['True', 'False', 'True', 'False'])
>>> a
array(['True', 'False', 'True', 'False'],
dtype='|S5')
>>> a == "True"
array([ True, False, True, False], dtype=bool)
or
>>> a = np.array(['True', 'False', 'True', 'False'], dtype=object)
>>> a
array(['True', 'False', 'True', 'False'], dtype=object)
>>> a == "True"
array([ True, False, True, False], dtype=bool)
I've found a method that's even faster than DSM's, taking inspiration from Eric, though the improvement is best seen with smaller lists of values; at very large values, the cost of the iterating itself starts to outweigh the advantage of performing the truth testing during creation of the numpy array rather than after. Testing with both is
and ==
(for situations where the strings are interned versus when they might not be, as is
would not work with non-interned strings. As 'True'
is probably going to be a literal in the script it should be interned, though) showed that while my version with ==
was slower than with is
, it was still much faster than DSM's version.
Test setup:
import timeit
def timer(statement, count):
return timeit.repeat(statement, "from random import choice;import numpy as np;x = [choice(['True', 'False']) for i in range(%i)]" % count)
>>> stateIs = "y = np.fromiter((e is 'True' for e in x), bool)"
>>> stateEq = "y = np.fromiter((e == 'True' for e in x), bool)"
>>> stateDSM = "y = np.array(x) == 'True'"
With 1000 items, the faster statements take about 66% the time of DSM's:
>>> timer(stateIs, 1000)
[101.77722641656146, 100.74985342340369, 101.47228618107965]
>>> timer(stateEq, 1000)
[112.26464996250706, 112.50754567379681, 112.76057346127709]
>>> timer(stateDSM, 1000)
[155.67689949529995, 155.96820504501557, 158.32394669279802]
For smaller string arrays (in the hundreds rather than thousands), the elapsed time is less than 50% of DSM's:
>>> timer(stateIs, 100)
[11.947757485669172, 11.927990253608186, 12.057855628259858]
>>> timer(stateEq, 100)
[13.064947253943501, 13.161545451986967, 13.30599035623618]
>>> timer(stateDSM, 100)
[31.270060799078237, 30.941749748808434, 31.253922641324607]
A bit over 25% of DSM's when done with 50 items per list:
>>> timer(stateIs, 50)
[6.856538342483873, 6.741083326021908, 6.708402786859551]
>>> timer(stateEq, 50)
[7.346079345032194, 7.312723444475523, 7.309259899921017]
>>> timer(stateDSM, 50)
[24.154247576229864, 24.173593700599667, 23.946403452288905]
For 5 items, about 11% of DSM's:
>>> timer(stateIs, 5)
[1.8826215278058953, 1.850232652068371, 1.8559381315990322]
>>> timer(stateEq, 5)
[1.9252821868467436, 1.894011299061276, 1.894306935199893]
>>> timer(stateDSM, 5)
[18.060974208809057, 17.916322392367874, 17.8379771602049]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With