Problem:
Given an array of string data
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21'),
I would like a function that returns the indexed dataset
indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')
and a lookup table
lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')
such that
(lookupTable[indexed_dataSet] == dataSet).all()
is true. Note that the indexed_dataSet and lookupTable can both be permuted such that the above holds, and that is fine (i.e. it is not necessary that the order of lookupTable is equivalent to the order of first appearance in dataSet).
Slow Solution:
I currently have the following slow solution
import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
    Input:
        dataSet : A length n numpy array to be indexed
    Output:
        indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
        lookupTable : A lookup table such that lookupTable[indexed_dataSet] = dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    count = -1
    for label in labels:
        count += 1
        indexed_dataSet[np.where(dataSet == label)] = count
        lookupTable[count] = label
    return indexed_dataSet, lookupTable
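For reference, the round-trip property can be checked like this (the exact labels assigned depend on set iteration order, but the reconstruction always holds):
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
indexed_dataSet, lookupTable = indexDataSet(dataSet)
assert (lookupTable[indexed_dataSet] == dataSet).all()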
Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.
You can use np.unique
with the return_inverse
argument:
>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'],
dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])
If you like, you can reconstruct your original array from these two arrays:
>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'],
dtype='<U21')
If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet)
will achieve the same thing (and potentially be more efficient for large arrays). Note that factorize returns the codes first and the unique values second, in order of first appearance.
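A quick check on the sample data (a sketch, assuming pandas is available; the lookup table here comes back in first-appearance order rather than sorted):
>>> import pandas as pd
>>> indexed_dataSet, lookupTable = pd.factorize(dataSet)
>>> (np.asarray(lookupTable)[indexed_dataSet] == dataSet).all()
True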
np.searchsorted does the trick:
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
lut = np.sort(np.unique(dataSet))    # ['george', 'greg', 'kevin']
ind = np.searchsorted(lut, dataSet)  # array([2, 1, 0, 2])
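As a quick sanity check (using the names above), the reconstruction property can be verified with:
assert (lut[ind] == dataSet).all()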