
Map a NumPy array of strings to integers

Problem:

Given an array of string data

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')

I would like a function that returns the indexed dataset

indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')

and a lookup table

lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')

such that

(lookupTable[indexed_dataSet] == dataSet).all()

is true. Note that indexed_dataSet and lookupTable may be permuted together, as long as the above still holds (i.e. the order of lookupTable does not have to match the order of first appearance in dataSet).

Slow Solution:

I currently have the following slow solution

import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
       Input:
           dataSet         : A length n numpy array to be indexed
       Output:
           indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
           lookupTable     : A lookup table such that lookupTable[indexed_dataSet] == dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    # One full pass over dataSet per distinct label, which is what makes this slow
    for count, label in enumerate(labels):
        indexed_dataSet[dataSet == label] = count
        lookupTable[count] = label

    return indexed_dataSet, lookupTable

Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.

rwolst asked Apr 17 '16 12:04



2 Answers

You can use np.unique with the return_inverse argument:

>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'], 
      dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])

If you like, you can reconstruct your original array from these two arrays:

>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'], 
      dtype='<U21')

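If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet) will achieve much the same thing (and potentially be more efficient for large arrays). Note that pd.factorize returns the integer codes first and the unique values second, and by default it lists the unique values in order of first appearance rather than sorted.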
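
If the lookup table should follow the order of first appearance, as in the example in the question, a NumPy-only variant is possible with the return_index argument of np.unique. This is only a sketch: the function name index_by_first_appearance and the helper variables are made up for illustration, and it assumes a one-dimensional input array.

```python
import numpy as np

def index_by_first_appearance(data):
    """Return (codes, lookup) with lookup ordered by first appearance in data.

    Hypothetical helper, assuming a 1-D input array.
    """
    # uniques is sorted; first_idx[i] is the first position of uniques[i] in data
    uniques, first_idx, inverse = np.unique(
        data, return_index=True, return_inverse=True
    )
    # Positions in the sorted table, listed in order of first appearance
    order = np.argsort(first_idx)
    # rank[i] is the new code assigned to uniques[i]
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)
    return rank[inverse], uniques[order]

data = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
codes, lookup = index_by_first_appearance(data)
print(codes)   # [0 1 2 0]
print(lookup)  # ['kevin' 'greg' 'george']
```

The extra argsort only touches the unique values, so the cost is still dominated by the sort inside np.unique.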

Alex Riley answered Sep 28 '22 03:09

np.searchsorted does the trick:

dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
lut = np.unique(dataSet)             # array(['george', 'greg', 'kevin']); np.unique already returns sorted values
ind = np.searchsorted(lut, dataSet)  # array([2, 1, 0, 2])
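
One caveat with np.searchsorted is that it returns an insertion position, not a match, so a label that is missing from the lookup table silently gets the index of a neighbouring entry. That matters mainly when an existing table is reused to encode new data. A minimal sketch of a guarded version follows; the function name encode_with_lookup and the use of -1 as the "not found" marker are assumptions for illustration, and the lookup table is assumed to be sorted and non-empty.

```python
import numpy as np

def encode_with_lookup(lookup, values):
    """Map values to indices into a sorted lookup table.

    Hypothetical helper: labels that are not in the table get -1.
    """
    values = np.asarray(values)
    idx = np.searchsorted(lookup, values)
    # searchsorted can return len(lookup) for values past the last entry
    idx = np.minimum(idx, lookup.size - 1)
    # Keep an index only where the table entry really equals the value
    found = lookup[idx] == values
    return np.where(found, idx, -1)

lookup = np.unique(np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21'))
codes = encode_with_lookup(lookup, np.array(['greg', 'zoe', 'kevin', 'alice']))
print(codes)  # [ 1 -1  2 -1]
```

When the table is built from the same array that is being encoded, every label is present and the plain np.searchsorted call is enough.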

Bob Baxley answered Sep 28 '22 04:09