Problem:
Given an array of string data
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21'),
I would like a function that returns the indexed dataset
indexed_dataSet = np.array([0, 1, 2, 0], dtype='int')
and a lookup table
lookupTable = np.array(['kevin', 'greg', 'george'], dtype='U21')
such that
(lookupTable[indexed_dataSet] == dataSet).all()
is true. Note that the indexed_dataSet and lookupTable can both be permuted such that the above holds, and that is fine (i.e. it is not necessary that the order of lookupTable is equivalent to the order of first appearance in dataSet).
Slow Solution:
I currently have the following slow solution
import numpy as np

def indexDataSet(dataSet):
    """Returns the indexed dataSet and a lookup table
    Input:
        dataSet : A length n numpy array to be indexed
    Output:
        indexed_dataSet : A length n numpy array containing values in {0, ..., len(set(dataSet))-1}
        lookupTable : A lookup table such that lookupTable[indexed_dataSet] = dataSet"""
    labels = set(dataSet)
    lookupTable = np.empty(len(labels), dtype='U21')
    indexed_dataSet = np.zeros(dataSet.size, dtype='int')
    count = -1
    for label in labels:
        count += 1
        indexed_dataSet[np.where(dataSet == label)] = count
        lookupTable[count] = label
    return indexed_dataSet, lookupTable
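For reference, the round-trip property can be checked like this (the exact labels assigned depend on set iteration order, but the reconstruction always holds):
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
indexed_dataSet, lookupTable = indexDataSet(dataSet)
assert (lookupTable[indexed_dataSet] == dataSet).all()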
Is there a quicker way to do this? I feel like I am not using numpy to its full potential here.
You can use np.unique
with the return_inverse
argument:
>>> lookupTable, indexed_dataSet = np.unique(dataSet, return_inverse=True)
>>> lookupTable
array(['george', 'greg', 'kevin'],
dtype='<U21')
>>> indexed_dataSet
array([2, 1, 0, 2])
If you like, you can reconstruct your original array from these two arrays:
>>> lookupTable[indexed_dataSet]
array(['kevin', 'greg', 'george', 'kevin'],
dtype='<U21')
If you use pandas, indexed_dataSet, lookupTable = pd.factorize(dataSet)
will achieve the same thing (and potentially be more efficient for large arrays). Note that factorize returns the codes first and the unique values second, in order of first appearance.
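A quick check on the sample data (a sketch, assuming pandas is available; the lookup table here comes back in first-appearance order rather than sorted):
>>> import pandas as pd
>>> indexed_dataSet, lookupTable = pd.factorize(dataSet)
>>> (np.asarray(lookupTable)[indexed_dataSet] == dataSet).all()
True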
np.searchsorted does the trick:
dataSet = np.array(['kevin', 'greg', 'george', 'kevin'], dtype='U21')
lut = np.sort(np.unique(dataSet))    # ['george', 'greg', 'kevin']
ind = np.searchsorted(lut, dataSet)  # array([2, 1, 0, 2])
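As a quick sanity check (using the names above), the reconstruction property can be verified with:
assert (lut[ind] == dataSet).all()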