I have an array of data where each row represents a sample (5 samples) and each column represents a feature (6 features per sample).
I'm trying to count the number of distinct states each column contains and then map those states to a set of integers. This should only be done when a column is not already numeric.
This is easier to explain with an example:
example input (Input is of type numpy.ndarray):
In = array([['x', 's', 3, 'k', 's', 'u'],
            ['x', 's', 2, 'n', 'n', 'g'],
            ['b', 's', 0, 'n', 'n', 'm'],
            ['k', 'y', 1, 'w', 'v', 'l'],
            ['x', 's', 2, 'o', 'c', 'l']], dtype=object)
For the first column:

curr_column = 0
colset = set()
for row in In:
    curr_element = row[curr_column]
    if curr_element not in colset:
        colset.add(curr_element)
# now colset = {'x', 'b', 'k'}, so 3 possible states

collist = list(colset)  # make it indexable
coldict = {}
for i in range(len(collist)):
    coldict[collist[i]] = i
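(Equivalently, the same dictionary could be built in one line with a dict comprehension, for example:)

coldict = {val: i for i, val in enumerate(collist)}  # distinct value -> integer code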
This produces a dictionary, so I can now recreate the original data as follows (assuming coldict = {'x': 0, 'b': 1, 'k': 2}):
for i in range(len(In)):  # loop over each row
    curr_element = In[i][curr_column]  # get current element
    In[i][curr_column] = coldict[curr_element]  # use it to find the numerical value
'''
now
In = array([[0, 's', 3, 'k', 's', 'u'],
            [0, 's', 2, 'n', 'n', 'g'],
            [1, 's', 0, 'n', 'n', 'm'],
            [2, 'y', 1, 'w', 'v', 'l'],
            [0, 's', 2, 'o', 'c', 'l']], dtype=object)
'''
Now repeat the process for every column.
I'm aware that I could speed this up by populating all the column dictionaries in one pass over the dataset and then replacing the values in a single loop as well; I left that out for clarity.
This approach is horribly inefficient in both space and time and takes a long time on large datasets. In what ways could it be improved? Is there a mapping function in NumPy or pandas that could accomplish this, or at least help?
I considered something similar to

np.unique(Input, axis=1)

but I need this to be portable, and not everyone has the 1.13.0 development version of NumPy.
Also, how would I differentiate between columns that are numeric and ones that aren't, so that I can decide which columns to apply this to?
Pandas also has a map function that you can use. So if, for example, you have this dictionary that maps the strings to codes:
codes = {'x':0, 'b':1, 'k':2}
you can use map to translate that column of the pandas DataFrame:
df[col] = df[col].map(codes)
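For example, here is a minimal sketch that builds such a dictionary automatically for every non-numeric column, assuming In is the object array from the question; the infer_objects() call (available in recent pandas versions) is only there so the integer column gets a real numeric dtype and is skipped:

import pandas as pd

df = pd.DataFrame(In).infer_objects()  # object array -> DataFrame; the integer column becomes int64

for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        # unique() preserves order of first appearance, so 'x' -> 0, 'b' -> 1, 'k' -> 2
        codes = {val: i for i, val in enumerate(df[col].unique())}
        df[col] = df[col].map(codes)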
You can use Categorical codes. See the Categorical section of the docs.
In [11]: df
Out[11]:
   0  1  2  3  4  5
0  x  s  3  k  s  u
1  x  s  2  n  n  g
2  b  s  0  n  n  m
3  k  y  1  w  v  l
4  x  s  2  o  c  l

In [12]: for col in df.columns:
    ...:     df[col] = pd.Categorical(df[col], categories=df[col].unique()).codes
    ...:

In [13]: df
Out[13]:
   0  1  2  3  4  5
0  0  0  0  0  0  0
1  0  0  1  1  1  1
2  1  0  2  1  1  2
3  2  1  3  2  2  3
4  0  0  1  3  3  3
I suspect there's a small change which would allow doing this without passing the categories explicitly. (Note: pandas does guarantee that .unique() is in seen order.)
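(For reference, pd.factorize gives the same seen-order codes without building a Categorical explicitly; inside the loop above that could look like, for example:)

df[col], _ = pd.factorize(df[col])  # codes assigned in order of first appearance; _ holds the unique values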
Note: To "differentiate between columns that are numeric and ones that aren't" you can use select_dtypes before iterating:

for col in df.select_dtypes(exclude=['int']).columns:
    ...
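Putting the two pieces together, a minimal sketch that starts from the question's object array (the infer_objects() call is an assumption here: without it every column built from an object array has object dtype and would look non-numeric; 'number' is used instead of 'int' so float columns are skipped as well):

import pandas as pd

df = pd.DataFrame(In).infer_objects()

# Encode only the non-numeric columns, keeping the codes in seen order.
for col in df.select_dtypes(exclude=['number']).columns:
    df[col] = pd.Categorical(df[col], categories=df[col].unique()).codes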