Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert array of string (category) to array of int from a pandas dataframe

I am trying to do something very similar to that previous question but I get an error. I have a pandas dataframe containing features,label I need to do some convertion to send the features and the label variable into a machine learning object:

import pandas
import milk
from scikits.statsmodels.tools import categorical

then I have:

trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'

labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets

The output console yields first:

features
[[ 0.38846334  0.97681855]
[ 3.8318634   0.5724734 ]
[ 0.67710876  1.01816444]
[ 1.12024943  0.91508699]
[ 7.51749674  1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]

I meet then the following error:

Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'

Is it possible to convert the category variable 'type' within the dataframe into int type ? 'type' can take the values 'single', 'touching','nuclei','dusts' and I need to convert with int values such 0, 1, 2, 3.

like image 405
Jean-Pat Avatar asked Oct 18 '11 20:10

Jean-Pat


People also ask

How do you convert an array of strings into an array of integers?

You can convert a String to integer using the parseInt() method of the Integer class. To convert a string array to an integer array, convert each element of it to integer and populate the integer array with them.

How do you convert an array of strings to integers in Python?

To convert, or cast, a string to an integer in Python, you use the int() built-in function. The function takes in as a parameter the initial string you want to convert, and returns the integer equivalent of the value you passed. The general syntax looks something like this: int("str") .

How do you convert categorical data to numeric in pandas?

Method 1: Using replace() method Replacing is one of the methods to convert categorical terms into numeric. For example, We will take a dataset of people's salaries based on their level of education. This is an ordinal type of categorical variable. We will convert their education levels into numeric terms.


2 Answers

The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.

For a Series:

In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
                       'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')

For a DataFrame:

In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei', 
                       'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]), 
        Index([u'single', u'touching', u'nuclei', u'dusts'],
        dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[4]:
         labels  labels_enc
    0    single           0
    1  touching           1
    2    nuclei           2
    3     dusts           3
    4  touching           1
    5    single           0
    6    nuclei           2
like image 190
tomp Avatar answered Oct 07 '22 20:10

tomp


If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor class (available in the pandas namespace):

In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])

In [2]: s
Out[2]: 
0    single
1    touching
2    nuclei
3    dusts
4    touching
5    single
6    nuclei
Name: None, Length: 7

In [4]: Factor(s)
Out[4]: 
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]

The factor has attributes labels and levels:

In [7]: f = Factor(s)

In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)

In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)

This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.

BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.

like image 30
Wes McKinney Avatar answered Oct 07 '22 22:10

Wes McKinney