I am trying to do something very similar to that previous question but I get an error. I have a pandas dataframe containing features,label I need to do some convertion to send the features and the label variable into a machine learning object:
import pandas
import milk
from scikits.statsmodels.tools import categorical
then I have:
trainedData=bigdata[bigdata['meta']<15]
untrained=bigdata[bigdata['meta']>=15]
#print trainedData
#extract two columns from trainedData
#convert to numpy array
features=trainedData.ix[:,['ratio','area']].as_matrix(['ratio','area'])
un_features=untrained.ix[:,['ratio','area']].as_matrix(['ratio','area'])
print 'features'
print features[:5]
##label is a string:single, touching,nuclei,dust
print 'labels'
labels=trainedData.ix[:,['type']].as_matrix(['type'])
print labels[:5]
#convert single to 0, touching to 1, nuclei to 2, dusts to 3
#
tmp=categorical(labels,drop=True)
targets=categorical(labels,drop=True).argmax(1)
print targets
The output console yields first:
features
[[ 0.38846334 0.97681855]
[ 3.8318634 0.5724734 ]
[ 0.67710876 1.01816444]
[ 1.12024943 0.91508699]
[ 7.51749674 1.00156707]]
labels
[[single]
[touching]
[single]
[single]
[nuclei]]
I meet then the following error:
Traceback (most recent call last):
File "/home/claire/Applications/ProjetPython/projet particule et objet/karyotyper/DAPI-Trainer02-MILK.py", line 83, in <module>
tmp=categorical(labels,drop=True)
File "/usr/local/lib/python2.6/dist-packages/scikits.statsmodels-0.3.0rc1-py2.6.egg/scikits/statsmodels/tools/tools.py", line 206, in categorical
tmp_dummy = (tmp_arr[:,None]==data).astype(float)
AttributeError: 'bool' object has no attribute 'astype'
Is it possible to convert the category variable 'type' within the dataframe into int type ? 'type' can take the values 'single', 'touching','nuclei','dusts' and I need to convert with int values such 0, 1, 2, 3.
You can convert a String to integer using the parseInt() method of the Integer class. To convert a string array to an integer array, convert each element of it to integer and populate the integer array with them.
To convert, or cast, a string to an integer in Python, you use the int() built-in function. The function takes in as a parameter the initial string you want to convert, and returns the integer equivalent of the value you passed. The general syntax looks something like this: int("str") .
Method 1: Using replace() method Replacing is one of the methods to convert categorical terms into numeric. For example, We will take a dataset of people's salaries based on their level of education. This is an ordinal type of categorical variable. We will convert their education levels into numeric terms.
The previous answers are outdated, so here is a solution for mapping strings to numbers that works with version 0.18.1 of Pandas.
For a Series:
In [1]: import pandas as pd
In [2]: s = pd.Series(['single', 'touching', 'nuclei', 'dusts',
'touching', 'single', 'nuclei'])
In [3]: s_enc = pd.factorize(s)
In [4]: s_enc[0]
Out[4]: array([0, 1, 2, 3, 1, 0, 2])
In [5]: s_enc[1]
Out[5]: Index([u'single', u'touching', u'nuclei', u'dusts'], dtype='object')
For a DataFrame:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'labels': ['single', 'touching', 'nuclei',
'dusts', 'touching', 'single', 'nuclei']})
In [3]: catenc = pd.factorize(df['labels'])
In [4]: catenc
Out[4]: (array([0, 1, 2, 3, 1, 0, 2]),
Index([u'single', u'touching', u'nuclei', u'dusts'],
dtype='object'))
In [5]: df['labels_enc'] = catenc[0]
In [6]: df
Out[4]:
labels labels_enc
0 single 0
1 touching 1
2 nuclei 2
3 dusts 3
4 touching 1
5 single 0
6 nuclei 2
If you have a vector of strings or other objects and you want to give it categorical labels, you can use the Factor
class (available in the pandas
namespace):
In [1]: s = Series(['single', 'touching', 'nuclei', 'dusts', 'touching', 'single', 'nuclei'])
In [2]: s
Out[2]:
0 single
1 touching
2 nuclei
3 dusts
4 touching
5 single
6 nuclei
Name: None, Length: 7
In [4]: Factor(s)
Out[4]:
Factor:
array([single, touching, nuclei, dusts, touching, single, nuclei], dtype=object)
Levels (4): [dusts nuclei single touching]
The factor has attributes labels
and levels
:
In [7]: f = Factor(s)
In [8]: f.labels
Out[8]: array([2, 3, 1, 0, 3, 2, 1], dtype=int32)
In [9]: f.levels
Out[9]: Index([dusts, nuclei, single, touching], dtype=object)
This is intended for 1D vectors so not sure if it can be instantly applied to your problem, but have a look.
BTW I recommend that you ask these questions on the statsmodels and / or scikit-learn mailing list since most of us are not frequent SO users.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With