I am trying to run some machine learning algorithms on a dataset using scikit-learn. My dataset has some features that are categories. For example, one feature is A, which has values 1, 2, 3 specifying the quality of something (1: Upper, 2: Second, 3: Third class). So it's an ordinal variable.
Similarly, I re-coded a variable City, having three values ('London', 'Zurich', 'New York'), into 1, 2, 3, but with no specific preference for the values. So now this is a nominal categorical variable.
How do I tell the algorithm to treat these as categorical, ordinal, etc. in pandas? In R, a categorical variable is specified by factor(a) and hence is not considered a continuous value. Is there anything like that in pandas/Python?
This type of categorical variable is called an ordinal variable because the values can be ordered or ranked. A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin.
In statistics, ordinal and nominal variables are both considered categorical variables. Even though ordinal data can sometimes be numerical, not all mathematical operations can be performed on them.
The factor() function also allows you to assign an order to nominal variables, thus making them ordinal variables. This is done by setting the ordered argument to TRUE and by assigning a vector with the desired level hierarchy to the levels argument.
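In pandas, the closest counterpart to R's factor() is the categorical dtype. The following is a minimal sketch, assuming the OP's two columns are named A and City; the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the OP's two features
df = pd.DataFrame({
    "A": [1, 3, 2, 1],
    "City": ["London", "Zurich", "New York", "London"],
})

# Ordinal: 1 (Upper) ranks above 2 (Second), which ranks above 3 (Third),
# so the categories are listed from lowest to highest rank
df["A"] = pd.Categorical(df["A"], categories=[3, 2, 1], ordered=True)

# Nominal: categories without any order
df["City"] = df["City"].astype("category")

print(df.dtypes)
print(df["A"].cat.codes.tolist())  # integer codes that follow the declared order
```

Note that the categorical dtype mainly documents intent inside pandas. scikit-learn estimators generally still expect numeric input, so some encoding step is usually needed before fitting.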
... years later (and because I think a good explanation of these issues is required not only for this question but to help remind myself in the future)
In general, one would translate categorical variables into dummy variables (or use one of a host of other methodologies) when they are nominal, i.e. they have no sense of a > b > c. In the OP's original question, this would only be performed on the cities: London, Zurich, New York.
For this type of issue, pandas provides, by far, the easiest transformation: pandas.get_dummies. So:
import numpy
import pandas
# create a sample of the OP's unique values
series = pandas.Series(
    numpy.random.randint(low=0, high=3, size=100))
mapper = {0: 'New York', 1: 'London', 2: 'Zurich'}
nomvar = series.replace(mapper)
# now let's use pandas.get_dummies
# dtype=int gives 0/1 columns (pandas >= 2.0 defaults to booleans)
print(pandas.get_dummies(nomvar, dtype=int))
Out[57]:
   London  New York  Zurich
0       0         0       1
1       0         1       0
2       0         1       0
3       1         0       0
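When the nominal column sits in a DataFrame next to other features, get_dummies can be restricted to that column. This is a small sketch with made-up rows, assuming the OP's columns are named A and City:

```python
import pandas as pd

# Hypothetical rows using the OP's two features
df = pd.DataFrame({
    "A": [1, 3, 2, 1],
    "City": ["London", "Zurich", "New York", "London"],
})

# One indicator column per city; the ordinal column A is left untouched
encoded = pd.get_dummies(df, columns=["City"], dtype=int)
print(encoded.columns.tolist())
```

The City column is replaced by City_London, City_New York and City_Zurich, while A keeps its original values.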
However, in the case of ordinal variables, the user must be cautious in using pandas.factorize. The reason is that the engineer wants to preserve the relationship in the mapping such that a > b > c.
So if I want to take a set of categorical variables where large > medium > small, and preserve that, I need to make sure that pandas.factorize preserves that relationship.
# leveraging the variables already created above
mapper = {0: 'small', 1: 'medium', 2: 'large'}
ordvar = series.replace(mapper)
print(pandas.factorize(ordvar))
Out[58]:
(array([0, 1, 1, 2, 1,... 0, 0]),
Index(['large', 'small', 'medium'], dtype='object'))
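In fact, the relationship that needs to be preserved in order to maintain the concept of ordinal has been lost using pandas.factorize, because it assigns codes in order of first appearance rather than by rank (here large is 0, small is 1 and medium is 2). In an instance like this, I use my own mappings to ensure that the ordinal attributes are preserved.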
preserved_mapper = {'large': 2, 'medium': 1, 'small': 0}
print(ordvar.replace(preserved_mapper))
Out[78]:
0 2
1 0
...
99 2
dtype: int64
In fact, creating your own dict to map the values not only preserves your desired ordinal relationship but also helps keep the contents and mappings of your prediction algorithm organized. You have not lost any ordinal information in the process, and you have a stored record of what each mapping for each variable is.
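One way to keep such mappings organized is a single registry of dicts plus a small helper that refuses labels it does not know. This is only a sketch: ORDINAL_MAPS and encode_ordinal are hypothetical names for illustration, not pandas APIs.

```python
import pandas as pd

# Hypothetical registry: one explicit mapping per ordinal column
ORDINAL_MAPS = {
    "size": {"small": 0, "medium": 1, "large": 2},
}

def encode_ordinal(column: pd.Series, mapping: dict) -> pd.Series:
    """Map labels to ranks, failing loudly on labels the mapping lacks."""
    unknown = set(column.dropna().unique()) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped labels: {sorted(unknown)}")
    return column.map(mapping)

sizes = pd.Series(["large", "small", "medium", "large"], name="size")
codes = encode_ordinal(sizes, ORDINAL_MAPS["size"])
print(codes.tolist())
```

The explicit check matters because Series.map silently turns unmapped labels into NaN, and Series.replace leaves them unchanged, so a typo in a label could otherwise go unnoticed.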
ints into sklearn
Lastly, the OP spoke about passing the information into scikit-learn classifiers, which means that ints are required. For that case, make sure you're aware of the astype(int) gotcha that is detailed here if you have any NaNs in your data: a column that contains NaN cannot be cast to int directly.
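The NaN issue can be reproduced in a few lines. This is a sketch with made-up values; the two workarounds shown are common options, not the only ones.

```python
import numpy as np
import pandas as pd

ordvar = pd.Series(["large", np.nan, "small"])
codes = ordvar.map({"small": 0, "medium": 1, "large": 2})
print(codes.dtype)  # float64, because NaN cannot be stored in a plain int column

try:
    codes.astype(int)
except ValueError as err:
    print("astype(int) failed:", err)

# Option 1: pandas' nullable integer dtype keeps the missing value as <NA>
nullable = codes.astype("Int64")

# Option 2: drop (or impute) the missing rows first, then cast
clean = codes.dropna().astype(int)
print(clean.tolist())
```

Which option fits depends on the estimator. Many scikit-learn estimators do not accept missing values at all, so dropping or imputing before the cast is often the safer route.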