How to specify a variable in pandas as ordinal/categorical?

Tags:

python

pandas

scikit-learn

categorical-data

I am trying to run a machine learning algorithm on a dataset using scikit-learn. My dataset has some categorical features. For example, one feature, A, takes the values 1, 2, 3 to indicate the quality of something (1: Upper, 2: Second, 3: Third class), so it is an ordinal variable.

Similarly, I re-coded a variable City, which has three values ('London', 'Zurich', 'New York'), into 1, 2, 3, but with no specific preference among the values. So this is a nominal categorical variable.

How do I tell the algorithm to treat these as categorical, ordinal, etc. in pandas? In R, a categorical variable is declared with factor(a) and is therefore not treated as a continuous value. Is there anything like that in pandas/Python?

asked Apr 09 '15 02:04

Baktaawar

1 Answer

... years later (and because I think a good explanation of these issues is required not only for this question but to help remind myself in the future)

Ordinal vs. Nominal

In general, one translates categorical variables into dummy variables (or uses one of a host of other methods) because they are nominal, i.e. they have no sense of a > b > c. In the OP's original question, this would only be done for the cities: London, Zurich, New York.
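
As a rough pandas analogue of R's factor(a), the distinction above can be sketched with categorical dtypes. This is only an illustration: the DataFrame, the column values, and the choice to order the labels Third < Second < Upper are assumptions, not part of the original question. Declaring the dtype documents the intent, but scikit-learn estimators generally still expect numeric input.

```python
import pandas as pd

# Hypothetical data shaped like the OP's description
df = pd.DataFrame({
    "City": ["London", "Zurich", "New York", "London"],
    "A": [1, 3, 2, 1],
})

# Nominal: an unordered categorical, no notion of a > b > c
df["City"] = df["City"].astype("category")

# Ordinal: an ordered categorical, listed from lowest to highest quality
quality = pd.CategoricalDtype(
    categories=["Third", "Second", "Upper"], ordered=True)
df["A"] = df["A"].map({1: "Upper", 2: "Second", 3: "Third"}).astype(quality)

print(df["City"].cat.ordered)  # False
print(df["A"].cat.ordered)     # True
print(df["A"].min())           # Third
print((df["A"] > "Second").tolist())  # [True, False, False, True]
```

Comparisons such as `>` and reductions such as `min()` only work on the ordered column; on the unordered City column they raise a TypeError, which is a useful guard against treating a nominal variable as if it had an order.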

Dummy Variables for Nominal

For this type of issue, pandas provides, by far, the easiest transformation: pandas.get_dummies. So:

# create a sample of the OP's unique values
import numpy
import pandas

series = pandas.Series(
    numpy.random.randint(low=0, high=3, size=100))
mapper = {0: 'New York', 1: 'London', 2: 'Zurich'}
nomvar = series.replace(mapper)

# now let's use pandas.get_dummies
# (dtype=int gives 0/1 columns; recent pandas versions default to booleans)
print(pandas.get_dummies(nomvar, dtype=int))

Out[57]:
    London  New York  Zurich
0        0         0       1
1        0         1       0
2        0         1       0
3        1         0       0
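
When the nominal variable sits in a DataFrame next to other features, a minimal sketch of the same idea is to pass the `columns` argument so that only City is expanded and the ordinal column is left alone. The DataFrame below is made up for illustration.

```python
import pandas as pd

# Hypothetical frame: City is nominal, A is the ordinal quality code
df = pd.DataFrame({
    "City": ["London", "Zurich", "New York", "London"],
    "A": [1, 3, 2, 1],
})

# Expand only the nominal column; A passes through unchanged
encoded = pd.get_dummies(df, columns=["City"], dtype=int)

print(encoded.columns.tolist())
# ['A', 'City_London', 'City_New York', 'City_Zurich']
print(encoded["City_London"].tolist())
# [1, 0, 0, 1]
```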

Ordinal Encoding for Categorical Variables

However, in the case of ordinal variables, be cautious with pandas.factorize: it assigns integer codes in the order in which values first appear, whereas you want the mapping to preserve the relationship a > b > c.

So if I want to take a categorical variable where large > medium > small and preserve that order, I need to check whether pandas.factorize preserves that relationship.

# leveraging the variables already created above
mapper = {0: 'small', 1: 'medium', 2: 'large'}
ordvar = series.replace(mapper)

print(pandas.factorize(ordvar))

Out[58]:
(array([0, 1, 1, 2, 1,...  0, 0]),
Index(['large', 'small', 'medium'], dtype='object'))
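
To make the behaviour above reproducible without random data, here is a small sketch on a fixed, made-up series. It shows that the codes follow first appearance, and that `sort=True` only switches to lexical order, which is still not the semantic order small < medium < large.

```python
import pandas as pd

# Fixed sample so the result does not depend on a random draw
sizes = pd.Series(["large", "small", "medium", "small", "large"])

# Default: codes are handed out in order of first appearance
codes, uniques = pd.factorize(sizes)
print(codes.tolist())   # [0, 1, 2, 1, 0]
print(list(uniques))    # ['large', 'small', 'medium']

# sort=True orders the labels lexically, not by meaning
sorted_codes, sorted_uniques = pd.factorize(sizes, sort=True)
print(sorted_codes.tolist())   # [0, 2, 1, 2, 0]
print(list(sorted_uniques))    # ['large', 'medium', 'small']
```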

In fact, pandas.factorize has lost the relationship that must be preserved for the variable to stay ordinal: the codes above are large=0, small=1, medium=2. In a case like this, I use my own mapping to ensure that the ordinal attributes are preserved.

preserved_mapper = {'large': 2, 'medium': 1, 'small': 0}
# map each label to its code explicitly
print(ordvar.map(preserved_mapper))

Out[78]:
0     2
1     0
...
99    2
dtype: int64

Creating your own dict to map the values not only preserves the desired ordinal relationship, it also keeps the contents and mappings of your prediction algorithm organized: you lose no ordinal information in the process, and you have a stored record of what each mapping for each variable is.
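
One way to use the dict as that stored record is to derive the inverse mapping from it, so encoded values can be turned back into labels later. This is a sketch under assumed names (`size_to_code`, `code_to_size`) and made-up data, not something from the original answer.

```python
import pandas as pd

# The hand-written mapping is the single source of truth
size_to_code = {"small": 0, "medium": 1, "large": 2}
# Inverse mapping, derived rather than typed out a second time
code_to_size = {code: label for label, code in size_to_code.items()}

ordvar = pd.Series(["large", "small", "medium", "large"])

encoded = ordvar.map(size_to_code)   # labels -> ordered ints
decoded = encoded.map(code_to_size)  # ints -> labels again

# Labels missing from the dict would become NaN under map, so check for them
unmapped = ordvar[~ordvar.isin(list(size_to_code))]

print(encoded.tolist())  # [2, 0, 1, 2]
print(decoded.tolist())  # ['large', 'small', 'medium', 'large']
print(len(unmapped))     # 0
```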

`int`s into `sklearn`

Lastly, the OP spoke about passing the information into scikit-learn classifiers, which means that ints are required. For that case, make sure you're aware of the astype(int) gotcha (the original answer links to a write-up of it) if you have any NaNs in your data: NaN has no integer representation, so missing values have to be handled before the cast.
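
A minimal sketch of that gotcha, on a made-up float series with one missing value: the plain cast fails, and two possible workarounds are filling with a sentinel or using pandas' nullable "Int64" dtype. The sentinel -1 is an arbitrary assumption, and whether a given estimator accepts the nullable dtype is something to check for your scikit-learn version.

```python
import numpy as np
import pandas as pd

# Encoded ordinal values with one missing entry
s = pd.Series([2.0, np.nan, 0.0])

# A plain cast fails because NaN cannot be stored as an int
cast_failed = False
try:
    s.astype(int)
except ValueError as exc:
    cast_failed = True
    print(f"astype(int) failed: {exc}")

# Option 1: replace missing values with a sentinel, then cast
filled = s.fillna(-1).astype(int)
print(filled.tolist())  # [2, -1, 0]

# Option 2: nullable integer dtype keeps the gap as <NA>
nullable = s.astype("Int64")
print(nullable.isna().tolist())  # [False, True, False]
```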

answered Sep 20 '22 18:09

benjaminmgross
