
Pandas get_dummies() for numeric categorical data

I have 2 columns:

  • Sex (with categorical values of type string as 'male' and 'female')
  • Class (with categorical values of type integer as 1 to 10)

When I call pd.get_dummies() on these 2 columns, only 'Sex' is encoded, into 2 dummy columns. 'Class' is left unchanged by get_dummies.

I want 'Class' to be converted into 10 dummy columns as well, as in one-hot encoding.

Is this expected behavior? Is there a workaround?

Supratim Haldar asked Feb 07 '19 08:02



2 Answers

You can convert values to strings:

df1 = pd.get_dummies(df.astype(str))
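
If the frame also holds genuinely numeric columns, casting everything to str would turn those into dummies too. A minimal sketch of casting only the categorical column, assuming made-up sample values and a hypothetical extra 'Age' column that is not in the question:

```python
import pandas as pd

# Hypothetical sample data; 'Age' is an assumed extra numeric column.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Class": [1, 3, 2, 3],
    "Age": [22, 38, 26, 35],
})

# Cast only 'Class' to str so get_dummies treats it as categorical,
# while 'Age' keeps its integer dtype and is passed through untouched.
encoded = pd.get_dummies(df.astype({"Class": str}))

print(encoded.columns.tolist())
```

Here 'Age' should survive as a single numeric column, while 'Sex' and 'Class' are each expanded into one column per distinct value.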
jezrael answered Oct 15 '22 20:10


If you don't want to convert your data, you can use the 'columns' argument of get_dummies. Here is a quick walkthrough.

Here is the data frame reproduced per your description:

import pandas as pd

# Alternate 'male'/'female' over ten rows; class takes the integers 0 to 9
sex_labels = ['male', 'female']
sex_col = [sex_labels[i % 2] for i in range(10)]
class_col = list(range(10))
df = pd.DataFrame({'sex': sex_col, 'class': class_col})
df['sex'] = pd.Categorical(df['sex'])

The dtypes are:

print(df.dtypes)
sex      category
class       int64
dtype: object

Apply get_dummies:

df = pd.get_dummies(df, columns=['sex', 'class'])

Verify:

print(df.columns)

Output:

Index(['sex_female', 'sex_male', 'class_0',
'class_1','class_2','class_3','class_4','class_5',
'class_6','class_7','class_8','class_9'],dtype='object')

Per the docs at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html:

If columns is None then all the columns with object or category dtype will be converted.

Your 'Class' column has an integer dtype, so without the 'columns' argument it is skipped. That is why you only see dummies for the sex column and not for class.
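
A short sketch of that dtype rule, using made-up three-row data: the default call leaves the integer column alone, and casting it to category (one more option besides passing 'columns') makes get_dummies pick it up.

```python
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"sex": ["male", "female", "male"], "class": [1, 2, 3]})

# Default call: 'class' is int64, so it is passed through unencoded.
default = pd.get_dummies(df)

# After casting to category, the same default call encodes it.
df["class"] = df["class"].astype("category")
as_category = pd.get_dummies(df)

print(default.columns.tolist())
print(as_category.columns.tolist())
```

The first list should still contain a plain 'class' column, and the second should contain 'class_1', 'class_2' and 'class_3' instead.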

Hope this helps. Happy learning!

Note: Tested with pandas version '0.25.2'

Sid answered Oct 15 '22 22:10