
Pandas - Convert a categorical column to binary encoded form

Tags: python, pandas

I have a dataset that looks like this:

     yyyy      month        tmax         tmin
0    1908    January         5.0         -1.4
1    1908   February         7.3          1.9
2    1908      March         6.2          0.3
3    1908      April         7.4          2.1
4    1908        May        16.5          7.7
5    1908       June        17.7          8.7
6    1908       July        20.1         11.0
7    1908     August        17.5          9.7
8    1908  September        16.3          8.4
9    1908    October        14.6          8.0
10   1908   November         9.6          3.4
11   1908   December         5.8         -0.3
12   1909    January         5.0          0.1
13   1909   February         5.5         -0.3
14   1909      March         5.6         -0.3
15   1909      April        12.2          3.3
16   1909        May        14.7          4.8
17   1909       June        15.0          7.5
18   1909       July        17.3         10.8
19   1909     August        18.8         10.7
20   1909  September        14.5          8.1
21   1909    October        12.9          6.9
22   1909   November         7.5          1.7
23   1909   December         5.3          0.4
24   1910    January         5.2         -0.5
...

It has four variables: yyyy, month, tmax (maximum temperature) and tmin (minimum temperature).

I want to use the month column as a predictor variable, so I want to convert it to binary encoded (one-hot) form. Essentially, I want to add twelve columns to the dataset, named January through December. If a row has month "January", the January column should be 1 and the other 11 new columns should be 0.

I looked into pivot tables but that doesn't help my cause. Any ideas on how to do this in a simple elegant way?

asked Jul 31 '17 12:07 by Clock Slave



2 Answers

I think you need get_dummies:

df1 = pd.get_dummies(df['month'])
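
As a minimal, self-contained sketch of that call: the four-row frame below is a hypothetical sample that only mirrors the question's layout, and `dtype=int` is an added assumption so the result holds 0/1 integers (recent pandas versions return booleans by default).

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns (not the full dataset).
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908, 1909],
    "month": ["January", "February", "March", "January"],
    "tmax": [5.0, 7.3, 6.2, 5.0],
    "tmin": [-1.4, 1.9, 0.3, 0.1],
})

# One indicator column per distinct month; columns come out in sorted order.
# dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(df["month"], dtype=int)
print(dummies)
```

Each row of `dummies` contains exactly one 1, in the column named after that row's month.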

If you need to add the new columns to the original DataFrame and remove month, use join with pop:

df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
   yyyy  tmax  tmin  April  August  December  February  January  July  June  \
0  1908   5.0  -1.4      0       0         0         0        1     0     0   
1  1908   7.3   1.9      0       0         0         1        0     0     0   
2  1908   6.2   0.3      0       0         0         0        0     0     0   
3  1908   7.4   2.1      1       0         0         0        0     0     0   
4  1908  16.5   7.7      0       0         0         0        0     0     0   

   March  May  November  October  September  
0      0    0         0        0          0  
1      0    0         0        0          0  
2      1    0         0        0          0  
3      0    0         0        0          0  
4      0    1         0        0          0  
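
One thing worth keeping in mind is that pop removes the column from the frame it is called on, in place. The sketch below, on a hypothetical four-row sample, works on a copy so the original frame keeps its month column; the names `work` and `encoded` are illustrative only.

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908, 1909],
    "month": ["January", "February", "March", "January"],
    "tmax": [5.0, 7.3, 6.2, 5.0],
    "tmin": [-1.4, 1.9, 0.3, 0.1],
})

# pop() mutates its frame, so operate on a copy to leave df untouched.
work = df.copy()

# pop() returns the month Series and drops it from work; join() then
# attaches the indicator columns by index.
encoded = work.join(pd.get_dummies(work.pop("month"), dtype=int))
print(encoded)
```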

If you do not need to remove the month column:

df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
   yyyy     month  tmax  tmin  April  August  December  February  January  \
0  1908   January   5.0  -1.4      0       0         0         0        1   
1  1908  February   7.3   1.9      0       0         0         1        0   
2  1908     March   6.2   0.3      0       0         0         0        0   
3  1908     April   7.4   2.1      1       0         0         0        0   
4  1908       May  16.5   7.7      0       0         0         0        0   

   July  June  March  May  November  October  September  
0     0     0      0    0         0        0          0  
1     0     0      0    0         0        0          0  
2     0     0      1    0         0        0          0  
3     0     0      0    0         0        0          0  
4     0     0      0    1         0        0          0  

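If you need the columns sorted in calendar order, there are several options. Use reindex, either with axis=1 or with the columns keyword (older pandas versions also had reindex_axis, which was deprecated in 0.21 and removed in 1.0):

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df1 = pd.get_dummies(df['month']).reindex(months, axis=1)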
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  

df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  
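
A side effect of reindex is that any month missing from the data becomes a column of NaN. The sketch below, assuming a hypothetical sample that contains only two distinct months, passes `fill_value=0` so that absent months show up as all-zero columns instead.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical sample in which most months never occur.
df = pd.DataFrame({"month": ["March", "January", "March"]})

# reindex puts the columns in calendar order; fill_value=0 fills the
# columns for months that are absent from the data.
dummies = pd.get_dummies(df["month"], dtype=int).reindex(columns=months, fill_value=0)
print(dummies.shape)
```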

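Or convert the month column to an ordered categorical. Current pandas versions no longer accept categories and ordered as keyword arguments to astype, so pass a CategoricalDtype instead:

df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))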
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  
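
With a categorical column, get_dummies takes both the column order and the full set of columns from the declared categories, so no separate reindex step is needed. A small sketch on a hypothetical three-row sample; `month_type` is an illustrative name.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# The dtype carries both the allowed values and their order.
month_type = CategoricalDtype(categories=months, ordered=True)

# Hypothetical sample in which most months never occur.
df = pd.DataFrame({"month": ["March", "January", "March"]})

# Every declared category gets a column, observed or not.
dummies = pd.get_dummies(df["month"].astype(month_type), dtype=int)
print(dummies.shape)
```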

answered Nov 14 '22 21:11 by jezrael


IIUC,

You can use assign, ** unpacking operator, and pd.get_dummies:

df.assign(**pd.get_dummies(df['month']))

Output:

    yyyy      month  tmax  tmin  April  August  December  February  January  \
0   1908    January   5.0  -1.4      0       0         0         0        1   
1   1908   February   7.3   1.9      0       0         0         1        0   
2   1908      March   6.2   0.3      0       0         0         0        0   
3   1908      April   7.4   2.1      1       0         0         0        0   
4   1908        May  16.5   7.7      0       0         0         0        0   
5   1908       June  17.7   8.7      0       0         0         0        0   
6   1908       July  20.1  11.0      0       0         0         0        0   
7   1908     August  17.5   9.7      0       1         0         0        0   
8   1908  September  16.3   8.4      0       0         0         0        0   
9   1908    October  14.6   8.0      0       0         0         0        0   
10  1908   November   9.6   3.4      0       0         0         0        0   
11  1908   December   5.8  -0.3      0       0         1         0        0   
12  1909    January   5.0   0.1      0       0         0         0        1   
13  1909   February   5.5  -0.3      0       0         0         1        0   
14  1909      March   5.6  -0.3      0       0         0         0        0   
15  1909      April  12.2   3.3      1       0         0         0        0   
16  1909        May  14.7   4.8      0       0         0         0        0   
17  1909       June  15.0   7.5      0       0         0         0        0   
18  1909       July  17.3  10.8      0       0         0         0        0   
19  1909     August  18.8  10.7      0       1         0         0        0   
20  1909  September  14.5   8.1      0       0         0         0        0   
21  1909    October  12.9   6.9      0       0         0         0        0   
22  1909   November   7.5   1.7      0       0         0         0        0   
23  1909   December   5.3   0.4      0       0         1         0        0   
24  1910    January   5.2  -0.5      0       0         0         0        1   

    July  June  March  May  November  October  September  
0      0     0      0    0         0        0          0  
1      0     0      0    0         0        0          0  
2      0     0      1    0         0        0          0  
3      0     0      0    0         0        0          0  
4      0     0      0    1         0        0          0  
5      0     1      0    0         0        0          0  
6      1     0      0    0         0        0          0  
7      0     0      0    0         0        0          0  
8      0     0      0    0         0        0          1  
9      0     0      0    0         0        1          0  
10     0     0      0    0         1        0          0  
11     0     0      0    0         0        0          0  
12     0     0      0    0         0        0          0  
13     0     0      0    0         0        0          0  
14     0     0      1    0         0        0          0  
15     0     0      0    0         0        0          0  
16     0     0      0    1         0        0          0  
17     0     1      0    0         0        0          0  
18     1     0      0    0         0        0          0  
19     0     0      0    0         0        0          0  
20     0     0      0    0         0        0          1  
21     0     0      0    0         0        1          0  
22     0     0      0    0         1        0          0  
23     0     0      0    0         0        0          0  
24     0     0      0    0         0        0          0 
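
A short sketch of the same approach on a hypothetical four-row sample. It relies on `**` unpacking, so it assumes the indicator column names are strings (true for month names); `dtype=int` is an added assumption to get 0/1 integers on recent pandas versions.

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908, 1909],
    "month": ["January", "February", "March", "January"],
    "tmax": [5.0, 7.3, 6.2, 5.0],
    "tmin": [-1.4, 1.9, 0.3, 0.1],
})

# ** unpacks the indicator frame into keyword arguments, one per column.
# assign returns a new frame and leaves df itself unchanged.
out = df.assign(**pd.get_dummies(df["month"], dtype=int))
print(out)
```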

answered Nov 14 '22 22:11 by Scott Boston