Pandas - Convert a categorical column to binary encoded form

Tags: python, pandas

I have a dataset that looks like this:

     yyyy      month        tmax         tmin
0    1908    January         5.0         -1.4
1    1908   February         7.3          1.9
2    1908      March         6.2          0.3
3    1908      April         7.4          2.1
4    1908        May        16.5          7.7
5    1908       June        17.7          8.7
6    1908       July        20.1         11.0
7    1908     August        17.5          9.7
8    1908  September        16.3          8.4
9    1908    October        14.6          8.0
10   1908   November         9.6          3.4
11   1908   December         5.8         -0.3
12   1909    January         5.0          0.1
13   1909   February         5.5         -0.3
14   1909      March         5.6         -0.3
15   1909      April        12.2          3.3
16   1909        May        14.7          4.8
17   1909       June        15.0          7.5
18   1909       July        17.3         10.8
19   1909     August        18.8         10.7
20   1909  September        14.5          8.1
21   1909    October        12.9          6.9
22   1909   November         7.5          1.7
23   1909   December         5.3          0.4
24   1910    January         5.2         -0.5
...

It has four variables: yyyy, month, tmax (maximum temperature) and tmin (minimum temperature).

I want to use the month column as a feature when making predictions, so I want to convert it to a binary encoded form. Essentially, I want to add twelve columns to the dataset, named January through December. If a row has month equal to "January", the January column should be 1 and the other 11 new columns should be 0.

I looked into pivot tables, but they don't do what I need. Any ideas on how to do this in a simple, elegant way?

450

asked Jul 31 '17 12:07

Clock Slave

2 Answers

I think you need get_dummies:

import pandas as pd
dummies = pd.get_dummies(df['month'])  # separate name, so df keeps its month column for the steps below
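
As a minimal sketch of what get_dummies returns, the snippet below rebuilds only the first three rows of the question's data (the small frame is a stand-in, not the full dataset). It passes dtype=int on the assumption that 0/1 integers are wanted, since pandas 2.0 and later return boolean columns by default.

```python
import pandas as pd

# Stand-in for the question's data: first three rows only.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
    "tmin": [-1.4, 1.9, 0.3],
})

# One indicator column per distinct month; dtype=int asks for 0/1 values.
dummies = pd.get_dummies(df["month"], dtype=int)

# Columns come out sorted alphabetically, not in calendar order.
print(dummies)
```

Each row has exactly one indicator set, and the column order follows the sorted labels, which is why the outputs in this answer start with April.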

To add the new columns to the original DataFrame and remove month, use join with pop:

df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
   yyyy  tmax  tmin  April  August  December  February  January  July  June  \
0  1908   5.0  -1.4      0       0         0         0        1     0     0   
1  1908   7.3   1.9      0       0         0         1        0     0     0   
2  1908   6.2   0.3      0       0         0         0        0     0     0   
3  1908   7.4   2.1      1       0         0         0        0     0     0   
4  1908  16.5   7.7      0       0         0         0        0     0     0   

   March  May  November  October  September  
0      0    0         0        0          0  
1      0    0         0        0          0  
2      1    0         0        0          0  
3      0    0         0        0          0  
4      0    1         0        0          0
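
One thing worth knowing about the pop variant is that pop removes the column from the DataFrame it is called on, in place. The sketch below uses a two-row stand-in frame and a hypothetical copy named work to show how the original could be kept intact if that matters.

```python
import pandas as pd

# Two-row stand-in for the question's data.
df = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmax": [5.0, 7.3],
})

# pop mutates the frame it is called on, so operate on a copy.
work = df.copy()
encoded = work.join(pd.get_dummies(work.pop("month"), dtype=int))

print(encoded)
print("month" in df.columns)    # the original still has the column
print("month" in work.columns)  # the copy has lost it
```

If mutating df is acceptable, the copy can be skipped and the one-liner from the answer used directly.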

If you want to keep the month column, select it instead of popping it:

df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
   yyyy     month  tmax  tmin  April  August  December  February  January  \
0  1908   January   5.0  -1.4      0       0         0         0        1   
1  1908  February   7.3   1.9      0       0         0         1        0   
2  1908     March   6.2   0.3      0       0         0         0        0   
3  1908     April   7.4   2.1      1       0         0         0        0   
4  1908       May  16.5   7.7      0       0         0         0        0   

   July  June  March  May  November  October  September  
0     0     0      0    0         0        0          0  
1     0     0      0    0         0        0          0  
2     0     0      1    0         0        0          0  
3     0     0      0    0         0        0          0  
4     0     0      0    1         0        0          0

If you need the columns in calendar order, there are several possible solutions. Use reindex (reindex_axis did the same in older pandas, but it was deprecated in 0.21 and removed in 1.0):

months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']

df1 = pd.get_dummies(df['month']).reindex_axis(months, 1)  # pandas < 1.0 only
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0  

df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0
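
If the data does not contain every month, reindex creates the missing columns filled with NaN. A possible way around that, sketched here with a made-up three-value Series named partial, is to pass fill_value=0 so the absent months become all-zero columns.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical input that only contains two distinct months.
partial = pd.Series(["March", "January", "March"], name="month")

# fill_value=0 gives all-zero columns for months absent from the data.
dummies = pd.get_dummies(partial, dtype=int).reindex(columns=months, fill_value=0)

print(dummies.shape)
print(dummies[["January", "March", "December"]])
```

This keeps the same twelve columns in the same order across training and prediction data, even when a subset is missing some months.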

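Or convert the month column to an ordered categorical:

# recent pandas no longer accepts categories/ordered as astype keywords; pass a CategoricalDtype instead
df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))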
print (df1.head())
   January  February  March  April  May  June  July  August  September  \
0        1         0      0      0    0     0     0       0          0   
1        0         1      0      0    0     0     0       0          0   
2        0         0      1      0    0     0     0       0          0   
3        0         0      0      1    0     0     0       0          0   
4        0         0      0      0    1     0     0       0          0   

   October  November  December  
0        0         0         0  
1        0         0         0  
2        0         0         0  
3        0         0         0  
4        0         0         0
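
A side effect of the categorical approach is that get_dummies emits a column for every declared category, including ones that never occur in the data. The sketch below uses a hypothetical two-value Series and the name month_type for the dtype to illustrate this.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Ordered categorical dtype listing all twelve months in calendar order.
month_type = pd.CategoricalDtype(categories=months, ordered=True)

# Hypothetical input containing only two of the twelve months.
s = pd.Series(["March", "January"], name="month").astype(month_type)

# Columns follow the category order and cover unobserved months too.
dummies = pd.get_dummies(s, dtype=int)

print(list(dummies.columns))
```

So with a categorical column there is no need for a separate reindex step to fix the order or to add missing months.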

176

answered Nov 14 '22 21:11

jezrael

IIUC,

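You can use assign, the ** unpacking operator, and pd.get_dummies. The unpacking works because the dummy column names are strings, so each one can be passed as a keyword argument: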

df.assign(**pd.get_dummies(df['month']))

Output:

    yyyy      month  tmax  tmin  April  August  December  February  January  \
0   1908    January   5.0  -1.4      0       0         0         0        1   
1   1908   February   7.3   1.9      0       0         0         1        0   
2   1908      March   6.2   0.3      0       0         0         0        0   
3   1908      April   7.4   2.1      1       0         0         0        0   
4   1908        May  16.5   7.7      0       0         0         0        0   
5   1908       June  17.7   8.7      0       0         0         0        0   
6   1908       July  20.1  11.0      0       0         0         0        0   
7   1908     August  17.5   9.7      0       1         0         0        0   
8   1908  September  16.3   8.4      0       0         0         0        0   
9   1908    October  14.6   8.0      0       0         0         0        0   
10  1908   November   9.6   3.4      0       0         0         0        0   
11  1908   December   5.8  -0.3      0       0         1         0        0   
12  1909    January   5.0   0.1      0       0         0         0        1   
13  1909   February   5.5  -0.3      0       0         0         1        0   
14  1909      March   5.6  -0.3      0       0         0         0        0   
15  1909      April  12.2   3.3      1       0         0         0        0   
16  1909        May  14.7   4.8      0       0         0         0        0   
17  1909       June  15.0   7.5      0       0         0         0        0   
18  1909       July  17.3  10.8      0       0         0         0        0   
19  1909     August  18.8  10.7      0       1         0         0        0   
20  1909  September  14.5   8.1      0       0         0         0        0   
21  1909    October  12.9   6.9      0       0         0         0        0   
22  1909   November   7.5   1.7      0       0         0         0        0   
23  1909   December   5.3   0.4      0       0         1         0        0   
24  1910    January   5.2  -0.5      0       0         0         0        1   

    July  June  March  May  November  October  September  
0      0     0      0    0         0        0          0  
1      0     0      0    0         0        0          0  
2      0     0      1    0         0        0          0  
3      0     0      0    0         0        0          0  
4      0     0      0    1         0        0          0  
5      0     1      0    0         0        0          0  
6      1     0      0    0         0        0          0  
7      0     0      0    0         0        0          0  
8      0     0      0    0         0        0          1  
9      0     0      0    0         0        1          0  
10     0     0      0    0         1        0          0  
11     0     0      0    0         0        0          0  
12     0     0      0    0         0        0          0  
13     0     0      0    0         0        0          0  
14     0     0      1    0         0        0          0  
15     0     0      0    0         0        0          0  
16     0     0      0    1         0        0          0  
17     0     1      0    0         0        0          0  
18     1     0      0    0         0        0          0  
19     0     0      0    0         0        0          0  
20     0     0      0    0         0        0          1  
21     0     0      0    0         0        1          0  
22     0     0      0    0         1        0          0  
23     0     0      0    0         0        0          0  
24     0     0      0    0         0        0          0
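
As an alternative sketch, not part of either answer above, get_dummies can also take the whole DataFrame together with a columns argument. In that form it drops the encoded column and, by default, prefixes the new columns with the source column name, so the names differ from the plain month names the question asks for. The two-row frame is a stand-in.

```python
import pandas as pd

# Two-row stand-in for the question's data.
df = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmax": [5.0, 7.3],
})

# Encode only the listed columns; the others pass through unchanged.
encoded = pd.get_dummies(df, columns=["month"], dtype=int)

print(list(encoded.columns))
```

The month_ prefix can be useful when several categorical columns are encoded at once, since it keeps the generated names from colliding.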

answered Nov 14 '22 22:11

Scott Boston
