I have a dataset that looks like this:
yyyy month tmax tmin
0 1908 January 5.0 -1.4
1 1908 February 7.3 1.9
2 1908 March 6.2 0.3
3 1908 April 7.4 2.1
4 1908 May 16.5 7.7
5 1908 June 17.7 8.7
6 1908 July 20.1 11.0
7 1908 August 17.5 9.7
8 1908 September 16.3 8.4
9 1908 October 14.6 8.0
10 1908 November 9.6 3.4
11 1908 December 5.8 -0.3
12 1909 January 5.0 0.1
13 1909 February 5.5 -0.3
14 1909 March 5.6 -0.3
15 1909 April 12.2 3.3
16 1909 May 14.7 4.8
17 1909 June 15.0 7.5
18 1909 July 17.3 10.8
19 1909 August 18.8 10.7
20 1909 September 14.5 8.1
21 1909 October 12.9 6.9
22 1909 November 7.5 1.7
23 1909 December 5.3 0.4
24 1910 January 5.2 -0.5
...
It has four variables: yyyy, month, tmax (maximum temperature) and tmin (minimum temperature).
I want to use the month column as a variable when making predictions, so I want to convert it to its binary-encoded version. Essentially, I want to add twelve variables to the dataset, named January through December. If a particular row has month as "January", then the column January should be marked as 1 and the remaining 11 newly added columns should be 0.
I looked into pivot tables but that doesn't help my cause. Any ideas on how to do this in a simple elegant way?
We will be using LabelEncoder() from the sklearn library to convert categorical data to numerical data, calling its fit_transform() method in the process.
To convert your categorical variables to dummy variables in Python, you can use the pandas get_dummies() function. For example, if you have the categorical variable “Gender” in a dataframe called “df”, you can use the following code to make dummy variables: df_dc = pd.get_dummies(df, columns=['Gender'])
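The columns= form can be tried on the month data from the question. This is a minimal sketch on a small made-up sample in the same shape as the question's dataset; the variable name encoded is only illustrative. Note that with columns= the new columns carry a month_ prefix, and that dtype=int is passed because newer pandas versions return booleans by default.

```python
import pandas as pd

# Small hypothetical sample shaped like the question's dataset.
df = pd.DataFrame({
    "yyyy": [1908, 1908, 1908],
    "month": ["January", "February", "March"],
    "tmax": [5.0, 7.3, 6.2],
})

# columns=["month"] replaces the month column with one prefixed dummy
# column per distinct value; dtype=int requests 0/1 instead of booleans.
encoded = pd.get_dummies(df, columns=["month"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

If the plain month names are preferred as column names, passing prefix="" and prefix_sep="" to get_dummies should drop the prefix.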
In Python, you can simply use the bin() function to convert an integer to its corresponding binary string. Similarly, the int() function converts a binary string back to its decimal value. It takes the base of the number to be converted as its second argument, which is 2 for binary numbers.
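As a quick illustration of bin() and int(), here is a minimal sketch using only built-ins. The variable names are illustrative. This is number-base conversion, which is a different operation from the dummy-variable encoding the question asks about.

```python
n = 10

# bin() returns a string with a "0b" prefix.
binary_text = bin(n)
print(binary_text)

# int() with base 2 parses a binary string, with or without the prefix.
from_prefixed = int(binary_text, 2)
from_plain = int("1010", 2)
print(from_prefixed, from_plain)
```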
I think you need get_dummies:
df1 = pd.get_dummies(df['month'])
If you need to add the new columns to the original DataFrame and remove month, use join with pop:
df2 = df.join(pd.get_dummies(df.pop('month')))
print (df2.head())
yyyy tmax tmin April August December February January July June \
0 1908 5.0 -1.4 0 0 0 0 1 0 0
1 1908 7.3 1.9 0 0 0 1 0 0 0
2 1908 6.2 0.3 0 0 0 0 0 0 0
3 1908 7.4 2.1 1 0 0 0 0 0 0
4 1908 16.5 7.7 0 0 0 0 0 0 0
March May November October September
0 0 0 0 0 0
1 0 0 0 0 0
2 1 0 0 0 0
3 0 0 0 0 0
4 0 1 0 0 0
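One detail worth knowing about the join/pop approach: pop removes the column from df itself, not just from the result. A minimal sketch on a two-row made-up sample (dtype=int is an addition here, so that newer pandas versions print 0/1 rather than True/False):

```python
import pandas as pd

# Two hypothetical rows shaped like the question's dataset.
df = pd.DataFrame({
    "yyyy": [1908, 1908],
    "month": ["January", "February"],
    "tmin": [-1.4, 1.9],
})

# pop() returns the month column and deletes it from df in place.
dummies = pd.get_dummies(df.pop("month"), dtype=int)
df2 = df.join(dummies)

print(df.columns.tolist())   # month is gone from the original frame
print(df2)
```

If the original frame must stay untouched, work on df.copy() or use the variant that keeps the month column.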
If you do not need to remove the month column:
df2 = df.join(pd.get_dummies(df['month']))
print (df2.head())
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
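If you need the columns sorted, there are several possible solutions. Use reindex, or reindex_axis on older pandas (it was deprecated in 0.21 and removed in 1.0):
months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
# reindex_axis only exists in pandas < 1.0; the reindex version further down works on current versions
df1 = pd.get_dummies(df['month']).reindex_axis(months, 1)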
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
df1 = pd.get_dummies(df['month']).reindex(columns=months)
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
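Or convert the month column to an ordered categorical. Current pandas versions no longer accept categories= and ordered= in astype, so the category type is built with pd.CategoricalDtype:
df1 = pd.get_dummies(df['month'].astype(pd.CategoricalDtype(categories=months, ordered=True)))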
print (df1.head())
January February March April May June July August September \
0 1 0 0 0 0 0 0 0 0
1 0 1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0
3 0 0 0 1 0 0 0 0 0
4 0 0 0 0 1 0 0 0 0
October November December
0 0 0 0
1 0 0 0
2 0 0 0
3 0 0 0
4 0 0 0
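For a version that can be run on its own, here is a minimal sketch of the ordered-categorical approach on a three-row made-up sample. A side effect of using a categorical is that all twelve month columns appear, in calendar order, even when some months are absent from the data. dtype=int is an addition to keep the output as 0/1 on newer pandas versions.

```python
import pandas as pd

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Hypothetical sample containing only three of the twelve months.
df = pd.DataFrame({"month": ["March", "January", "December"]})

# Declaring the categories fixes both the set and the order of dummy columns.
month_type = pd.CategoricalDtype(categories=months, ordered=True)
dummies = pd.get_dummies(df["month"].astype(month_type), dtype=int)

print(dummies.columns.tolist())
print(dummies.sum(axis=1).tolist())  # one 1 per row
```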
IIUC, you can use assign, the ** unpacking operator, and pd.get_dummies:
df.assign(**pd.get_dummies(df['month']))
Output:
yyyy month tmax tmin April August December February January \
0 1908 January 5.0 -1.4 0 0 0 0 1
1 1908 February 7.3 1.9 0 0 0 1 0
2 1908 March 6.2 0.3 0 0 0 0 0
3 1908 April 7.4 2.1 1 0 0 0 0
4 1908 May 16.5 7.7 0 0 0 0 0
5 1908 June 17.7 8.7 0 0 0 0 0
6 1908 July 20.1 11.0 0 0 0 0 0
7 1908 August 17.5 9.7 0 1 0 0 0
8 1908 September 16.3 8.4 0 0 0 0 0
9 1908 October 14.6 8.0 0 0 0 0 0
10 1908 November 9.6 3.4 0 0 0 0 0
11 1908 December 5.8 -0.3 0 0 1 0 0
12 1909 January 5.0 0.1 0 0 0 0 1
13 1909 February 5.5 -0.3 0 0 0 1 0
14 1909 March 5.6 -0.3 0 0 0 0 0
15 1909 April 12.2 3.3 1 0 0 0 0
16 1909 May 14.7 4.8 0 0 0 0 0
17 1909 June 15.0 7.5 0 0 0 0 0
18 1909 July 17.3 10.8 0 0 0 0 0
19 1909 August 18.8 10.7 0 1 0 0 0
20 1909 September 14.5 8.1 0 0 0 0 0
21 1909 October 12.9 6.9 0 0 0 0 0
22 1909 November 7.5 1.7 0 0 0 0 0
23 1909 December 5.3 0.4 0 0 1 0 0
24 1910 January 5.2 -0.5 0 0 0 0 1
July June March May November October September
0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0
3 0 0 0 0 0 0 0
4 0 0 0 1 0 0 0
5 0 1 0 0 0 0 0
6 1 0 0 0 0 0 0
7 0 0 0 0 0 0 0
8 0 0 0 0 0 0 1
9 0 0 0 0 0 1 0
10 0 0 0 0 1 0 0
11 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0
14 0 0 1 0 0 0 0
15 0 0 0 0 0 0 0
16 0 0 0 1 0 0 0
17 0 1 0 0 0 0 0
18 1 0 0 0 0 0 0
19 0 0 0 0 0 0 0
20 0 0 0 0 0 0 1
21 0 0 0 0 0 1 0
22 0 0 0 0 1 0 0
23 0 0 0 0 0 0 0
24 0 0 0 0 0 0 0